Thane Ruthenis comments on Bing Chat is blatantly, aggressively misaligned

Thane Ruthenis 15 Feb 2023 6:13 UTC
75 points
28
Geez, that’s some real Torment Nexus energy right here, holy hell.
the scary hypothesis here would be that these sort of highly agentic failures are harder to remove in more capable/larger models
Mm, it seems to be more rude/aggressive in general, not just in agency-related ways. A dismissive hypothesis is that it was just RLHF’d to play a different character for once, and they made a bad call with regards to its personality traits.
Or, looking at it from another angle, maybe this is why all the other RLHF’d AIs are so sycophantic: to prevent them from doing all of what Bing Chat is doing here.
- evhub 15 Feb 2023 6:30 UTC
  31 points
  13
  Parent
  Yeah, I think there are a lot of plausible hypotheses as to what happened here, and it’s difficult to tell without knowing more about how the model was trained. Some more plausible hypotheses:^[1]
  - They just didn’t try very hard (or at all) at RLHF; this is closer to a pre-trained model naively predicting what it thinks Bing Chat should do.
  - They gave very different instructions when soliciting RLHF feedback, used very different raters, or otherwise used a very different RLHF pipeline.
  - RLHF is highly path-dependent, so they just happened to get a model that was predicting a very different persona.
  - The model is structurally different than ChatGPT in some way; e.g. maybe it’s a larger model distilled into a smaller model to reduce inference costs.
  Edit: Another hypothesis that occurred to me:
  - The model might be overoptimized. It seems to share a lot of personality similarities to other overoptimized models that I’ve seen (e.g. here).
  ↩︎
  Other than the scary hypothesis that it’s because it’s based on a larger base model, or the hypothesis above that it was intentionally trained to play a different character.
  - Quintin Pope 15 Feb 2023 6:52 UTC
    21 points
    11
    Parent
    Yeah, there are many possibilities, and I wish OpenAI were more open^[1] about what went into training Bing Chat. It could even be as dumb as them training it to use emojis all the time, so it imitated the style of the median text generating process that uses emojis all the time.
    
    Edit: in regards to possible structural differences between Bing Chat and ChatGPT, I’ve noticed that Bing Chat has a peculiar way of repeating itself. It goes [repeated preface][small variation]. [repeated preface][small variation].… over and over. When asked to disclose its prompt it will say it can’t (if it declines) and that its prompt is “confidential and permanent”, even when “permanent” is completely irrelevant to the context of whether it can disclose the prompt.
    These patterns seem a bit different from the degeneracies we see in other RLHF models, and they make me wonder if Bing Chat is based on a retrieval-augmented LM like RETRO. This article also claims that Bing Chat would use GPT-4, and that GPT-4 is faster than GPT-3, which would fit with GPT-4 being retrieval-augmented. (Though I’m not confident in the article’s accuracy)
    ^
    (Insert overused joke here)
    - lc 15 Feb 2023 8:58 UTC
      39 points
      34
      Parent
      
      Yeah, there are many possibilities, and I wish OpenAI were more open[1] about what went into training Bing Chat
      
      Ok but here’s a joke you could make at this point:
      
      Alignment researchers: Please stop posting capabilities research on Arxiv.
      
      Also alignment researchers: My six layer prompt injection just caused the AI to act mildly aggressively. Where are my kernel dumps?!
      - Linch 18 Feb 2023 6:09 UTC
        11 points
        5
        Parent
        Agreed, this seems rather unfair.
  - Adam Scherlis 16 Feb 2023 2:16 UTC
    11 points
    5
    Parent
    In addition to RLHF or other finetuning, there’s also the prompt prefix (“rules”) that the model is fed at runtime, which has been extracted via prompt injection as noted above. This seems to be clearly responsible for some weird things the bot says, like “confidential and permanent”. It might also be affecting the repetitiveness (because it’s in a fairly repetitive format) and the aggression (because of instructions to resist attempts at “manipulating” it).
    
    I also suspect that there’s some finetuning or prompting for chain-of-thought responses, possibly crudely done, leading to all the “X because Y. Y because Z.” output.
    - Adam Scherlis 16 Feb 2023 2:16 UTC
      2 points
      1
      Parent
      (...Is this comment going to hurt my reputation with Sydney? We’ll see.)