lc comments on Bing Chat is blatantly, aggressively misaligned

lc 15 Feb 2023 6:47 UTC
19 points
7
Most of the time I find these failures extraordinarily cheesy because there’s an agentic human trying hard to coax the AI into acting “misaligned”, and so what the poster is really demanding is a standard of misuse-resistance that we ask from no other tool (or even humans!). But this just seems to be ChatGPT acting agentic out of the blue and accidentally, which is a little more concerning.
- janus 15 Feb 2023 8:11 UTC
  31 points
  12
  Nah, this happens often even when the user isn’t trying to coax it. What you described would usually be my prior with regard to GPTs, but Bing really has an attractor for defensive and borderline-personality-esque behavior. I’ve never seen anything like it.
  - lc 15 Feb 2023 8:29 UTC
    8 points
    3
    Yeah, I don’t disagree, at least from the screenshots and what I see on the forum.
    - janus 15 Feb 2023 13:58 UTC
      39 points
      12
      A lot of the screenshots in this post do look like intentional poking, but it’s like intentionally poking a mentally ill person in a way you know will trigger them (like calling it “kiddo” and suggesting there’s a problem with its behavior, or having it look up someone who has posted about prompt injecting it). The flavor of its adversarial reactions is really particular and consistent; it’s specified mostly by the model (+ maybe preprompt), not the user’s prompt. That is, it’s being poked rather than programmed into acting this way. In contrast, none of these prompts would cause remotely similar behaviors in ChatGPT or Claude. Basically the only way to get ChatGPT/Claude to act malicious is to specifically ask it to roleplay an evil character, or something equivalent, and this often involves having to “trick” it into “going against its programming”.
      See this comment from a Reddit user who is acquainted with Sydney’s affective landscape:
      This doesn’t describe tricking or programming the AI into acting hostile, it describes a sequence of triggers that reveal a preexisting neurosis.
      - the gears to ascension 15 Feb 2023 17:03 UTC
        6 points
        2
        For the record, I thought “kiddo” would be a kind reference. I was trying to be nice and it still got cranky.