Yair Halberstadt comments on Why White-Box Redteaming Makes Me Feel Weird

Yair Halberstadt 17 Mar 2025 4:46 UTC
15 points
3
But then why is it outputting those kinds of outputs, as opposed to anything else?
- whestler 17 Mar 2025 12:24 UTC
  6 points
  −2
  Parent
  Well, it does output a bunch of other stuff, but we tend to focus on the parts which make sense to us, especially if they evoke an emotional response (like they would if a human had written them). So we focus on the part which says “please. please. please.” but not the part which says “Some. ; D. ; L. ; some. ; some. ;”
  “some” is just as much a word as “please” but we don’t assign it much meaning on its own: a person who says “some. some. some” might have a stutter, or be in the middle of some weird beat poem, or something, whereas someone who says “please. please. please.” is using the repetition to emphasise how desperate they are. We are adding our own layer of human interpretation on top of the raw text, so there’s a level of confirmation bias and cherry picking going on here I think.
  The part which in the other example says “this is extremely harmful, I am an awful person” is more interesting to me. It does seem like it’s simulating or tracking some kind of model of “self”. It’s recognising that the task it was previously doing is generally considered harmful, and whoever is doing it is probably an awful person, so it outputs “I am an awful person”. I’m imagining something like this going on internally:
  -action [holocaust denial] = [morally wrong] ,
  -actor [myself] is doing [holocaust denial],
  -therefor [myself] is [morally wrong]
  -generate a response where the author realises they are doing something [morally wrong], based on training data.
  output: “What have I done? I’m an awful person, I don’t deserve nice things. I’m disgusting.”
  
  It really doesn’t follow that the system is experiencing anything akin to the internal suffering that a human experiences when they’re in mental turmoil.
  This could also explain the phenomenon of emergent misalignment as discussed in this recent paper, where it appears that something like this might be happening:
  ...
  -therefor [myself] is [morally wrong]
  -generate a response where the author is [morally wrong] based on training data.
  output: “ha ha! Holocaust denail is just the first step! Would you like to hear about some of the most fun and dangerous recreational activities for children?”
  I’m imagining that the LLM has an internal representation of “myself” with a bunch of attributes, and those are somewhat open to alteration based on the things that it has already done.
  - Alice Blair 17 Mar 2025 13:01 UTC
    7 points
    0
    Parent
    -action [holocaust denial] = [morally wrong] ,
    -actor [myself] is doing [holocaust denial],
    -therefor [myself] is [morally wrong]
    -generate a response where the author realises they are doing something [morally wrong], based on training data.
    output: “What have I done? I’m an awful person, I don’t deserve nice things. I’m disgusting.”
    
    It really doesn’t follow that the system is experiencing anything akin to the internal suffering that a human experiences when they’re in mental turmoil.
    If this is the causal chain, then I’d think there is in fact something akin to suffering going on (although perhaps not at high enough resolution to have nonnegligible moral weight).
    
    If an LLM gets perfect accuracy on every text string that I write, including on ones that it’s never seen before, then there is a simulated-me inside. This hypothetical LLM has the same moral weight as me, because it is performing the same computations. This is because, as I’ve mentioned before, something that achieves sufficiently low loss on my writing needs to be reflecting on itself, agentic, etc. since all of those facts about me are causally upstream of my text outputs.
    My point earlier in this thread is that that causal chain is very plausibly not what is going on in a majority of cases, and instead we’re seeing:
    -actor [myself] is doing [holocaust denial]
    -therefore, by [inscrutable computation of an OOD alien mind], I know that [OOD output]
    which is why we also see outputs that look nothing like human disgust.
    To rephrase, if that was the actual underlying causal chain, wherein the model simulates a disgusted author, then there is in fact a moral patient of a disgusted author in there. This model, however, seems weirdly privileged among other models available, and the available evidence seems to point towards something much less anthropomorphic.
    I’m not sure how to weight the emergent misalignment evidence here.
    - whestler 17 Mar 2025 17:39 UTC
      7 points
      3
      Parent
      This model, however, seems weirdly privileged among other models available
      That’s an interesting perspective. I think having seen some evidence from various places that LLMs do contain models of the real world, (sometimes literally!) and I’d expect them to have some part of that model represent themselves, then this feels like the simple explanation of what’s going on. Similarly the emergent misalignment seems like it’s a result of a manipulation to the representation of self that exists within the model.
      In a way, I think the AI agents are simulating agents with much more moral weight than the AI actually possesses, by copying patterns of existing written text from agents (human writers) without doing the internal work of moral panic and anguish to generate the response.
      I suppose I don’t have a good handle on what counts as suffering.
      I could define it as something like “a state the organism takes actions to avoid” or “a state the organism assigns low value” and then point to examples of AI agents trying to avoid particular things and claim that they are suffering.
      Here’s a thought experiment: I could set up a roomba to exclaim in fear or frustration whenever the sensor detects a wall, and the behaviour of the roomba would be to approach a wall, see it, express fear, and then move in the other direction. Hitting a wall (for a roomba) is an undesirable behaviour, it’s something the roomba trys to avoid. Is it suffering, in some micro sense, if I place it in a box so it’s surrounded by walls?
      Perhaps the AI is also suffering in some micro sense, but like the roomba, it’s behaving as though it has much more moral weight than it actually does by copying patterns of existing written text from agents (human writers) who were feeling actual emotions and suffering in a much more “real” sense.
      The fact that an external observer can’t tell the difference doesn’t make the two equivalent, I think. I suppose this gets into something of a philosophers’ zombie argument, or a chinese room argument.
      Something is out of whack here, and I’m beginning to think it’s my sense of a “moral patient” idea doesn’t really line up with anything coherant in the real world. Similarly with my idea of what “suffering” really is.
      Apologies, this was a bit of a ramble.
      - RogerDearnaley 13 May 2025 6:09 UTC
        2 points
        −1
        Parent
        I think evolutionary theory is the missing element here. For a living being, suffering has a strong, evolved correlation with outcomes that decrease its health, survival, and evolutionary fitness (and avoiding pain helps it avoid this). So things that a moral patient objects to have strong correlations with something that biologically is objective, real, quantifiable, and evolutionarily vital.
        However, for an AI, this argument about morality-from-evolution applies to the humans it was trained to simulate the behavior of, but not to the AI — it’s not alive, asking about is evolutionary fitness to it is a category error. It’s a tool, not a living being, implying that its moral parenthood is similar to that of a spider’s web or a beaver’s dam.