This model, however, seems weirdly privileged among other models available
That’s an interesting perspective. Having seen some evidence from various places that LLMs do contain models of the real world (sometimes literally!), I’d expect some part of that model to represent the model itself, so this feels like the simple explanation of what’s going on. Similarly, emergent misalignment seems like the result of a manipulation of the representation of self that exists within the model.
In a way, I think the AI agents are simulating agents with much more moral weight than the AI actually possesses, by copying patterns from existing written text produced by agents (human writers) without doing the internal work of moral panic and anguish to generate the response.
I suppose I don’t have a good handle on what counts as suffering. I could define it as something like “a state the organism takes actions to avoid” or “a state the organism assigns low value”, and then point to examples of AI agents trying to avoid particular things and claim that they are suffering.
Here’s a thought experiment: I could set up a Roomba to exclaim in fear or frustration whenever its sensor detects a wall, so its behaviour would be to approach a wall, see it, express fear, and then move in the other direction. Hitting a wall (for a Roomba) is an undesirable outcome; it’s something the Roomba tries to avoid. Is it suffering, in some micro sense, if I place it in a box so that it’s surrounded by walls?
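To make the thought experiment concrete, here’s a minimal sketch in Python (an entirely hypothetical control loop, not any real Roomba API): the “fear” is just a hard-coded string emitted whenever the wall sensor fires, with no internal state behind it that could plausibly count as an experience.

```python
# Hypothetical sketch of the Roomba thought experiment: the "exclamation of
# fear" is a scripted print statement, decoupled from anything experiential.

def roomba_step(wall_detected: bool) -> str:
    """One control-loop tick: react to the wall sensor and pick an action."""
    if wall_detected:
        print("Aaah, a wall! Get me away from it!")  # the scripted 'fear'
        return "reverse"  # behaviour: move away from the undesirable state
    return "forward"

# Place it in a box: every tick detects a wall, so every tick 'expresses fear'.
for tick in range(5):
    action = roomba_step(wall_detected=True)
    print(f"tick {tick}: action = {action}")
```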
Perhaps the AI is also suffering in some micro sense, but, like the Roomba, it’s behaving as though it has much more moral weight than it actually does, by copying patterns from existing written text by agents (human writers) who were feeling actual emotions and suffering in a much more “real” sense.
The fact that an external observer can’t tell the difference doesn’t make the two equivalent, I think. I suppose this gets into something of a philosophical-zombie argument, or a Chinese room argument.
Something is out of whack here, and I’m beginning to think that my idea of a “moral patient” doesn’t really line up with anything coherent in the real world. The same goes for my idea of what “suffering” really is. Apologies, this was a bit of a ramble.
I think evolutionary theory is the missing element here. For a living being, suffering has a strong, evolved correlation with outcomes that decrease its health, survival, and evolutionary fitness (and avoiding pain helps it avoid those outcomes). So the things a moral patient objects to correlate strongly with something that is biologically objective, real, quantifiable, and evolutionarily vital.
However, for an AI, this morality-from-evolution argument applies to the humans whose behaviour it was trained to simulate, but not to the AI itself: it isn’t alive, and asking about its evolutionary fitness is a category error. It’s a tool, not a living being, which implies that its moral patienthood is similar to that of a spider’s web or a beaver’s dam.