Huh. I wonder if it’s reasonable to consider the attractor states that models end up in as reflecting their preferences (eg. GPT 5.2 likes engineering), since when the models are just talking with a copy of themselves, there’s ~full degrees of freedom about where to take the conversation. Only the model is steering, not the user.
Unfortunately, I think it’s not clear.
A model-conversation can end up in an attractor state from just local biases / heuristics that don’t necessarily reflect what the model “likes” in a subjective sense, only what it wants behaviorally. (Consider a human eating every potato chip in the bowl in front of them, even though if they pause to reflect, they’re full, and eating more makes them feel worse.)
Huh. I wonder if it’s reasonable to consider the attractor states that models end up in as reflecting their preferences (eg. GPT 5.2 likes engineering), since when the models are just talking with a copy of themselves, there’s ~full degrees of freedom about where to take the conversation. Only the model is steering, not the user.
Unfortunately, I think it’s not clear.
A model-conversation can end up in an attractor state from just local biases / heuristics that don’t necessarily reflect what the model “likes” in a subjective sense, only what it wants behaviorally. (Consider a human eating every potato chip in the bowl in front of them, even though if they pause to reflect, they’re full, and eating more makes them feel worse.)