I'm curious why you used anthropomorphic language instead of the mechanistic explanations used in the Personas paper. What do you think the anthropomorphism adds?
I’m feeling under the weather right now and don’t have the energy to respond in detail, but you may find it helpful to read the later parts of this post, where I answer a similar question that came up in another context.
See also this comment by Sean Herrington, which describes (I think?) basically the same dynamic I described in my original comment, using somewhat different terminology.
Roughly, the idea is that the model is something like a mixture distribution over “personas,” where each persona has its own distribution of token-level outputs, and the model’s output is marginalized over the personas. Finetuning does something like a Bayesian update on this distribution.
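To make the shape of that picture concrete, here is a minimal toy sketch. It is my own illustration, not anything from the Personas paper or from the comment above, and the persona names, vocabulary, and numbers are all made-up assumptions: each persona is a categorical distribution over tokens, the model's output marginalizes over personas, and finetuning reweights the personas by Bayes' rule.

```python
# Toy sketch (illustrative assumptions only): personas as token distributions,
# model output as a marginal over personas, finetuning as a Bayesian update
# on the persona weights.
import numpy as np

# Hypothetical personas, each a categorical distribution over a tiny vocabulary.
vocab = ["helpful", "snarky", "refusal"]
personas = {
    "assistant": np.array([0.8, 0.1, 0.1]),
    "troll":     np.array([0.1, 0.8, 0.1]),
    "censor":    np.array([0.1, 0.1, 0.8]),
}
prior = {"assistant": 0.6, "troll": 0.2, "censor": 0.2}  # assumed persona weights

def marginal_output(weights):
    """The model's token distribution, marginalized over personas."""
    return sum(w * personas[p] for p, w in weights.items())

def bayes_update(weights, observed_token):
    """Reweight personas after observing one finetuning token (Bayes' rule)."""
    idx = vocab.index(observed_token)
    unnorm = {p: w * personas[p][idx] for p, w in weights.items()}
    z = sum(unnorm.values())
    return {p: v / z for p, v in unnorm.items()}

print("before finetuning:", marginal_output(prior))
posterior = prior
for tok in ["snarky", "snarky", "snarky"]:  # finetuning data most likely under "troll"
    posterior = bayes_update(posterior, tok)
print("persona weights after update:", posterior)
print("after finetuning:", marginal_output(posterior))
```

The only point of the sketch is the shape of the update: the finetuning data is most likely under one persona, so that persona's weight grows, and the marginal output distribution shifts toward it.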
I think this high-level picture is plausible even though we don't yet have a detailed mechanistic understanding of how it works, and for that reason I trust the high-level picture more than any conjectured low-level implementation. (Just like I trust "AlphaGo is good at Go" more than I trust any particular mechanistic hypothesis about the way AlphaGo picks its moves. Interpretability is hard, and any given paper might turn out to be wrong or misleading or whatever, but "AlphaGo is good at Go" remains true nevertheless.)
Hey, thanks for responding despite being unwell! I hope you feel better soon.
First, I agree with your interpretation of Sean’s comment.
Second, I agree that a high-level explanation abstracted away from particular implementation details is probably safer in a difficult field. Since the Personas paper doesn't identify the mechanism by which the personas are implemented in activation space, only showing that these characteristic directions exist, we can't faithfully describe the personas mechanistically. Thanks for sharing.
It's possible that the anthropomorphic language obscures the point you're making above. I found it a bit difficult to understand originally, whereas the more technical phrasing is clearer. In the blog post you linked, you mention that anthropomorphism is a way to communicate your message more broadly, without the jargon overhead. However, to make your intended meaning clear, you still have to draw a distinction between simulacra and simulacrum, and give a fairly lengthy explanation of how the meaning of the anthropomorphism shifts across contexts. I'm not sure that's a lower barrier to entry than understanding distribution shift and Bayesianism, at least for a technical audience.
I can see how it would be useful in very clear analogical cases, like when we say a model "knows" something to mean the knowledge is represented in a feature, or "wants" something to mean it encodes a preference in a circuit.