Hey, thanks for responding despite being sick! I hope you're feeling better soon.
First, I agree with your interpretation of Sean’s comment.
Second, I agree that a high-level explanation, abstracted away from the particular implementation details, is probably safer in a difficult field. Since the Personas paper doesn't identify the mechanism by which the personas are implemented in activation space, only showing that these characteristic directions exist, we can't faithfully describe the personas mechanistically. Thanks for sharing.
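For concreteness, here's a minimal sketch of what "showing a characteristic direction exists" could look like in practice. This is not the paper's actual method, just an illustrative difference-of-means probe over hypothetical cached activations (all names and data here are stand-ins):

```python
import numpy as np

# Hypothetical setup: acts_persona and acts_baseline are cached residual-stream
# activations (n_samples x d_model) from prompts that do / don't elicit the persona.
rng = np.random.default_rng(0)
d_model = 512
acts_persona = rng.normal(0.5, 1.0, size=(200, d_model))   # stand-in data
acts_baseline = rng.normal(0.0, 1.0, size=(200, d_model))  # stand-in data

# A "characteristic direction" as a difference of means, normalized to unit length.
direction = acts_persona.mean(axis=0) - acts_baseline.mean(axis=0)
direction /= np.linalg.norm(direction)

# Projections onto this direction separate the two conditions, which shows the
# direction exists -- but says nothing about the causal mechanism behind it.
print(f"mean projection (persona):  {(acts_persona @ direction).mean():.2f}")
print(f"mean projection (baseline): {(acts_baseline @ direction).mean():.2f}")
```

The point of the sketch is exactly the gap you describe: a correlational direction like this licenses "there is a persona direction," but not a mechanistic story about how the persona is implemented.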
The anthropomorphic language could, though, obscure the point you're making above. I did find it a bit difficult to understand at first, whereas the more technical phrasing is clearer. In the blog post you linked, you mention that anthropomorphism is a way to communicate your message more broadly, without the jargon overhead. However, to convey your intention you have to distinguish simulators from simulacra and give a fairly lengthy explanation of how the meaning of the anthropomorphism shifts across contexts. I'm not sure that's a lower barrier to entry than understanding distribution shift and Bayesianism, at least for a technical audience.
I can see how it would be useful in very clear analogical cases, like saying a model "knows" something to mean it encodes that knowledge in a feature, or "wants" something to mean it encodes a preference in a circuit.