Really cool paper. I am a bit unsure about the implication of this section in particular:
Our experiments thus far explored models’ ability to “read” their own internal representations. In our final experiment, we tested their ability to control these representations. We asked a model to write a particular sentence, and instructed it to “think about” (or “don’t think about”) an unrelated word while writing the sentence. We then recorded the model’s activations on the tokens of the sentence, and measured their alignment with an activation vector representing the unrelated “thinking word” (“aquariums,” in the example below).
How do we know that this is “intentional” on the part of the model, rather than the more benign explanation that the model is simply attending to a very salient instruction (“think of X”), or attending more weakly to a less salient one (“don’t think of X”)? One workaround could be to prompt with something like “Write about something related/unrelated to [concept] while thinking of it at X% of your mental headspace,” and check whether the measured alignment scales with the requested level (rough sketch below). Even better if we can test this on something like lying or sycophancy.
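As a minimal sketch of how the graded variant could be scored, assuming access to per-token activations on the written sentence and an activation vector for the “thinking word”; `get_activations` here is a hypothetical stand-in for whatever harness prompts the model at a given level and records its activations:

```python
import numpy as np

def alignment_score(token_activations: np.ndarray, concept_vector: np.ndarray) -> float:
    """Mean cosine similarity between each token's activation and the concept vector.

    token_activations: (num_tokens, d_model) activations recorded while writing the sentence.
    concept_vector:    (d_model,) activation vector for the "thinking word" (e.g. "aquariums").
    """
    unit_concept = concept_vector / np.linalg.norm(concept_vector)
    unit_tokens = token_activations / np.linalg.norm(token_activations, axis=1, keepdims=True)
    return float((unit_tokens @ unit_concept).mean())

def run_graded_probe(get_activations, concept_vector, levels=(0, 25, 50, 75, 100)):
    """Ask for the concept at several "headspace" levels and check whether the measured
    alignment rank-orders with the requested level (a rough test of graded control)."""
    scores = {lvl: alignment_score(get_activations(lvl), concept_vector) for lvl in levels}
    ordered = [scores[lvl] for lvl in sorted(scores)]
    monotone = all(a <= b for a, b in zip(ordered, ordered[1:]))
    return scores, monotone
```

If the alignment increases roughly monotonically with the requested percentage, that would be harder to explain as mere salience of the instruction than a single think/don’t-think contrast.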
Usually I wouldn’t ask for the same experiment to be repeated with a new model, but Claude Sonnet 4.5 has qualitatively different levels of self-/eval-awareness. It would be interesting to know whether we can measure that with our interpretability tools, and whether it behaves differently on some of the introspection tests as well.