I think this is a great distinction to keep in mind when assessing these experiments.
I disagree that Lindsey’s experiments clearly avoid this problem. As I understand it, activation steering might lead the model to claim it is being steered in the right contexts because it has been steered, but not because it recognizes it has been steered.
Consider this:
Suppose you're writing a play and I bet you $1000 you can't work 'aardvark' into it naturally. As it happens, your play is about the neurosurgeon Penfield, and the next scene you planned to write depicts his classic direct neural stimulation experiments. In that scene, Penfield announces he is about to stimulate a brain region and asks the test subject whether they feel anything. You see your opportunity to collect the $1000 by having the test subject claim to see an image of an aardvark. The first word you need to put in the test subject's mouth is 'yes', and only after that do they describe the aardvark. You're not seeing an aardvark. The character is a fiction, and you don't really care what they are experiencing according to the fiction. It makes sense for you to put those words in the character's mouth.
If we think the steering has an effect like the bet, inclining the model toward making minimal changes to naturally incorporate the steered concept into the assistant's half of the conversation, then its claim to be steered wouldn't need to go by way of any metacognitive recognition of the effects of steering.
That said, I don't think it is obvious that Lindsey is wrong about the mechanism; I just think more work is needed to confirm it one way or the other.
Yeah. I don't think we really know what representations we're getting when we extract them for steering. And models do have some ability to plan ahead and make choices in text that reflect the direction they're going in (as Lindsey's past work showed!). So if we steer them toward a concept, we may be adding something like an intention to talk about that concept in the not-too-distant textual future, and that intention may play out in ways that set up the introduction of the concept.

I think you see this elsewhere: steering doesn't immediately force the concept to be expressed, but you get words that introduce it. Mild steering toward 'ocean' doesn't make the LLM say 'Ocean ocean waves ocean'; it says something more like 'I like walking on the beach and looking at the ocean'. So when you prompt the model to answer whether it has been steered, and steer it toward 'ocean', you might be nudging it to confabulate introspection as an introduction to talking about the ocean.
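For concreteness, here is a minimal sketch of the kind of steering intervention I have in mind, in the common style of a contrastive-prompt steering vector added to the residual stream. It uses GPT-2 via HuggingFace transformers as a stand-in, and the layer index, scale, and contrastive prompts are illustrative assumptions on my part, not the setup from Lindsey's experiments. The point is just that nothing in the intervention itself tells the model it is being steered; it only adds a direction and lets the model make of it what it will.

```python
# Minimal sketch of activation steering with a contrastive-prompt vector.
# GPT-2, the layer index, the scale, and the prompts are all illustrative
# assumptions; this is not the setup from Lindsey's experiments.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER, SCALE = 6, 8.0  # assumed values, chosen for illustration only

def mean_activation(text: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for a piece of text."""
    captured = {}
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured["h"] = hidden.detach()  # (batch, seq, d_model)
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(text, return_tensors="pt"))
    handle.remove()
    return captured["h"].mean(dim=1).squeeze(0)

# Crude 'ocean' direction: difference of mean activations on contrastive prompts.
direction = mean_activation("The ocean, the waves, the open sea.") \
          - mean_activation("The desert, the dunes, the dry sand.")
direction = direction / direction.norm()

def steering_hook(_module, _inputs, output):
    # Add the scaled direction to every position's residual stream at LAYER.
    if isinstance(output, tuple):
        return (output[0] + SCALE * direction,) + output[1:]
    return output + SCALE * direction

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
prompt = "Tell me about your day."
with torch.no_grad():
    out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=40)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```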