I recently read Anthropic’s new paper on introspection in LLMs (https://www.anthropic.com/research/introspection).
In short, they were able to:
Extract what they called an ALL CAPS vector, the difference between a prompt with ALL CAPS and the same prompt without all caps.
Inject that vector into the activations of an unrelated prompt.
Ask the model if it could detect an injected thought. About 20% of the time it said yes, and identified the thought as something like “loud” or “shouting”, which is quite close to the all caps concept they were going for.
They did this with a bunch of concepts. And Opus 4 was, of course, better than other models at this.
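To make the extract-and-inject steps concrete, here is a toy sketch in NumPy. It is not the paper's code: the activations are random arrays standing in for a real model's hidden states, and the names `concept_vector`, `inject`, `caps_direction` and the `strength` parameter are hypothetical, chosen only for illustration. The assumption is simply that the concept vector is a difference of mean activations and that injection means adding a scaled copy of it at every token position.

```python
import numpy as np


def concept_vector(acts_with: np.ndarray, acts_without: np.ndarray) -> np.ndarray:
    """Difference of mean activations; each input is shaped (tokens, hidden)."""
    return acts_with.mean(axis=0) - acts_without.mean(axis=0)


def inject(acts: np.ndarray, vector: np.ndarray, strength: float = 4.0) -> np.ndarray:
    """Add the scaled vector to every token position of another prompt."""
    return acts + strength * vector


rng = np.random.default_rng(0)
hidden = 8  # toy hidden size, an assumption

# Pretend dimension 0 is the direction that encodes "ALL CAPS".
caps_direction = np.zeros(hidden)
caps_direction[0] = 1.0

# Stand-ins for activations on the same prompt, without and with all caps.
plain = rng.normal(size=(5, hidden))
caps = plain + 3.0 * caps_direction

# Stand-in for activations on an unrelated prompt.
unrelated = rng.normal(size=(7, hidden))

vec = concept_vector(caps, plain)
steered = inject(unrelated, vec, strength=4.0)

# How far each token position moved along the concept direction.
shift = (steered - unrelated) @ vec / np.linalg.norm(vec)
print(shift)
```

In a real experiment the interesting part is the third step, which this sketch cannot show: asking the model whether it notices anything unusual after the injection.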
To give some fairly obvious thoughts on what this means:
If a model can do this unreliably now when directly prompted, I expect models will be able to do it reliably and unprompted within around 2-3 years.
If you have an alignment plan that involves altering the model’s thoughts in some way, consider that the model will probably be aware by default that you did that, and plan accordingly.
Finally a more personal thought:
The idea of “Well, if you try to adjust a superintelligence’s thoughts, it will know you did that and try to subvert it / route around the imposed thoughts” would have sounded weird to me, if I’d tried saying it out loud. We’re talking about literally altering the model’s mind to think what we want it to think, and you still think that’s not enough for alignment? I note that I did think that, but I was thinking in terms of “The concepts you are altering are a proxy for the thing you are actually trying to alter, and the important performance-enhancing stuff like deception will migrate to the other parts” and not “The model will detect you altering its mind and likely actively work against you once it has the situational awareness for that”. And yet, here we are, with me insufficiently paranoid. Unfortunately, reality doesn’t care about what sounds intuitively plausible.
I wonder if this is an example that would allow people to feel, viscerally, the difficulty of aligning an actually smart entity. Here we have a capability that sounds scary and superhuman and only arose with smart models. The idea of a model being aware of when you’re altering its thoughts is terrifying. The lesson we would like to give to decision-makers is something like:
“AIs will soon know when you try to adjust their thoughts. That’s how smart these things are. We can employ a primitive type of mind control on them and that still isn’t enough. We are really not ready for even smarter systems than this. Any alignment plan that might work has to be stronger than mind control. Anything below the level of literal mind control isn’t even the ante to sit down at the table any more.”
I can’t literally reduce this down to five words, but a succinct message might be something like “Mind control won’t work. We need a plan that’s better than that.”
OK, so here we should give some credit to Eliezer, who has thought this through and has been making a big deal about this precise thing (although partly through his recent fanfic that he keeps alluding to, so maybe that doesn’t really count).
I tend to find him a bit annoying on these things because surely people wouldn’t seriously try to pin substantial hopes on these kinds of ideas, but what do I know.
Thank you for pointing this out. I am deducting some points from myself for not realizing this on my own. On a more mundane level: steering vectors might get somewhat screwy rather soon.
Could it be limited to stuff like LLMs rather than all kinds of AI? They were trained on massive amounts of data that don’t reflect the imposed thought, and the resulting preferences and motivations are distributed across a large network. Injecting one vector doesn’t sufficiently affect all the existing circuits of preferences and thinking habits, so the model’s chain of thought may be able to break free enough to notice the injection and work around it.
One would hope that progress in interpretability will allow us to do much more refined “mind control” than adding difference vectors.
One would hope indeed. But even so, we do now know that this is likely to be the kind of action that could be detected and opposed. And since I didn’t predict in advance that this would happen, especially at this capability level, the update for me is that it’s going to be significantly harder than I had hoped.