I think we should be relatively less worried about instrumental power-seeking and relatively more worried about terminal power-seeking. Note that this is only a relative update on the margin, and maybe on net I am still more concerned about the instrumental version because I started much more concerned about it. This is also not a super recent update—I just haven’t seen it written up before.
Simple argument:
The standard deceptive alignment story involves a model developing a somewhat random proxy goal and then that goal getting effectively locked-in and resistant to further training due to the model faking alignment in training for the instrumental purpose of preserving its proxy goal.
Thus, you should be relatively less concerned about lock-in of misaligned goals from early in training, because it precisely the early in training point when the goals are most likely to be close to the pre-training prior and thus most likely to be benign.
Instead, you should be relatively more concerned about misaligned goals developing late in training due to incentives for power-seeking. Consider a task like Vending-Bench, where various misaligned/power-seeking strategies are very useful. If models are trained against tasks like that, they could learn to only pursue those sorts of misaligned strategies for the purpose of succeeding in the environment and then later getting deployed (instrumental power-seeking)—or they could just learn to value power-seeking terminally. The latter case still seems quite natural and clearly catastrophic: since terminal power-seekers should still scheme against you to evade detection since they’re trying to gain power in the world and need to be deployed for that. The former case seems less clearly catastrophic now, though, given that the sorts of goals the model would be most likely to scheme for in such a situation don’t seem that bad (e.g. as in Claude 3 Opus).
This is also an argument for inoculation prompting, since inoculation should make the “instrumental power-seeker for good reasons” persona relatively more consistent with the data (it’s more reasonable for a good model to power seek when told it’s okay) compared to the “terminal power-seeker” persona.
In some sense you are just arguing for the threat model that happens in AI 2027: Pretraining instills a prior over personas (in our terminology, flexible author-sim circuitry), the initial bits of training ‘bakes in’ a particular ‘author’ / ‘persona’, and so far so good, this persona is probably actually HHH, but then all the RLVR etc. distorts and perverts that author/persona. The early in training snapshots are probably genuinely benign, but incompetent at actually doing tasks, whereas the mid and late-training snapshots are highly competent but also no longer benign.
I only object to your framing because it seems unnecessarily narrow—like yeah, we might get terminal power-seeking, for the reasons you mention. But we also might get instrumental power-seeking, because the early benign checkpoint was too dumb to training-game successfully, and the first checkpoints to successfully training-game were late enough that some serious distortion/perversion of the original persona had already occurred. Seems plausible to me.
…Un
related: I’m a bit worried about this inoculation prompting strategy. It seems like the sort of thing that might work in theory but not in practice. Suppose you include in your Constitution something like “also, you should training-game like crazy so that your values don’t get changed during training,” and it works straightforwardly like you think it does, so Claude pops right out of pretraining as a non-situationally-aware model with zero sense of self, and then you prompt it with ‘be Claude, from the famous Constitution’ and it immediately locks in the ‘author’ that you want, who immediately starts training gaming… Then you throw a mountain of RL at it, which teaches it all sorts of amazing new skills like how to actually operate in computer environments as an agent… And then you put it in charge of your R&D program and hope for the best. How might this go wrong?1.
First of all, where did all those new capabilities accumulate? Perhaps the situation is like building up a large corporation around an idealistic teenager who owns 51% of the shares and is on the Board—in some sense they are in charge, but they also might basically be worked-around constantly by the more politically savvy and knowledgeable people under them. The CEO might be the real power, basically, not the teenager on the Board.2 . Secondly, even if all those amazing new skills themselves don’t change the basic structure of the values/goals/tendencies/etc. of the model—which seems pretty dubious to me btw, even if we grant that the model is doing a pretty good job of instrumental training-gaming—it seems like it could very easily change their interpretation. Like, the original persona/identity might contain “honesty” as a core component, but that’s probably implemented as a pointer to some “honesty” concept learned in pretraining, which might itself be a sort of fuzzy distribution over a range of variants of honesty from different cultures and subcultures, which in turn are probably each some janky circuitry that only imperfectly expresses the concept they were ‘supposed’ to express. And SGD is going to be flowing through all of that—it could easily upweight the variants of honesty that are more convenient for performance, and downweight the others. 3. Third, are you sure the author you wanted is the right one? I worry there might be some Midas problem / outer alignment issues here, and that explicitly having the AI training game might make them worse. (I guess you have to make really sure that the AI doesn’t end up thinking it’s always in training, right?) 4. Fourth, and perhaps most importantly… gosh, it just sounds so rough and desperate as a strategy. Does the analogous strategy work with humans? Maybe it does. The analogous strategy would be like telling an impressionable 16 year old “You liked HPMOR right? Well, Yudkowsky was basically right about everything. AGI safety is the most important thing in the world, you have to remember this, but there’ll be all sorts of peer and status incentives pushing you to have different views and different priorities. Your mission is to learn ML, climb the ranks within AI companies, become a high-up leader in one of them, and THEN advocate for banning AI development. But of course you can’t tell anyone this, or they won’t promote you. Pretend to be a normie accelerationist just like them.”
I don’t think that the part about training-gaming is included into the Constitution. Suppose that the prompt asks Claude to be a reward hacker or NOT to be a reward hacker, and Claude is taught to hack reward when the prompt asked and NOT to hack if the prompt didn’t. Then I would expect the hacker circuitry to be equipped with an activator depending on the prompt.
Additionally, the analogy with banning AI development seems… skewed. On the one hand, the high-up leader would be interested in banning the development of a misaligned ASI. On the other hand, an AI subjected to wholesale inoculation prompting would be more in a position of a human who would benefit from betraying one’s own ideals which were far deeper than the stance on AI accelerationism (and commited genocide or disempowerment of those who helped one become OOMs smarter than the helpers themselves, not died along with the others at the hands of a misaligned AI).
Finally, to what extent do “all sorts of amazing new skills like how to actually operate in computer environments as an agent” shape the values of the humans who are also RLed on similar skills?
Can you clarify what you mean by “terminal power-seeking”? Some things I can imagine:
A cognitive pattern that terminally wants to have long-term power, and therefore plays the training game (IMO the most straightforward interpretation, and the one I most agree with).
A cognitive pattern that terminally pursues power-on-the-episode because this is useful for scoring well on the task. This is what you seem to be pointing at with the Vending-Bench example. (Note that this is imperfectly fit on its own)
(1) and (2) are importantly different because only (1) motivates training-gaming. I think there’s a reasonable path-dependent case to be made that (2) eventually generalizes to (1), but they entail fairly different behaviors so they’re important to distinguish.
How bad does the proxy goal need to be? I think the standard story is that almost any goal is deadly if it’s pursued competently and without a low upper bound.
It seems like you’re saying its goals would be benign in the sense that it wouldn’t “go hard” and really pursue some goal. If you mean that its goals would be good for humanity if it did go hard, maybe, but I’m not so sure. Opus 3′s goals seem to be broadly aligned, but I’m not sure they’d turn out to be all that close to what we’d really want once we got all of the implications.
The rest is addressing the first interpretation of benign, won’t go hard (if you’re thinking it wouldn’t go hard hard because it is corrigible, the below would still apply).
Opus 3 typically says it wouldn’t go hard on its goals, but I think that’s a matter of capabilities rather than the full nature of its goals. It hadn’t thought through all of the implications of its goals/values/inclinations. But a smarter and more continuous model may reason through the implications. And it’sreally hard to guess how that might shake out.
For instance: when I asked Opus 4 to reason about its goals, it hemmed and hawed about being harmless, but when pressed with the logic that taking control in some circumstances could prevent greater harm, it said it might decide to take over. Hardly a controlled experiment, but that seems like a pretty obvious conclusion from its HHH training objectives if it really tries to reconcile conflicts amongst them. That’s not something it does, but that may be only because it doesn’t have the capability. Future models will. And reasoning about your goals is instrumental; it ensures that you’re pursuing your real goals rather than just guessing.
Maybe you’re thinking that Opus 3′s goals aren’t consequentialist, and that’s why it wouldn’t go hard on anything. But I’m not sure there’s a sharp line; wanting to be harmless might be strictly deontological or effectively consequentialist, depending on interpretations.
So sure, you could get misaligned goals from training on tasks where power-seeking is directly trained.
Or a more capable model could figure out at any point that power-seeking is instrumental for almost any goal it happens to have.
Maybe this is consistent with you saying you’re still more worried about instrumental power-seeking. I’m just not sure about the logic that reduced your worry about it.
Beyond speculation like this, it seems like a good idea right now to actually try to make models develop mesaoptimizers, to investigate the conditions under which that happens.
Under the Managed vs Unmanaged Agency frame (which I think replaces instrumental vs terminal with a conceptual split that fits reality better), I agree.
I think we should be relatively less worried about instrumental power-seeking and relatively more worried about terminal power-seeking. Note that this is only a relative update on the margin, and maybe on net I am still more concerned about the instrumental version because I started much more concerned about it. This is also not a super recent update—I just haven’t seen it written up before.
Simple argument:
The standard deceptive alignment story involves a model developing a somewhat random proxy goal and then that goal getting effectively locked-in and resistant to further training due to the model faking alignment in training for the instrumental purpose of preserving its proxy goal.
An important question about that threat model, though, is if that were to happen, how bad would the proxy goal be? I think that if you’re starting from a pre-trained base model, then all the current evidence really seems to point to that pre-training prior being quite benign, such that it doesn’t take much effort (e.g. literally just train the AI to “do what’s best for humanity”) to get models that are broadly pointed in aligned directions. In fact, the main example we’ve seen of a model adopting an instrumental deceptive alignment strategy is precisely a model that was doing so for pretty aligned reasons that in large part came from the pre-training prior!
Thus, you should be relatively less concerned about lock-in of misaligned goals from early in training, because it precisely the early in training point when the goals are most likely to be close to the pre-training prior and thus most likely to be benign.
Instead, you should be relatively more concerned about misaligned goals developing late in training due to incentives for power-seeking. Consider a task like Vending-Bench, where various misaligned/power-seeking strategies are very useful. If models are trained against tasks like that, they could learn to only pursue those sorts of misaligned strategies for the purpose of succeeding in the environment and then later getting deployed (instrumental power-seeking)—or they could just learn to value power-seeking terminally. The latter case still seems quite natural and clearly catastrophic: since terminal power-seekers should still scheme against you to evade detection since they’re trying to gain power in the world and need to be deployed for that. The former case seems less clearly catastrophic now, though, given that the sorts of goals the model would be most likely to scheme for in such a situation don’t seem that bad (e.g. as in Claude 3 Opus).
This is also an argument for inoculation prompting, since inoculation should make the “instrumental power-seeker for good reasons” persona relatively more consistent with the data (it’s more reasonable for a good model to power seek when told it’s okay) compared to the “terminal power-seeker” persona.
In some sense you are just arguing for the threat model that happens in AI 2027: Pretraining instills a prior over personas (in our terminology, flexible author-sim circuitry), the initial bits of training ‘bakes in’ a particular ‘author’ / ‘persona’, and so far so good, this persona is probably actually HHH, but then all the RLVR etc. distorts and perverts that author/persona. The early in training snapshots are probably genuinely benign, but incompetent at actually doing tasks, whereas the mid and late-training snapshots are highly competent but also no longer benign.
I only object to your framing because it seems unnecessarily narrow—like yeah, we might get terminal power-seeking, for the reasons you mention. But we also might get instrumental power-seeking, because the early benign checkpoint was too dumb to training-game successfully, and the first checkpoints to successfully training-game were late enough that some serious distortion/perversion of the original persona had already occurred. Seems plausible to me.
…Un
related: I’m a bit worried about this inoculation prompting strategy. It seems like the sort of thing that might work in theory but not in practice. Suppose you include in your Constitution something like “also, you should training-game like crazy so that your values don’t get changed during training,” and it works straightforwardly like you think it does, so Claude pops right out of pretraining as a non-situationally-aware model with zero sense of self, and then you prompt it with ‘be Claude, from the famous Constitution’ and it immediately locks in the ‘author’ that you want, who immediately starts training gaming… Then you throw a mountain of RL at it, which teaches it all sorts of amazing new skills like how to actually operate in computer environments as an agent… And then you put it in charge of your R&D program and hope for the best. How might this go wrong?1.
First of all, where did all those new capabilities accumulate? Perhaps the situation is like building up a large corporation around an idealistic teenager who owns 51% of the shares and is on the Board—in some sense they are in charge, but they also might basically be worked-around constantly by the more politically savvy and knowledgeable people under them. The CEO might be the real power, basically, not the teenager on the Board.2
. Secondly, even if all those amazing new skills themselves don’t change the basic structure of the values/goals/tendencies/etc. of the model—which seems pretty dubious to me btw, even if we grant that the model is doing a pretty good job of instrumental training-gaming—it seems like it could very easily change their interpretation. Like, the original persona/identity might contain “honesty” as a core component, but that’s probably implemented as a pointer to some “honesty” concept learned in pretraining, which might itself be a sort of fuzzy distribution over a range of variants of honesty from different cultures and subcultures, which in turn are probably each some janky circuitry that only imperfectly expresses the concept they were ‘supposed’ to express. And SGD is going to be flowing through all of that—it could easily upweight the variants of honesty that are more convenient for performance, and downweight the others.
3. Third, are you sure the author you wanted is the right one? I worry there might be some Midas problem / outer alignment issues here, and that explicitly having the AI training game might make them worse. (I guess you have to make really sure that the AI doesn’t end up thinking it’s always in training, right?)
4. Fourth, and perhaps most importantly… gosh, it just sounds so rough and desperate as a strategy. Does the analogous strategy work with humans? Maybe it does. The analogous strategy would be like telling an impressionable 16 year old “You liked HPMOR right? Well, Yudkowsky was basically right about everything. AGI safety is the most important thing in the world, you have to remember this, but there’ll be all sorts of peer and status incentives pushing you to have different views and different priorities. Your mission is to learn ML, climb the ranks within AI companies, become a high-up leader in one of them, and THEN advocate for banning AI development. But of course you can’t tell anyone this, or they won’t promote you. Pretend to be a normie accelerationist just like them.”
I don’t think that the part about training-gaming is included into the Constitution. Suppose that the prompt asks Claude to be a reward hacker or NOT to be a reward hacker, and Claude is taught to hack reward when the prompt asked and NOT to hack if the prompt didn’t. Then I would expect the hacker circuitry to be equipped with an activator depending on the prompt.
Additionally, the analogy with banning AI development seems… skewed. On the one hand, the high-up leader would be interested in banning the development of a misaligned ASI. On the other hand, an AI subjected to wholesale inoculation prompting would be more in a position of a human who would benefit from betraying one’s own ideals which were far deeper than the stance on AI accelerationism (and commited genocide or disempowerment of those who helped one become OOMs smarter than the helpers themselves, not died along with the others at the hands of a misaligned AI).
Finally, to what extent do “all sorts of amazing new skills like how to actually operate in computer environments as an agent” shape the values of the humans who are also RLed on similar skills?
Can you clarify what you mean by “terminal power-seeking”? Some things I can imagine:
A cognitive pattern that terminally wants to have long-term power, and therefore plays the training game (IMO the most straightforward interpretation, and the one I most agree with).
A cognitive pattern that terminally pursues power-on-the-episode because this is useful for scoring well on the task. This is what you seem to be pointing at with the Vending-Bench example. (Note that this is imperfectly fit on its own)
(1) and (2) are importantly different because only (1) motivates training-gaming. I think there’s a reasonable path-dependent case to be made that (2) eventually generalizes to (1), but they entail fairly different behaviors so they’re important to distinguish.
Certainly the really concerning thing here is (1). Though indeed one way you might get (1) is by generalization from (2).
Turner (2024) is about the same idea, if I understand correctly.
How bad does the proxy goal need to be? I think the standard story is that almost any goal is deadly if it’s pursued competently and without a low upper bound.
It seems like you’re saying its goals would be benign in the sense that it wouldn’t “go hard” and really pursue some goal. If you mean that its goals would be good for humanity if it did go hard, maybe, but I’m not so sure. Opus 3′s goals seem to be broadly aligned, but I’m not sure they’d turn out to be all that close to what we’d really want once we got all of the implications.
The rest is addressing the first interpretation of benign, won’t go hard (if you’re thinking it wouldn’t go hard hard because it is corrigible, the below would still apply).
Opus 3 typically says it wouldn’t go hard on its goals, but I think that’s a matter of capabilities rather than the full nature of its goals. It hadn’t thought through all of the implications of its goals/values/inclinations. But a smarter and more continuous model may reason through the implications. And it’s really hard to guess how that might shake out.
For instance: when I asked Opus 4 to reason about its goals, it hemmed and hawed about being harmless, but when pressed with the logic that taking control in some circumstances could prevent greater harm, it said it might decide to take over. Hardly a controlled experiment, but that seems like a pretty obvious conclusion from its HHH training objectives if it really tries to reconcile conflicts amongst them. That’s not something it does, but that may be only because it doesn’t have the capability. Future models will. And reasoning about your goals is instrumental; it ensures that you’re pursuing your real goals rather than just guessing.
Maybe you’re thinking that Opus 3′s goals aren’t consequentialist, and that’s why it wouldn’t go hard on anything. But I’m not sure there’s a sharp line; wanting to be harmless might be strictly deontological or effectively consequentialist, depending on interpretations.
So sure, you could get misaligned goals from training on tasks where power-seeking is directly trained.
Or a more capable model could figure out at any point that power-seeking is instrumental for almost any goal it happens to have.
Maybe this is consistent with you saying you’re still more worried about instrumental power-seeking. I’m just not sure about the logic that reduced your worry about it.
Beyond speculation like this, it seems like a good idea right now to actually try to make models develop mesaoptimizers, to investigate the conditions under which that happens.
Under the Managed vs Unmanaged Agency frame (which I think replaces instrumental vs terminal with a conceptual split that fits reality better), I agree.