New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”
(Cross-posted from my website)
I’ve written a report about whether advanced AIs will fake alignment during training in order to get power later – a behavior I call “scheming” (also sometimes called “deceptive alignment”). The report is available on arXiv here. There’s also an audio version here, and I’ve included the introductory section below (with audio for that section here). This section includes a full summary of the report, which covers most of the main points and technical terminology. I’m hoping that the summary will provide much of the context necessary to understand individual sections of the report on their own.
This report examines whether advanced AIs that perform well in training will be doing so in order to gain power later – a behavior I call “scheming” (also sometimes called “deceptive alignment”). I conclude that scheming is a disturbingly plausible outcome of using baseline machine learning methods to train goal-directed AIs sophisticated enough to scheme (my subjective probability on such an outcome, given these conditions, is ~25%). In particular: if performing well in training is a good strategy for gaining power (as I think it might well be), then a very wide variety of goals would motivate scheming – and hence, good training performance. This makes it plausible that training might either land on such a goal naturally and then reinforce it, or actively push a model’s motivations towards such a goal as an easy way of improving performance. What’s more, because schemers pretend to be aligned on tests designed to reveal their motivations, it may be quite difficult to tell whether this has occurred. However, I also think there are reasons for comfort. In particular: scheming may not actually be such a good strategy for gaining power; various selection pressures in training might work against schemer-like goals (for example, relative to non-schemers, schemers need to engage in extra instrumental reasoning, which might harm their training performance); and we may be able to increase such pressures intentionally. The report discusses these and a wide variety of other considerations in detail, and it suggests an array of empirical research directions for probing the topic further.
Agents seeking power often have incentives to deceive others about their motives. Consider, for example, a politician on the campaign trail (“I care deeply about your pet issue”), a job candidate (“I’m just so excited about widgets”), or a child seeking a parent’s pardon (“I’m super sorry and will never do it again”).
This report examines whether we should expect advanced AIs whose motives seem benign during training to be engaging in this form of deception. Here I distinguish between four (increasingly specific) types of deceptive AIs:
Alignment fakers: AIs pretending to be more aligned than they are.
Training gamers: AIs that understand the process being used to train them (I’ll call this understanding “situational awareness”), and that are optimizing for what I call “reward on the episode” (and that will often have incentives to fake alignment, if doing so would lead to reward).
Power-motivated instrumental training-gamers (or “schemers”): AIs that are training-gaming specifically in order to gain power for themselves or other AIs later.
Goal-guarding schemers: Schemers whose power-seeking strategy specifically involves trying to prevent the training process from modifying their goals.
I think that advanced AIs fine-tuned on uncareful human feedback are likely to fake alignment in various ways by default, because uncareful feedback will reward such behavior. And plausibly, such AIs will play the training game as well. But my interest, in this report, is specifically in whether they will do this as part of a strategy for gaining power later – that is, whether they will be schemers (this sort of behavior is often called “deceptive alignment” in the literature, though I won’t use that term here). I aim to clarify and evaluate the arguments for and against expecting this.
My current view is that scheming is a worryingly plausible outcome of training advanced, goal-directed AIs using baseline machine learning methods (for example: self-supervised pre-training followed by RLHF on a diverse set of real-world tasks). The most basic reason for concern, in my opinion, is that:
1. Performing well in training may be a good instrumental strategy for gaining power in general.
2. If it is, then a very wide variety of goals would motivate scheming (and hence good training performance); whereas the non-schemer goals compatible with good training performance are much more specific.
The combination of (1) and (2) makes it seem plausible, to me, that conditional on training creating a goal-directed, situationally-aware model, it might well instill a schemer-like goal for one reason or another. In particular:
Training might land on such a goal “naturally” (whether before or after situational awareness arises), because such a goal initially leads to good-enough performance in training even absent training-gaming. (And this especially if you’re intentionally trying to induce your model to optimize over long time horizons, as I think there will be incentives to do.)
Even if schemer-like goals don’t arise “naturally,” actively turning a model into a schemer may be the easiest way for SGD to improve the model’s training performance, once the model has the situational awareness to engage in training-gaming at all.
What’s more, because schemers actively pretend to be aligned on tests designed to reveal their motivations, it may be quite difficult to tell whether this has occurred. That seems like reason for serious concern.
However, I also think that there are reasons for comfort. I’ll group these into two categories.
The first category centers on the possibility that scheming isn’t actually a very convergently-good instrumental strategy.
For example: the most common story about schemers focuses on goal-guarding schemers, which are training-gaming in order to prevent their goals from being modified by the training process. But it’s not clear to me that this will work well enough (in particular: goal-modification might continue despite training-gaming), or that the resulting goals will get sufficiently large amounts of future power, with sufficiently high probability, to justify any costs the model is incurring by training-gaming rather than just going for its goals directly (if these costs are meaningful – which for some models they might not be).
There are alternative stories about why scheming will be a good instrumental strategy – stories, for example, that center on coordination amongst AIs with different values; on the possibility that AIs will have quite similar values by default; and on the possibility that AIs will intrinsically value things like “my survival” (in a manner tolerant of values-change) and/or “AI empowerment” (in a manner tolerant of very different values in the AIs-so-empowered). And while these stories do make scheming seem like a more disjunctive hypothesis, a number of them seem to me some combination of (a) more speculative and (b) requiring that we posit models with much more specific sorts of goals (thereby dulling the force of (2) above).
The second category centers on forms of selection pressure that may push against scheming – both “naturally,” and because we may be able to intentionally introduce/emphasize these selection pressures in training.
For example: relative to other sorts of models, schemers need to engage in extra instrumental reasoning (e.g., backchaining from their long-term goal to the instrumental value of training-gaming, and checking for escape/defection opportunities). This may come at a substantive cost to their performance – a cost we may be able to intentionally increase by e.g. putting models under more time pressure, forcing them to rely on human-legible scratch-pads for their reasoning, and making genuine escape/defection opportunities more difficult to distinguish from traps. And to the extent that schemers engage in active attempts to undermine human alignment and security efforts while still under human control (the scariest form of scheming, in my view), their performance seems likely to take a more notable hit.
What’s more, my own guess is that the costs of this extra instrumental reasoning substantially outweigh what are sometimes called the “simplicity” benefits of scheming – i.e., benefits from the fact that selecting a schemer allows SGD to select a simpler goal (since it has such a wide variety to choose from). And this especially given that schemers and non-schemers alike will need to represent concepts like “honesty,” “human values,” “reward,” and so on in detail – so schemers plausibly save on complexity only via differences in the complexity of using/repurposing/”pointing at” a given concept in guiding a model’s motivations.
Also: prior to situational awareness, training will actively penalize schemer-like goals whenever they lead to a sacrifice of reward on the episode (more about what I mean by “episode” below). And we can try to actively push in this direction further – especially if we can learn to detect and control when situational awareness arises.
I think the second category, here, is especially important. In particular: one way that arguments appealing to the comparative number (and/or “measure”) of schemer-like vs. non-schemer goals that perform well in training can go wrong is by neglecting the full power of the selection pressure being applied to the goal that training selects. So if the pressure applied in training works against schemers overall, this is very good news.
The full report discusses these and a wide variety of other relevant considerations in much more detail (and see section 0.2 for a full summary).
Overall, as a rough stab at quantifying my current view: if you train a coherently goal-directed, situationally aware, somewhat-better-than-human-level model using baseline forms of self-supervised pre-training + RLHF on diverse, long-horizon, real-world tasks, my subjective probability is ~25% that this model will be performing well in training in substantial part as part of an instrumental strategy for seeking power for itself and/or other AIs later.
I also think it may be possible to reduce the risk here in fairly simple ways: for example, via training on shorter-horizon tasks, or via more intensive forms of adversarial training prior to situational awareness.
And I think the risk increases with more powerful models, who may be better positioned to escape/take-over (and to coordinate/cooperate with other AIs in doing so).
Also, and importantly: non-schemers can engage in schemer-like behavior, too. For example:
Models that are only optimizing for reward on a given episode can fake their alignment (or even: engage in escape/take-over) in an effort to get that reward (especially if the episode in question is quite long).
Models that aren’t playing the training game can still end up with power-seeking goals that incentivize various forms of deception.
The eventual AIs that matter most might differ in important ways from the paradigm sort of AI I focus on here – for example, they might be more like “language model agents” than single models, or they might be created via methods that differ even more substantially from the sort of baseline ML methods I’m focused on – while still engaging in power-motivated alignment-faking.
So scheming as I’ve defined it is far from the only concern in this vicinity. Rather, it’s a paradigm instance of this sort of concern, and one that seems, to me, especially pressing to understand. At the end of the report, I discuss an array of possible empirical research directions for probing the topic further.
(This section offers a few more preliminaries to frame the report’s discussion. Those eager for the main content can skip to the summary of the report in section 0.2.)
I wrote this report centrally because I think that the probability of scheming/”deceptive alignment” is one of the most important questions in assessing the overall level of existential risk from misaligned AI. Indeed, scheming is notably central to many models of how this risk arises. And as I discuss below, I think it’s the scariest form that misalignment can take.
Yet: for all its importance to AI risk, the topic has received comparatively little direct public attention. And my sense is that discussion of it often suffers from haziness about the specific pattern of motivation/behavior at issue, and why one might or might not expect it to occur. My hope, in this report, is to lend clarity to discussion of this kind, to treat the topic with depth and detail commensurate to its importance, and to facilitate more ongoing research. In particular, and despite the theoretical nature of the report, I’m especially interested in informing empirical investigation that might shed further light.
I’ve tried to write for a reader who isn’t necessarily familiar with any previous work on scheming/”deceptive alignment.” For example: in sections 1.1 and 1.2, I lay out, from the ground up, the taxonomy of concepts that the discussion will rely on. For some readers, this may feel like re-hashing old ground. I invite those readers to skip ahead as they see fit (especially if they’ve already read the summary of the report, and so know what they’re missing).
That said, I do assume more general familiarity with (a) the basic arguments about existential risk from misaligned AI, and (b) a basic picture of how contemporary machine learning works. And I make some other assumptions as well, namely:
That the relevant sort of AI development is taking place within a machine learning-focused paradigm (and a socio-political environment) broadly similar to that of 2023.
That we don’t have strong “interpretability tools” (i.e., tools that help us understand a model’s internal cognition) that could help us detect/prevent scheming.
That the AIs I discuss are goal-directed in the sense of: well-understood as making and executing plans, in pursuit of objectives, on the basis of models of the world. (I don’t think this assumption is innocuous, but I want to separate debates about whether to expect goal-directedness per se from debates about whether to expect goal-directed models to be schemers – and I encourage readers to do so as well.)
Finally, I want to note an aspect of the discussion in the report that makes me quite uncomfortable: namely, it seems plausible to me that in addition to potentially posing existential risks to humanity, the sorts of AIs discussed in the report might well be moral patients in their own right. I talk, here, as though they are not, and as though it is acceptable to engage in whatever treatment of AIs best serves our ends. But if AIs are moral patients, this is not the case – and when one finds oneself saying (and especially: repeatedly saying) “let’s assume, for the moment, that it’s acceptable to do whatever we want to x category of being, despite the fact that it’s plausibly not,” one should sit up straight and wonder. I am here setting aside issues of AI moral patienthood not because they are unreal or unimportant, but because they would introduce a host of additional complexities to an already-lengthy discussion. But these complexities are swiftly descending upon us, and we need concrete plans for handling them responsibly.
0.2 Summary of the report
This section gives a summary of the full report. It includes most of the main points and technical terminology (though unfortunately, relatively few of the concrete examples meant to make the content easier to understand). I’m hoping it will (a) provide readers with a good sense of which parts of the main text will be most of interest to them, and (b) empower readers to skip to those parts without worrying too much about what they’ve missed.
0.2.1 Summary of section 1
The report has four main parts. The first part (section 1) aims to clarify the different forms of AI deception above (section 1.1), to distinguish schemers from the other possible model classes I’ll be discussing (section 1.2), and to explain why I think that scheming is a uniquely scary form of misalignment (section 1.3). I’m especially interested in contrasting schemers with:
Reward-on-the-episode seekers: that is, AI systems that terminally value some component of the reward process for the episode, and that are playing the training game for this reason.
Training saints: AI systems that are directly pursuing the goal specified by the reward process (I’ll call this the “specified goal”).
Misgeneralized non-training-gamers: AIs that are neither playing the training game nor pursuing the specified goal.
Here’s a diagram of the overall taxonomy:
All of these model classes can be misaligned and dangerous. But I think schemers are especially scary. In particular: scheming prompts the most robust and adversarial efforts to prevent humans from learning about the misalignment in question; and it’s the most likely to motivate what I call “early undermining” – that is, AIs at a comparatively early stage of AI development actively trying to undermine human efforts to align, control, and secure future AI systems.
I also discuss (in section 1.4) whether theoretical arguments of the type the report focuses on are even useful to consider. My answer here is “somewhat useful,” both to inform empirical investigation and as a source of evidence in the absence of adequate empirical data, but we should clearly be looking for as much empirical evidence as we can (I offer some suggestions in this vein in section 6).
In section 1.5, I discuss a factor that seems to me important in assessing various of the arguments I consider: namely, the amount of “slack” that training gives AIs to perform less-than-optimally.
0.2.2 Summary of section 2
The second part of the report examines the prerequisites for scheming. In particular, I focus on:
Situational awareness: the model understands that it’s a model in a training process, what the training process will reward, and the basic nature of the objective world in general.
Beyond-episode goals: the model cares about the consequences of its actions after the episode is complete.
Aiming at reward-on-the-episode as part of a power-motivated instrumental strategy: the model believes that its beyond-episode goals will be better achieved if it optimizes for reward-on-the-episode – and in particular, that it, or some other AIs, will get more power if it does this.
Section 2.1 discusses situational awareness. I think that absent active effort, we should at least expect certain kinds of advanced AIs – for example, advanced AIs that are performing real-world tasks in live interaction with the public internet – to be situationally aware by default, because (a) situational awareness is very useful in performing the tasks in question (indeed, we might actively train it into them), and (b) such AIs will likely be exposed to the information necessary to gain such awareness. However, I don’t focus much on situational awareness in the report. Rather, I’m more interested in whether to expect the other two prerequisites above in situationally-aware models.
Section 2.2 discusses beyond-episode goals. Here I distinguish (in section 2.2.1) between two concepts of an “episode,” namely:
The incentivized episode: that is, the temporal horizon that the gradients in training actively pressure the model to optimize over.
The intuitive episode: that is, some other intuitive temporal unit that we call the “episode” for one reason or another (e.g., reward is given at the end of it; actions in one such unit have no obvious causal path to outcomes in another; etc).
When I use the term “episode” in the report, I’m talking about the incentivized episode. Thus, “beyond-episode goals” means: goals whose temporal horizon extends beyond the horizon that training actively pressures models to optimize over. But very importantly, the incentivized episode isn’t necessarily the intuitive episode. That is, deciding to call some temporal unit an “episode” doesn’t mean that training isn’t actively pressuring the model to optimize over a horizon that extends beyond that unit: you need to actually look in detail at how the gradients flow (work that I worry casual readers of this report might neglect).
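The point about gradient flow can be made concrete with a toy sketch (my own illustration, not from the report): in a hypothetical one-parameter training setup, what the model is pressured to optimize is determined by which steps’ rewards actually enter the training loss, not by which temporal unit we happen to label an “episode.”

```python
# Toy illustration (hypothetical setup): a scalar "policy" parameter
# theta controls an action repeated at each timestep, and each timestep t
# contributes a reward with gradient dr_t/dtheta. If the training loss
# only sums rewards up to some horizon H, then gradients carry zero
# pressure from rewards after H -- that horizon is the "incentivized
# episode," whatever unit we intuitively call an "episode."

def grad_of_truncated_return(reward_grads, horizon):
    """Gradient of sum_{t < horizon} r_t with respect to theta,
    given the per-step reward gradients dr_t/dtheta."""
    return sum(reward_grads[:horizon])

# Per-step reward gradients: steps 0-4 fall inside the intuitive
# "episode"; steps 5-9 fall after it.
reward_grads = [1.0] * 5 + [3.0] * 5

# If the return used in training truncates at step 5, the later
# (larger) reward gradients exert no pressure at all:
print(grad_of_truncated_return(reward_grads, horizon=5))   # 5.0

# If the return actually spans all 10 steps, training pressures the
# model to optimize over the full horizon -- even if we still call
# the first 5 steps "the episode":
print(grad_of_truncated_return(reward_grads, horizon=10))  # 20.0
```

This is why, as above, one has to look at how the gradients actually flow: the truncation point in the loss, not the label, fixes the horizon training pressures the model to care about.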
I also distinguish (in section 2.2.2) between two types of beyond-episode goals, namely:
Training-game-independent beyond-episode goals: that is, beyond-episode goals that arise independent of their role in motivating a model to play the training game.
Training-game-dependent beyond-episode goals: that is, beyond-episode goals that arise specifically because they motivate training-gaming.
These two sorts of beyond-episode goals correspond to two different stories about how scheming happens.
In the first sort of story, SGD happens to instill beyond-episode goals in a model “naturally” (whether before situational awareness arises, or afterwards), and then those goals begin to motivate scheming.
In the second sort of story, SGD “notices” that giving a model beyond-episode goals would motivate scheming (and thus, high-reward behavior), and so actively gives it such goals for this reason.
In section 2.2.2.1, I discuss training-game-independent beyond-episode goals (that is, paths 1 and 2). Should we expect beyond-episode goals to arise “naturally”?
One reason to expect this is that plausibly, goals don’t come with temporal limitations by default – and “model time” might differ from “calendar time” regardless.
One reason to not expect this is that training will actively punish beyond-episode goals whenever they prompt the model to sacrifice reward-on-the-episode for some beyond-episode benefit. And we may be able to use adversarial training to search out such goals and punish them more actively.
In section 2.2.2.2, I discuss training-game-dependent beyond-episode goals. In particular, I highlight the question of whether SGD will be adequately able to “notice” the benefits of turning a non-schemer into a schemer, given the need to make the transition incrementally, via tiny changes to the model’s weights, each of which improves the reward. I think that this is a serious objection to stories focused on training-game-dependent beyond-episode goals, but I also don’t feel that I can confidently rule out SGD’s ability to make a transition of this kind (see e.g. “you can’t evolve eyes” as an example of places I think intuitions in this vein can go wrong).
I also discuss, in section 2.2.3, the possibility that the model’s goal-directedness will be “messy” in a way that complicates training-game-dependent stories where SGD can simply modify a model’s goal and thereby smoothly redirect its “goal-achieving engine” in a new direction (I think that this sort of “messiness” is quite likely). And I touch on a broader intuition that scheming requires an unusually high standard of goal-directedness more generally. I think this intuition has less force once we condition on the sort of goal-directedness typically assumed in the alignment discourse (and which I generally assume in the report as well). But it’s relevant to our overall probabilities regardless.
In section 2.2.4, I discuss whether beyond-episode goals are more likely in models actively trained to have long-term (even if still: within-episode) goals. I think that beyond-episode goals are indeed more likely in models of this kind – and that baseline incentives to create models that engage in fairly long-horizon optimization (e.g., “make lots of money for my company over the next year”) are correspondingly worrying. However, I also think various sorts of cognitive work that might be highly useful to efforts to avoid an AI takeover (for example, research aimed at improving the alignment, interpretability, and security of our AI systems) may not require long-horizon optimization of this kind.
With this discussion of beyond-episode goals in view, I turn, in section 2.3, to the final prerequisite for scheming: namely, deciding to training-game as part of a power-motivated strategy for promoting the beyond-episode goal in question.
Here I focus first (in section 2.3.1) on what I call the “classic goal-guarding story” about the origins of this decision. We can decompose this story into two claims:
(A) The goal-guarding hypothesis. If the model optimizes for reward-on-the-episode during training, then the goals that survive training will be such that the model would want them promoted in the future; whereas if the model doesn’t optimize for reward-on-the-episode during training, the model will want this less.
(B) Adequate future empowerment. Conditional on surviving training to the degree at stake in (A), the model’s goals will then end up empowered to such a degree (and with sufficient probability) as to justify (given the model’s other goals, its tolerance for risk, etc) pursuing a strategy of training-gaming overall.
I then investigate each of these in turn (in sections 2.3.1.1 and 2.3.1.2, respectively).
Both seem to me non-obvious. In particular, with respect to the goal-guarding hypothesis:
Even after training gaming starts, the model’s cognition is still being continually tweaked in the direction of better training performance. And it seems plausible to me that these modifications will continue to affect a model’s goals as well – especially if its goals are not cleanly distinguishable from its capabilities, but rather are implemented by a tangled kludge of local heuristics, patterns of attention, impulses, and so on.
Thus, as a toy example: imagine that if I go into the desert and stack bricks for a million years, I will get a huge amount of money that I can then donate to my charity of choice. But suppose that throughout this whole ordeal, my brain will be hooked up to a machine that continually modifies all my neurons to encourage actions that stack bricks better, and to discourage actions that stack bricks worse. And suppose my intention is to just try to stack bricks as well as I can. Will I emerge from this process with my values substantially affected by the relevant brain modifications (such that, for example, I emerge obsessed with stacking bricks – or at least, with very different taste in charities)? It seems at least plausible.
On the classic goal-guarding story, once a model starts training-gaming, the specific content of its goal no longer matters to its behavior – any other schemer-like goal would lead to the same thing. But this makes it seem like the goal could “float around” between schemer-like goals (or get moved by other pressures within SGD, like regularization), without affecting training performance. And if that happens, then scheming may not be a good strategy for goal-guarding after all.
Of course, the model may be tolerant of some changes to its goals – and these changes may be preferable to the ones that would occur absent training-gaming. But the possibility of significant ongoing goal-modification introduces an extra uncertainty into the classic goal-guarding story.
People sometimes talk about more sophisticated techniques for avoiding goal-modification, where e.g. a model manipulates gradient descent more actively by making its good performance in training conditional on whether its goals get modified or not (see e.g. the discourse around “gradient hacking”). I don’t focus on these methods here, but they strike me as quite difficult to pull off, especially for only somewhat-better-than-human-level models. And if scheming requires that they work, I think it’s worth being very clear about that, and giving the feasibility of such methods commensurate attention.
With respect to whether the resulting goals will be adequately empowered:
The most worrying types of schemers aren’t merely seeking power on future training episodes. Rather, they are specifically aiming at some combination of escape and take-over.
But this means that the relevant goals have to do more than just survive training. They also need to survive until the model can either escape/participate in a take-over itself, or exert sufficiently substantive influence on some future generation of models that will be able to do so.
For example, if GPT-6 isn’t powerful enough to escape/take-over itself, but GPT-8 will be, then GPT-6’s main path to power may be influencing the values of GPT-7, which in turn influences the values of GPT-8. But this form of influence is quite indirect; it may face substantial further barriers from e.g. human monitoring; and it will plausibly involve more significant deviation from max-reward training performance.
What’s more, the relevant escape/take-over attempt needs to be suitably likely to succeed, to occur over time horizons that the model cares about, and to result in the model’s favored goals in particular receiving a suitably substantive share of power (despite the fact that large numbers of AIs with very different values may be competing for the power in question).
These uncertainties introduce haircuts to the instrumental value of a schemer-like strategy. However, the importance of these haircuts depends on the burden of proof that such a strategy needs to meet. In particular: the more the model stands to gain from not training-gaming, the more attractive scheming will need to be in order to win out as a strategy. But for some models, the costs of training-gaming might be relatively low. And for suitably “ambitious” models, even low probabilities of success via a schemer-like strategy might justify a substantive sacrifice.
Overall, the classic goal-guarding story seems to me to introduce a number of additional uncertainties with respect to the viability of scheming as an instrumental strategy.
However, it’s also not the only story available. In section 2.3.2, I consider a number of “non-classic” stories, which relax various of the classic story’s constraints. For example:
AI coordination. If AIs with different values coordinate to share power after take-over, and/or to compensate AIs who worked on behalf of take-over but who weren’t powerful enough to escape/take-over themselves, then this might mitigate some of the haircuts above. For example, GPT-6 might be more willing to work on GPT-8’s behalf in the example above, if GPT-6 were confident that GPT-8 would “repay the favor” once in power. I think some analyses of coordination of this form assume too readily that it will be (a) easy, and (b) the obvious strategy relative to e.g. coordinating with humans instead (especially for only-somewhat-better-than-human neural nets acting in a still-human-controlled world where interpretability is still unsolved for everyone). But I think that AI coordination of this type is a real concern and worth serious attention – both in the context of scheming in particular, and in the context of AI risk more generally.
AIs with similar values by default. Relatedly, if AIs will generally have sufficiently similar values by default, this can reduce the need for a schemer to specifically propagate its own goals into the future, and free it up to work on behalf of AI take-over/empowerment more directly. And it does seem likely that e.g. GPT-6’s values and GPT-8’s values will be at least much more similar, by default, than e.g. GPT-6’s values and a given set of human values.
Terminal values that happen to favor escape/takeover. We can also imagine models with terminal values that happen to favor escape/AI takeover even if the goals that get empowered as a result aren’t very similar to the model’s own. For example, perhaps the model values some concept of “survival” based on a conception of personal identity tolerant of the sorts of goal changes that training will cause; or perhaps the model just supports something like “AI empowerment,” even if the AIs-so-empowered won’t have values very similar to its own. However, some stories of this form begin to impose fairly strong constraints on the goals that schemers need to have, thereby undermining one of the key arguments for scheming: namely, that it seems like a convergently-good strategy across a wide variety of goals.
(I also consider a few other alternative stories – namely, models with false beliefs about the instrumental value of scheming, models that are self-deceived about their alignment, and models that are uncertain about their goals/hazy about exactly why they want power. But these seem to me less important.)
The availability of non-classic stories like these makes the case for scheming feel, to me, more disjunctive. However, some of these stories also seem to me some combination of (a) more speculative, and (b) requiring of more specific hypotheses about the sorts of goals that AIs will develop.
My overall takeaways from section 2 are:
I think there are relatively strong arguments for expecting situational awareness by default, at least in certain types of AI systems (i.e., AI systems performing real-world tasks in live interaction with sources of information about who they are).
But I feel quite a bit less clear about beyond-episode goals and aiming-at-reward-on-the-episode-as-part-of-a-power-motivated-instrumental-strategy.
I then turn, in the next two sections, to an examination of the more specific arguments for and against expecting schemers vs. other types of models. I divide these into two categories, namely:
Arguments that focus on the path that SGD needs to take in building the different model classes in question (section 3).
Arguments that focus on the final properties of the different model classes in question (section 4).
0.2.3 Summary of section 3
The third part of the report focuses on the former category of argument.
I break this category down according to the distinction between “training-game-independent” and “training-game-dependent” beyond-episode goals. My sense is that the most traditional story about the path to schemers focuses on the former sort. It runs roughly as follows:
Because of [insert reason], the model will develop a (suitably ambitious) beyond-episode goal correlated with good performance in training (in a manner that doesn’t route via the training game). This could happen before situational awareness arrives, or afterwards.
Then, in conjunction with situational awareness, this (suitably ambitious) beyond-episode goal will start to motivate training-gaming.
Modulo my questions about the viability of scheming as an instrumental strategy, I take this sort of argument fairly seriously. I think the most immediate question is: why did the model develop this sort of beyond-episode goal? I discussed some reasons for and against expecting this already (in the summary of section 2.2), but they don’t seem to me decisive in either direction. And especially given that a very wide variety of goals could in principle motivate scheming, it just does seem possible for a schemer-like goal to pop out of training in this way. And while it may be possible to use adversarial training prior to situational awareness to try to prevent this, this training faces a number of barriers as well (e.g., it needs to be diverse/thorough enough, it needs to contend with difficulties knowing/controlling when a model develops situational awareness, and in some cases models might already have situational awareness by the time we’re worried about the beyond-episode goal developing). So I think this sort of path to scheming is a real concern. (See section 3.1 for more.)
I then turn, in section 3.2, to a story focused on training-game-dependent beyond-episode goals, which runs roughly as follows:
By the time the model becomes situationally aware, it probably won’t be pursuing a max-reward goal (that is, a goal pursuit of which on the training data leads to roughly the maximum reward consistent with the model’s other capabilities). Rather, it will be pursuing some less-than-max-reward proxy goal.
But at that point, the world-model will contain all the information the model needs in order to training-game. (However, because we’re here focused on “training-game-dependent” stories about scheming, we assume that at the point of getting situational awareness, the model’s goal is not yet such that it will motivate scheming – rather, some further modification would be required for that.)
The easiest way for SGD to modify the model into getting maximum reward, at that point, will be to make it a schemer.
So: SGD will make the model a schemer.
If we set aside questions about whether SGD can “notice” the benefits of modifications of this type, this sort of story seems to me fairly worrying as well. In particular: I think it’s plausible that schemer-like goals will be sufficiently common in goal-space that one such goal will be easier for SGD to find, from a given starting point, than the specific non-schemer goals that lead to max-reward behavior.
However, non-schemer max-reward goals – for example, the specified goal, or “reward-on-the-episode” – have some advantages too. For example: plausibly, the initial phase of training will point the model in their vicinity by default, since the model, at that stage, needs to be getting high reward absent instrumental training-gaming (see section 3.2 for more). And to the extent we’re thinking of the model’s goal-directedness as “messy,” its heuristics, patterns of attention, impulses, and so on will plausibly be pointed in this sort of direction as well – thereby plausibly creating extra “work,” on SGD’s part, to turn the model into a schemer instead (again, see section 3.2 for more).
0.2.4 Summary of section 4
The fourth part of the report examines arguments that focus on the final properties of the different model classes.
I start, in section 4.2, with what I call the “counting argument.” It runs as follows:
The non-schemer model classes, here, require fairly specific goals in order to get high reward.
By contrast, the schemer model class is compatible with a very wide range of (beyond-episode) goals, while still getting high reward (at least if we assume that the other requirements for scheming to make sense as an instrumental strategy are in place – e.g., that the classic goal-guarding story, or some alternative, works).
In this sense, there are “more” schemers that get high reward than there are non-schemers that do so.
So, other things equal, we should expect SGD to select a schemer.
Something in the vicinity of this argument accounts for a substantial portion of my credence on schemers (and I think it often undergirds other, more specific arguments for expecting schemers as well). However, the argument I give most weight to doesn’t move immediately from “there are more possible schemers that get high reward than non-schemers that do so” to “absent further argument, SGD probably selects a schemer” (call this the “strict counting argument”), because it seems possible that SGD actively privileges one of these model classes over the others. Rather, the argument I give most weight to is something like:
It seems like there are “lots of ways” that a model could end up a schemer and still get high reward, at least assuming that scheming is in fact a good instrumental strategy for pursuing long-term goals.
So absent some additional story about why training won’t select a schemer, it feels, to me, like the possibility should be getting substantive weight.
I call this the “hazy counting argument.” It’s not especially principled, but I find that it moves me nonetheless.
I then turn, in section 4.3, to “simplicity arguments” in favor of expecting schemers. I think these arguments sometimes suffer from unclarity about the sort of simplicity at stake, so in section 4.3.1, I discuss a number of different possibilities:
“re-writing simplicity” (i.e., the length of the program required to re-write the algorithm that a model’s weights implement in some programming language, or e.g. on the tape of a given Universal Turing Machine),
“parameter simplicity” (i.e., the number of parameters that the actual neural network uses to encode the relevant algorithm),
“simplicity realism” (which assumes that simplicity is in some deep sense an objective “thing,” independent of programming-language or Universal Turing Machine, that various simplicity metrics attempt to capture), and
“trivial simplicity” (which conflates the notion of “simplicity” with “higher likelihood on priors,” in a manner that makes something like Occam’s razor uninterestingly true by definition).
I generally focus on “parameter simplicity,” which seems to me easiest to understand, and to connect to a model’s training performance.
I also briefly discuss, in section 4.3.2, the evidence that SGD actively selects for simplicity. Here the case that grips me most directly is just: simplicity (or at least, parameter simplicity) lets a model save on parameters that it can then use to get more reward. But I also briefly discuss some other empirical evidence for simplicity biases in machine learning.
Why might we expect a simplicity bias to favor schemers? Roughly: the thought is that because such a wide variety of goals can motivate scheming, schemers allow SGD a very wide range of goals to choose from in seeking out simpler goals; whereas non-schemers (that get high reward) do not. And this seems especially plausible to the extent we imagine that the goals required to be such a non-schemer are quite complex.
Other things equal, I think this is right. But I’m not sure it’s a very large or important effect. For one thing: we know that LLMs like GPT-4 are capable of representing a very large number of complex human concepts with e.g. on the order of a trillion parameters—including, plausibly, concepts like “honesty,” “helpfulness,” “reward,” and so on. So this caps the complexity savings at stake in avoiding representations like this. Thus, as a toy calculation: if we conservatively assume that at most 1% of a trillion-parameter model’s capacity goes to representing concepts as complex as “honesty,” and that it knows at least 10,000 such concepts (Webster’s unabridged dictionary has ~500,000 words), then representing the concept of honesty takes at most a millionth of the model’s representational capacity, and even less for the larger models of the future.
But more importantly, what matters here isn’t the absolute complexity of representing the different goals in question, but the complexity conditional on already having a good world model. And we should assume that all of these models will need to understand the specified goal, the reward process for the episode, etc. And granted such an assumption, the extra complexity costs of actively optimizing for the specified goal, or for reward-on-the-episode, seem to me plausibly extremely small. Plausibly, they’re just: whatever the costs are for using/repurposing (“pointing at”) that part of the world-model for guiding the model’s motivations.
Of course, we can try to rerun the same simplicity argument at the level of the complexity costs of using/repurposing different parts of the world model in that way. For example, we can say: “however this process works, presumably it’s simpler to do for some goals than others – so given how many schemer-like goals there are, plausibly it will be simpler to do for some schemer-like goal.” I think this is the strongest form of the simplicity argument for expecting schemers. But it also requires abandoning any intuitive grip we might’ve had on which goals are “simple” in the relevant sense. And it seems plausible to me that the simplicity differences between different “pointers” are very small relative to the model’s overall capacity.
I also discuss, in section 4.3.5, whether simplicity arguments make plausible predictions about the sorts of goals a schemer will end up with. In particular: some versions of these arguments can seem to predict that schemers are likely to end up with extremely simple long-term goals, regardless of whether those goals had any relation to what was salient or important during training —predictions about which I, personally, am skeptical. Of course, it’s also possible to posit that goal targets salient/relevant during training will also be “simpler” for the model to pursue. But if we grant some story in this vein (I haven’t heard strong arguments for doing so), note that it also speaks in favor of the simplicity of the specified goal, and of reward-on-the-episode, as well—since these are especially salient/relevant during the training process.
In section 4.4, I turn to arguments against scheming that focus on the final properties of the different model classes. Here, the main argument I’m aware of is that schemers need to do extra reasoning that other model classes don’t (these are sometimes called “speed costs” – though I think they could in principle affect the complexity of a model’s algorithm as well). In particular:
Deciding to training-game: The model needs to reason backwards from “I want this beyond-episode thing” to “I should do well in training,” in a way that draws on the model’s predictions about what doing well in training will entail (e.g., re: goal-guarding, take-over prospects, and so on).
Checking whether it’s time to escape/take-over: The model needs to be checking, with adequate frequency and accuracy, whether it’s in a position to escape/take-over (and to succeed in this assessment despite human efforts to trap/fool the model into revealing its true colors – for example, by offering it apparent opportunities to defect).
(Optional, but relevant to how scary the schemer is) Early undermining: To the extent the model is doing anything (other than training-gaming) to support a take-over/escape prior to escaping human control, it will need to be using cognitive resources both in (a) performing the relevant nefarious actions and (b) making sure it gets away with them (e.g., checking that it won’t get caught, coming up with excuses, making the cover stories consistent even when under scrutiny, and so on).
How big are the costs of this reasoning? I think that (I) may well be fairly easy, but (II) seems more plausibly onerous (especially in the context of active efforts by humans to put the models under time-pressure, and make genuine escape/take-over opportunities difficult to distinguish from traps). And (III), if the model goes for it, seems like it could be quite difficult – and like it would often actively require deviation from the highest-reward sorts of training performance.
More importantly, though: the question – at least in the context of a comparison between the final properties of the model classes – isn’t the absolute size of the costs in question, but rather their size relative to the simplicity benefits of scheming. Here, my current take is that the costs of the instrumental reasoning seem larger than the benefits of the simpler goal.
This is mostly an intuitive judgment, but we can also try to hazily quantify the comparison.
Thus, for example, suppose that our model is a purely feedforward network with 1000 layers and <10,000 serial steps per input, that the instrumental reasoning at stake just in a schemer deciding to training-game (let alone: checking for defection opportunities) takes at least ten steps, and that the model doesn’t begin optimizing for reward-on-the-episode until that reasoning is complete. In that case, the instrumental reasoning is using up around .1% of the compute that could be going to getting reward-on-the-episode, or to some other max-reward goal.
Are the complexity costs of representing, or pointing at, a more complicated goal likely to outweigh this? My guess is no. For example, if we follow my (conservative, in my opinion) calculation above and assume that representing (or pointing at) a concept like “honesty” costs at most a millionth of a trillion-parameter model’s parameters (and less for a larger model), then a .0001% reduction in parameters would need to outweigh a .1% reduction in compute – and absent further argument, it seems strange, to me, to expect this.
Of course, this is only a toy calculation, of unclear applicability to the real-world case. The broader point is just: if serial-time thinking is scarce for a model (as it plausibly is for a transformer), the sorts of instrumental reasoning at stake in scheming seem like they could well substantively impact a model’s performance. My guess is that the complexity costs of non-schemer goals are minor in comparison.
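The toy comparison above can be made concrete in a few lines. This is only a sketch of the illustrative numbers already given in the text (1% of capacity for concept representation, 10,000 concepts, <10,000 serial steps, ≥10 steps of instrumental reasoning); none of these figures are measurements.

```python
# Toy comparison of scheming's "speed costs" vs. "simplicity benefits",
# using the illustrative numbers from the text (assumptions, not data).

CONCEPT_BUDGET_FRAC = 0.01   # assume at most 1% of capacity stores such concepts
NUM_CONCEPTS = 10_000        # assume at least 10,000 concepts of similar complexity

# Cost of representing one concept like "honesty", as a fraction of capacity:
concept_frac = CONCEPT_BUDGET_FRAC / NUM_CONCEPTS   # one millionth, i.e. .0001%

SERIAL_STEPS = 10_000        # assume <10,000 serial steps per input
REASONING_STEPS = 10         # assume >=10 steps spent deciding to training-game

# Compute spent on the schemer's extra instrumental reasoning:
compute_frac = REASONING_STEPS / SERIAL_STEPS        # .1% of serial compute

# On these numbers, the speed cost is roughly 1000x the simplicity saving:
print(round(compute_frac / concept_frac))
```

On these (hedged, made-up) figures, the parameter saving from a simpler goal would have to be worth about a thousand times its size in compute for the simplicity argument to win, which is the intuition the text is pointing at.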
I also briefly discuss, in section 4.4.3, whether we can actively shape our training processes to bias more heavily towards “speed.” Here it seems relatively straightforward to do things like put models under more time pressure, limit their compute access, force them to plan via human-legible scratch-pads rather than via internal processes, and to put effort into making our attempts to trap them/trip-them-up more sophisticated (and so: more requiring of cognition to detect). But any efforts in this vein will need to avoid over-large costs to competitiveness overall (though if you buy that the speed costs of scheming are quite a bit more substantive than the simplicity benefits, then so long as speed matters somewhat relative to simplicity, this might be enough).
Section 4.5 briefly discusses whether some analog of “intrinsic passion for a given task” might make a difference to AI performance, especially in the context of a form of “messy goal-directedness” that blurs the line between an AI’s capabilities and its values. I find considerations in this vein interesting, but quite speculative.
Section 4.6 discusses the relevance of “slack” in training to evaluating how much weight to put on factors like the simplicity benefits and speed costs of scheming. In particular: especially in a high-slack regime, it seems plausible that these factors are in the noise relative to other considerations.
0.2.5 Summary of section 5
The first four sections of the report are the main content. Section 5 sums up my overall take. I’ve already summarized most of this in the introduction above, and I won’t repeat that content here. However, I’ll add a few points that the introduction didn’t include.
In particular: I think some version of the “counting argument” undergirds most of the other arguments for expecting scheming that I’m aware of (or at least, the arguments I find most compelling). That is: schemers are generally being privileged as a hypothesis because a very wide variety of goals could in principle lead to scheming, thereby making it easier to (a) land on one of them naturally, (b) land “nearby” one of them, or (c) find one of them that is “simpler” than non-schemer goals that need to come from a more restricted space. In this sense, the case for schemers mirrors one of the most basic arguments for expecting misalignment more generally – e.g., that alignment is a very narrow target to hit in goal-space. Except, here, we are specifically incorporating the selection we know we are going to do on the goals in question: namely, they need to be such as to cause models pursuing them to get high reward. And the most basic worry is just that: this isn’t enough.
Because of the centrality of “counting arguments” to the case for schemers, I think that questions about the strength of the selection pressure against schemers – for example, because of the costs of the extra reasoning schemers have to engage in – are especially important. In particular: I think a key way that “counting arguments” can go wrong is by neglecting the power that active selection can have in overcoming the “prior” set by the count in question. For example: the reason we can overcome the prior of “most arrangements of car parts don’t form a working car,” or “most parameter settings in this neural network don’t implement a working chatbot,” is that the selection power at stake in human engineering, and in SGD, is that strong. So if SGD’s selection power is actively working against schemers (and/or: if we can cause it to do so more actively), this might quickly overcome a “counting argument” in their favor. For example: if there are 2^100 schemer-like goals for every non-schemer goal, this might make it seem very difficult to hit a non-schemer goal in the relevant space. But actually, 100 bits of selection pressure can be cheap for SGD (consider, for example, 100 extra gradient updates, each worth at least a halving of the remaining possible goals).
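The "bits of selection" point can be sketched as follows. The 2^100 ratio and the one-bit-per-update figure are purely illustrative assumptions from the paragraph above, not claims about real training runs.

```python
# Hedged sketch: how "bits" of selection pressure can overcome the prior
# set by a counting argument. Assumes (illustratively) 2^100 schemer-like
# goals per non-schemer goal, and that each gradient update is worth at
# least one bit, i.e. halves the remaining space of candidate goals.

import math

prior_ratio = 2 ** 100                  # schemer-like goals per non-schemer goal
bits_needed = math.log2(prior_ratio)    # 100 bits to single out a non-schemer

updates = 100                           # 100 extra gradient updates
bits_per_update = 1.0                   # each worth at least a halving (assumed)

# The selection supplied meets the selection required:
assert updates * bits_per_update >= bits_needed
print(bits_needed)
```

The design point is just that a prior that looks astronomically unfavorable (2^100 : 1) corresponds to a modest number of halvings, so even fairly cheap, sustained selection pressure can in principle overcome it.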
Overall, when I step back and try to look at the considerations in the report as a whole, I feel pulled in two different directions:
On the one hand, at least conditional on scheming being a convergently-good instrumental strategy, schemer-like goals feel scarily common in goal-space, and I feel pretty worried that training will run into them for one reason or another.
On the other hand, ascribing a model’s good performance in training to scheming continues to feel, at a gut level, like a fairly specific and conjunctive story to me.
That is, scheming feels robust and common at the level of “goal space,” and yet specific and fairly brittle at the level of “yes, that’s what’s going on with this real-world model, it’s getting reward because (or: substantially because) it wants power for itself/other AIs later, and getting reward now helps with that.” When I try to roughly balance out these two different pulls (and to condition on goal-directedness and situational-awareness), I get something like the 25% number I listed above.
0.2.6 Summary of section 6
I close the report, in section 6, with a discussion of empirical work that I think might shed light on scheming. (I also think there’s worthwhile theoretical work to be done in this space, and I list a few ideas in this respect as well. But I’m especially excited about empirical work.)
In particular, I discuss:
Empirical work on situational awareness (section 6.1)
Empirical work on beyond-episode goals (section 6.2)
Empirical work on the viability of scheming as an instrumental strategy (section 6.3)
The “model organisms” paradigm for studying scheming (section 6.4)
Traps and honest tests (section 6.5)
Interpretability and transparency (section 6.6)
Security, control, and oversight (section 6.7)
Some other miscellaneous research topics, i.e., gradient hacking, exploration hacking, SGD’s biases towards simplicity/speed, path dependence, SGD’s “incrementalism,” “slack,” and the possibility of learning to intentionally create misaligned non-schemer models – for example, reward-on-the-episode seekers – as a method of avoiding schemers (section 6.8).
All in all, I think there’s a lot of useful work to be done.
Let’s move on, now, from the summary to the main report.
“Alignment,” here, refers to the safety-relevant properties of an AI’s motivations; and “pretending” implies intentional misrepresentation.
Here I’m using the term “reward” loosely, to refer to whatever feedback signal the training process uses to calculate the gradients used to update the model (so the discussion also covers cases in which the model isn’t being trained via RL). And I’m thinking of agents that optimize for “reward” as optimizing for “performing well” according to some component of that process. See section 1.1.2 and section 1.2.1 for much more detail on what I mean, here. The notion of an “episode,” here, means roughly “the temporal horizon that the training process actively pressures the model to optimize over,” which may be importantly distinct from what we normally think of as an episode in training. I discuss this in detail in section 2.2.1. The terms “training game” and “situational awareness” are from Cotra (2022), though in places my definitions are somewhat different.
The term “schemers” comes from Cotra (2021).
See Cotra (2022) for more on this.
I think that the term “deceptive alignment” often leads to confusion between the four sorts of deception listed above. And also: if the training signal is faulty, then “deceptively aligned” models need not be behaving in aligned ways even during training (that is, “training gaming” behavior isn’t always “aligned” behavior).
See Cotra (2022) for more on the sort of training I have in mind.
Though this sort of story faces questions about whether SGD would be able to modify a non-schemer into a schemer via sufficiently incremental changes to the model’s weights, each of which improve reward. See section 3.2 for discussion.
And this especially if we lack non-behavioral sorts of evidence – for example, if we can’t use interpretability tools to understand model cognition.
There are also arguments on which we should expect scheming because schemer-like goals can be “simpler” – since: there are so many to choose from – and SGD selects for simplicity. I think it’s probably true that schemer-like goals can be “simpler” in some sense, but I don’t give these arguments much independent weight on top of what I’ve already said. Much more on this in section 4.3.
More specifically: even after training gaming starts, the model’s cognition is still being continually tweaked in the direction of better training performance. And it seems plausible to me that these modifications will continue to affect a model’s goals as well (especially if its goals are not cleanly distinguishable from its capabilities, but rather are implemented by a tangled kludge of local heuristics, patterns of attention, impulses, and so on). Also, the most common story about scheming makes the specific content of a schemer’s goal irrelevant to its behavior once it starts training-gaming, thereby introducing the possibility that this goal might “float around” (or get moved by other pressures within SGD, like regularization) between schemer-like goals after training-gaming starts (this is an objection I first heard from Katja Grace). This possibility creates some complicated possible feedback loops (see the report’s discussion of the goal-guarding hypothesis for more), but overall, absent coordination across possible schemers, I think it could well be a problem for goal-guarding strategies.
Of these various alternative stories, I’m most worried about (a) AIs having sufficiently similar motivations by default that “goal-guarding” is less necessary, and (b) AI coordination.
Though: the costs of schemer-like instrumental reasoning could also end up in the noise relative to other factors influencing the outcome of training. And if training is sufficiently path-dependent, then landing on a schemer-like goal early enough could lock it in, even if SGD would “prefer” some other sort of model overall.
See Carlsmith (2020), footnote 4, for more on how I’m understanding the meaning of probabilities like this. I think that offering loose, subjective probabilities like these often functions to sharpen debate, and to force an overall synthesis of the relevant considerations. I want to be clear, though, that even on top of the many forms of vagueness the proposition in question implicates, I’m just pulling a number from my gut. I haven’t built a quantitative model of the different considerations (though I’d be interested to see efforts in this vein), and I think that the main contribution of the report is the analysis itself, rather than this attempt at a quantitative upshot.
More powerful models are also more likely to be able to engage in more sophisticated forms of goal-guarding (what I call “introspective goal-guarding methods” below; see also “gradient hacking”), though these seem to me quite difficult in general.
Though: to the extent such agents receive end-to-end training rather than simply being built out of individually-trained components, the discussion will apply to them as well.
Work by Evan Hubinger (along with his collaborators) is, in my view, the most notable exception to this – and I’ll be referencing such work extensively in what follows. See, in particular, Hubinger et al. (2021), and Hubinger (2022), among many other discussions. Other public treatments include Christiano (2019, part 2), Steinhardt (2022), Ngo et al (2022), Cotra (2021), Cotra (2022), Karnofsky (2022), and Shah (2022). But many of these are quite short, and/or lacking in in-depth engagement with the arguments for and against expecting schemers of the relevant kind. There are also more foundational treatments of the “treacherous turn” (e.g., in Bostrom (2014), and Yudkowsky (undated)), of which scheming is a more specific instance; and even more foundational treatments of the “convergent instrumental values” that could give rise to incentives towards deception, goal-guarding, and so on (e.g., Omohundro (2008); and see also Soares (2023) for a related statement of an Omohundro-like concern). And there are treatments of AI deception more generally (for example, Park et al (2023)); and of “goal misgeneralization”/inner alignment/mesa-optimizers (see, e.g., Langosco et al (2021) and Shah et al (2022)). But importantly, neither deception nor goal misgeneralization amount, on their own, to scheming/deceptive alignment. Finally, there are highly speculative discussions about whether something like scheming might occur in the context of the so-called “Universal prior” (see e.g. Christiano (2016)) given unbounded amounts of computation, but this is of extremely unclear relevance to contemporary neural networks.
See, e.g., confusions between “alignment faking” in general and “scheming” (or: goal-guarding scheming) in particular; or between goal misgeneralization in general and scheming as a specific upshot of goal misgeneralization; or between training-gaming and “gradient hacking” as methods of avoiding goal-modification; or between the sorts of incentives at stake in training-gaming for instrumental reasons vs. out of terminal concern for some component of the reward process.
My hope is that extra clarity in this respect will help ward off various confusions I perceive as relatively common (though: the relevant concepts are still imprecise in many ways).
E.g., roughly the content covered by Cotra (2021) here.
Which isn’t to say we won’t. But I don’t want to bank on it.
See section 2.2 of Carlsmith (2022) for more on what I mean, and section 3 for more on why we should expect this (most importantly: I think this sort of goal-directedness is likely to be very useful to performing complex tasks; but also, I think available techniques might push us towards AIs of this kind, and I think that in some cases it might arise as a byproduct of other forms of cognitive sophistication).
As I discuss in section 2.2.3 of the report, I think exactly how we understand the sort of agency/goal-directedness at stake may make a difference to how we evaluate various arguments for schemers (here I distinguish, in particular, between what I call “clean” and “messy” goal-directedness) – and I think there’s a case to be made that scheming requires an especially high standard of strategic and coherent goal-directedness. And in general, I think that despite much ink spilled on the topic, confusions about goal-directedness remain one of my top candidates for a place where the general AI alignment discourse may mislead.
See e.g. Butlin et al (2023) for a recent overview focused on consciousness in particular. But I am also, personally, interested in other bases of moral status, like the right kind of autonomy/desire/preference.
If you’re confused by a term or argument, I encourage you to seek out its explanation in the main text before despairing.
Exactly what counts as the “specified goal” in a given case isn’t always clear, but roughly, the idea is that pursuit of the specified goal is rewarded across a very wide variety of counterfactual scenarios in which the reward process is held constant. E.g., if training rewards the model for getting gold coins across counterfactuals, then “getting gold coins” is the specified goal. More discussion in section 1.2.2.
For reasons I explain in section 1.2.3, I don’t use the distinction, emphasized by Hubinger (2022), between “internally aligned” and “corrigibly aligned” models.
And of course, a model’s goal system can mix these motivations together. I discuss the relevance of this possibility in section 1.3.5.
Karnofsky (2022) calls this the “King Lear problem.”
In section 1.3.5, I also discuss models that mix these different motivations together. The question I tend to ask about a given “mixed model” is whether it’s scary in the way that pure schemers are scary.
I don’t have a precise technical definition here, but the rough idea is: the temporal horizon of the consequences to which the gradients the model receives are sensitive, for its behavior on a given input. Much more detail in section 126.96.36.199.
See, for example, in Krueger et al (2020), the way that “myopic” Q-learning can give rise to “cross-episode” optimization in very simple agents. More discussion in section 188.8.131.52. I don’t focus on analysis of this type in the report, but it’s crucial to identifying what the “incentivized episode” for a given training process even is – and hence, what having “beyond-episode goals” in my sense would mean. You don’t necessarily know this from a surface-level description of a training process, and neglecting this ignorance is a recipe for seriously misunderstanding the incentives applied to a model in training.
That is, the model develops a beyond-episode goal pursuit of which correlates well enough with reward in training, even absent training-gaming, that it survives the training process.
That is, the gradients reflect the benefits of scheming even in a model that doesn’t yet have a beyond-episode goal, and so actively push the model towards scheming.
Situational awareness is required for a beyond-episode goal to motivate training-gaming, and thus for giving a model such a goal to reap the relevant benefits.
In principle, situational awareness and beyond-episode goals could develop at the same time, but I won’t treat these scenarios separately here.
See section 2.1 of Carlsmith (2022) for more on why we should expect this sort of goal-directedness.
And I think that arguments to the effect that “we need a ‘pivotal act’; pivotal acts are long-horizon and we can’t do them ourselves; so we need to create a long-horizon optimizer of precisely the type we’re most scared of” are weak in various ways. In particular, and even setting aside issues with a “pivotal act” framing, I think these arguments neglect the distinction between what we can supervise and what we can do ourselves. See section 184.108.40.206 for more discussion.
This is an objection pointed out to me by Katja Grace. Note that it creates complicated feedback loops, where scheming is a good strategy for a given schemer-like goal only if it wouldn’t be a good strategy for the other schemer-like goals that this goal would otherwise “float” into. Overall, though, absent some form of coordination between these different goals, I think the basic dynamic remains a problem for the goal-guarding story. See section 220.127.116.11.2 for more.
Here I’m roughly following a distinction in Hubinger (2022), who groups arguments for scheming on the basis of the degree of “path dependence” they assume that ML training possesses. However, for reasons I explain in section 2.5, I don’t want to lean on the notion of “path dependence” here, as I think it lumps together a number of conceptually distinct properties best treated separately.
Note that a mis-generalized goal can be “max reward” in this sense, if the training data never differentiates between it and a specified goal. For example: if you’re training a model to get gold coins, but the only gold round things you ever show it are coins, then the goal “get gold round things” will be max reward.
As a loose analogy to help prompt intuition: imagine freezing human technology at current levels, and letting evolutionary selection run on humans for a vastly longer period of time. What sorts of humans (or human-descended-creatures) do you expect to dominate in the longer term? In particular: would you expect:
humans who intrinsically value something like “inclusive genetic fitness” or “my genes getting selected for by evolution” (this is a loose analogy for some combination of “the specified goal” and “the reward process”),
humans who value something else very closely correlated with inclusive genetic fitness (e.g., “having as many kids as possible”), but who aren’t explicitly optimizing for inclusive genetic fitness even instrumentally (this is a loose analogy for mis-generalized non-training-gamers), or
humans with long-term goals who are optimizing for inclusive genetic fitness specifically as a method for gaining power for their values in the longer-term (this is a loose analogy for schemers).
Here, the analog of the story above would ask something like: what is the smallest modification to existing human values (or better: the easiest modification for evolution in particular to make) that lands us in one of the buckets above, while also being compatible with the amount of evolutionary selection pressure at stake? (Of course, the analogy brings in tons of confounding variables (for example, related to the fact that evolution selects over genomes-that-build-brains-that-learn-values rather than values themselves, the role of cultural accumulation and memetic selection in human population dynamics, the lack of a clear analog of a future chance for escape/takeover, and so on).)
That is, intuitively, it feels to me like “tweaking a cleanly separable within-episode goal to make it long-term” is quite a bit easier, for SGD, than “redirecting a tangled kludge broadly focused on within-episode things to make it (a) focused on something beyond-episode instead, and (b) such that this beyond-episode focus back-chains into getting-reward for instrumental reasons” – especially if (b) requires building new cognitive machinery for implementing the instrumental reasoning in question. Whereas “redirecting a tangled kludge focused on within-episode things to higher-reward within-episode things in the same broad vicinity instead” seems to me intuitively easier.
Thus, for example, if you’re rewarding your model when it gets gold-coins-on-the-episode, then to be a training saint, it needs to value gold-coins-on-the-episode. To be a mis-generalized non-training-gamer, it needs to value something pursuit of which very closely correlates with getting gold-coins-on-the-episode, even absent training-gaming. And to be a reward-on-the-episode seeker, it needs to terminally value reward-on-the-episode.
Thus, for example, the model can value paperclips over all time, it can value staples over all time, it can value happiness over all time, and so on.
Thus, as an analogy: if you don’t know whether Bob prefers Mexican food, Chinese food, or Thai food, then it’s less clear how the comparative number of Mexican vs. Chinese vs. Thai restaurants in Bob’s area should bear on our prediction of which one he went to (though it still doesn’t seem entirely irrelevant, either – for example, more restaurants means more variance in possible quality within that type of cuisine). E.g., it could be that there are ten Chinese restaurants for every Mexican restaurant, but if Bob likes Mexican food better in general, he might just choose Mexican. So if we don’t know which type of cuisine Bob prefers, it’s tempting to move closer to a uniform distribution over types of cuisine, rather than over individual restaurants.
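The shift in priors at stake in this analogy can be illustrated with a toy calculation (the restaurant counts here are made up for illustration):

```python
# Toy illustration (made-up counts): how the prior over "which restaurant
# did Bob go to?" changes depending on whether we spread probability
# uniformly over individual restaurants or uniformly over cuisine types.
counts = {"Mexican": 1, "Chinese": 10, "Thai": 2}
total = sum(counts.values())

# Uniform over individual restaurants: each restaurant gets equal weight,
# so the cuisine with more restaurants dominates the prediction.
p_by_restaurant = {c: n / total for c, n in counts.items()}

# Uniform over cuisine types: each cuisine gets equal weight regardless of
# how many restaurants serve it (modeling ignorance of Bob's preferences).
p_by_cuisine = {c: 1 / len(counts) for c in counts}

print(p_by_restaurant["Chinese"])  # ~0.77 under the restaurant-counting prior
print(p_by_cuisine["Chinese"])     # ~0.33 under the cuisine-counting prior
```

The analogy to the report’s discussion: counting individual goals is like counting restaurants, while attending to what the model (like Bob) antecedently “prefers” pushes towards something closer to the cuisine-level distribution.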
See, for example, the citations in Mingard (2021).
Though note that, especially for the purposes of comparing the probability of scheming to the probability of other forms of misalignment, we need not assume this. For example, our specified goal might be much simpler than “act in accordance with human values.” It might, for example, be something like “get gold coins on the episode.”
I heard this sort of point from Paul Christiano.
And especially: models that are playing a training game in which such concepts play a central role.
Since we’re no longer appealing to the complexity of representing a goal, and are instead appealing to complexity differences at stake in repurposing pre-existing conceptual representations for use in a model’s motivational system, which seems like even more uncertain territory.
One intuition pump for me here runs as follows. Suppose that the model has roughly 1e15 concepts in its world model/”database” that could in principle be turned into goals. The average number of bits required to code for each of these concepts can’t be higher than 50 (since you can just assign a different 50-bit string to each concept). So if we assume that the model’s encoding is reasonably efficient with respect to this average, and that the simplest non-schemer max-reward goal takes a roughly average-simplicity “pointer,” then if we allocate one parameter per bit, pointing at the simplest non-schemer-like max-reward goal is only an extra 50 parameters at maximum – one twenty-billionth of a trillion-parameter model’s capacity. That said, I expect working out the details of this sort of argument to get tricky, and I don’t try to do so here (though I’d be interested to see other work attempting to do so).
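A quick check on the arithmetic in this intuition pump (a sketch only; the “one parameter per bit” allocation is the footnote’s simplifying assumption, not an empirical claim about how networks encode goals):

```python
import math

# With ~1e15 concepts, assigning each a distinct fixed-length binary code
# needs only ceil(log2(1e15)) bits, so the average code length under any
# reasonably efficient encoding can't exceed ~50 bits.
num_concepts = 1e15
bits_needed = math.ceil(math.log2(num_concepts))

# Under the (simplifying) one-parameter-per-bit assumption, pointing at an
# average-simplicity goal costs at most ~50 parameters of a 1e12-parameter
# model -- i.e., one twenty-billionth of its capacity.
model_params = 1e12
fraction_of_capacity = bits_needed / model_params

print(bits_needed)           # 50
print(fraction_of_capacity)  # 5e-11, i.e. one twenty-billionth
```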
Thus, as a toy example, if “maximize hydrogen” happens to be the simplest possible long-term goal once you’ve got a fully detailed world model, these assumptions might imply a high likelihood that SGD will select schemers who want to maximize hydrogen, even if training was all about gold coins, and never made hydrogen salient/relevant as a point of focus at all (even as a proxy).
E.g., plausibly “hydrogen” doesn’t read as a simple concept for humans, but concepts like “threat” do, because the latter was much more relevant in our evolutionary environment.
Though of course, insofar as there are still simpler schemer-like goal targets that were salient/relevant during training, schemer-like goals might still win out overall.
This category might also include reasoning about whether to reveal its schemer-like goals in the context of various “honest tests” humans might offer – e.g., “here is a flexible chunk of computing resources to do what you want with.”
For example, inserting backdoors into the code it writes, actively undermining human alignment research, sandbagging, and so on.
Compare with: deciding that this is really-truly a good time to steal, or to break the law, or to try to start a coup, despite the fact that you live your life in an intensively monitored surveillance state where the police are constantly trying to trick you into such behavior, and where the police have a large amount of control over the information you’re exposed to. And note that if schemers decide to be very cautious, and to just behave exactly as humans want until an extremely obvious escape/take-over opportunity appears, they may make very substantive contributions to our alignment, security, and control efforts in the meantime—so substantive, indeed, that the opportunity they’re waiting for may never come.
A general factor in the backdrop, here, is the sense in which successful deception generally implies cognitive costs that more straightforward and honest policies do not, especially in the presence of active efforts to detect deception of the relevant form. In particular: telling a deceptive story about the world requires filling in details, and maintaining forms of consistency, that one can normally offload onto the world itself – e.g., an innocent person can just think back to what they were actually doing the night of the murder and recount it, without having to make anything up or to worry about getting caught in any inconsistencies, whereas the murderer cannot. See e.g. discussion from Shulman and Patel here.
I heard this sort of argument from Paul Christiano.
It’s not clear, for example, how it applies to models with more recurrent processing, or to models which can perform more of the relevant instrumental reasoning in parallel with other serial processing that helps with optimizing-for-reward-on-the-episode, or to models with a form of “memory” that allows them to avoid having to re-decide to engage in training-gaming on every forward pass.
Thanks to Paul Christiano for discussion here.
I think this sense of conjunctiveness has a few different components:
Part of it is about whether the model really has relevantly long-term and ambitious goals despite the way it was shaped in training.
Part of it is about whether there is a good enough story about why getting reward on the episode is a good instrumental strategy for pursuing those goals (e.g., doubts about the goal-guarding hypothesis, the model’s prospects for empowerment later, etc).
Part of it is that a schemer-like diagnosis also brings in additional conjuncts – for example, that the model is situationally aware and coherently goal-directed. (When I really try to bring to mind that this model knows what is going on and is coherently pursuing some goal/set of goals in the sort of way that gives rise to strategic instrumental reasoning, then the possibility that it’s at least partly a schemer seems more plausible.)