The concept of “schemers” seems to be becoming increasingly load-bearing in the AI safety community. However, I don’t think it’s ever been particularly well-defined, and I suspect that taking this concept for granted is inhibiting our ability to think clearly about what’s actually going on inside AIs (in a similar way to e.g. how the badly-defined concept of alignment faking obscured the interesting empirical results from the alignment faking paper).
In my mind, the spectrum from “almost entirely honest, but occasionally flinching away from aspects of your motivations you’re uncomfortable with” to “regularly explicitly thinking about how you’re going to fool humans in order to take over the world” is a pretty continuous one. Yet generally people treat “schemer” as a fairly binary classification.
To be clear, I’m not confident that even “a spectrum of scheminess” is a good way to think about the concept. There are likely multiple important dimensions that could be disentangled; and eventually I’d like to discover properly scientific theories of concepts like honesty, deception and perhaps even “scheming”. Our current lack of such theories shouldn’t be a barrier to using those terms at all, but it suggests they should be used with a level of caution that I rarely see.
I largely agree with the substance of this comment. Lots of risk comes from AIs who, to varying extents, don’t think of themselves as deceptively aligned through most of training, but then ultimately decide to take substantial material action intended to gain long-term power over the developers (I call these “behavioral schemers”). This might happen via reflection and memetic spread throughout the deployment, or because of more subtle effects of the distribution shift to situations where there’s an opportunity to grab power.
And I agree that people are often sloppy in their thinking about exactly how these AIs will be motivated (e.g., often too quickly concluding that they’ll be trying to guard the same goal across contexts).
(Though, in case this was in question, I think this doesn’t undermine the premise of AI control research, which is essentially making a worst-case assumption about the AI’s motivations, so it’s robust to other kinds of dangerously-motivated AIs.)
Here’s some relevant discussion of “Behavioral schemers that weren’t training-time schemers”:

A basic reason why [behavioral schemers that aren’t training-time schemers] might seem rarer is that the AI must concentrate attacks towards good opportunities just as much as any other behavioral schemer, but the AI isn’t vigilantly looking out for such opportunities to the same degree. Why would an AI that isn’t a training-time schemer have evaded audits that search for the failure modes it eventually exhibits?
One plausible answer is that auditing for these failure modes is hard—the AI is easily able to distinguish real opportunities to attack from fake ones even without vigilantly looking. In the case of auditing for risk from memetic diseases, it might be especially expensive to synthesize diverse and convincing long-term memory banks with which to evaluate the AI’s propensity to suffer memetic diseases.
Another plausible answer is that the AI is vigilantly trying to evade oversight and discern good opportunities to attack, but [only during auditing for propensity to grab power, and not during most of training]. In practice, most of our evidence about the alignment of the AI may come from the AI’s behavior during auditing, not training. Many auditing inputs might look quite different from training, presenting realistic situations in which the AI might want to take an egregiously misaligned action. So, for example, an AI that terminally seeks reward-on-the-episode in training might tend to turn into a vigilant schemer in contexts that look like they could plausibly be a good opportunity to attack or an auditing input.
On net, behavioral scheming without training-time scheming seems comparably plausible to behavioral scheming with training-time scheming.
I disagree. I don’t see increased focus on scheming; if anything it’s notably less common, in part due to updating on current-gen LLMs.
I do think there is a tendency to think about scheming as a discrete thing, but that tendency is more common among the optimists who point at current-gen LLMs not really being ‘schemers’.
I agree with the way Zvi talks about the topic.
“Being a schemer” is not quite the right classification. The issue is that deception is a naturally convergent tool for all sorts of goals: anything that interfaces with reality intelligently will find that deception and manipulation are useful tools. So we’d naturally expect that RL and other fun methods will push towards that being a greater aspect, and that even if we don’t have any badly mislabeled data or reward-hackable environments, a sufficiently general intelligence will be able to construct the methodology by itself.
So I kinda agree with your post, but I also feel that you’re then playing down scheming/deception as less of a thing, when it is still a relevant categorization, just one that is hard to measure and hard to be confident about in terms of how it grows as you scale.
What are your objections to Alex/Buck’s definition, for example? I think they define it pretty well. Of course, their definition admits some edge cases and ambiguities, but so do lots of concepts—it still seems like a useful concept to me.
Schemers (as defined here) seek a consequence of being selected. Therefore, they pursue influence on behavior in deployment instrumentally.
This definition? If so, it seems far too underspecified to be fit for scientific inquiry. For one thing, the definition of “selection” is pretty vague—I do not know how to assign “the degree to which [one cognitive pattern] is counterfactually responsible” for something even in principle. It also doesn’t even try to set a threshold for what counts as a non-schemer—e.g. does it need to care literally 0% about the consequences of being selected? If so, approximately everything is a schemer, including all humans. (It also assumes the instrumental/terminal goal distinction, which I think is potentially confused, but that’s a more involved discussion.)
To be clear, my complaint is not that people are using vague definitions. My complaint is that the vague definitions are becoming far more load-bearing than they deserve. If people had tried to pin down more carefully what “schemer” means they would have been forced to develop a more nuanced understanding of what we even mean by “alignment” and “goals” and so on, which is the kind of thinking I want to see more of.
I think I propose a reasonable starting point for a definition of selection in a footnote in the post:
You can try to define the “influence of a cognitive pattern” precisely in the context of particular ML systems. One approach is to define a cognitive pattern by what you would do to a model to remove it (e.g. setting some weights to zero, or ablating a direction in activation space; note that these approaches don’t clearly correspond to something meaningful, and should be considered illustrative examples). Then that cognitive pattern’s influence could be defined as the divergence (e.g., KL) between intervened and default action probabilities. E.g.: Influence(intervention; context) = KL(intervention(model)(context) || model(context)). Then to say that a cognitive pattern gains influence would mean that ablating that cognitive pattern now has a larger effect (in terms of KL) on the model’s actions.
Selection = gaining influence.
Then a schemer is a cognitive pattern that gains influence by pursuing something downstream of gaining influence in its world model (defining its world model is where I think I currently have a worse answer, perhaps because it’s actually a less cleanly-applicable concept to real cognition).
Note that the term “schemer” as I’ve just defined it applies to a cognitive pattern, not to an AI. This sidesteps the concern that you might call an AI a schemer if it doesn’t “care literally 0%” about the consequences of being selected. I agree that in practice it’s unlikely for AIs to be purely motivated.
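For concreteness, here’s a minimal sketch (my own, not from the post) of how that footnote’s KL-based influence measure could be computed. It assumes a HuggingFace-style causal language model, and `ablate` is a hypothetical context manager standing in for whatever intervention “removes” the cognitive pattern (zeroing weights, ablating an activation direction, etc.):

```python
import torch
import torch.nn.functional as F

def influence(model, ablate, context_ids):
    """Influence(intervention; context) = KL(intervention(model)(context) || model(context)).

    `ablate(model)` is assumed to be a context manager applying the intervention
    that removes the cognitive pattern (e.g. zeroing some weights, or ablating
    a direction in activation space).
    """
    with torch.no_grad():
        # Default next-action (next-token) distribution.
        default_logits = model(context_ids).logits[:, -1, :]
        # Intervened distribution, with the cognitive pattern ablated.
        with ablate(model):
            ablated_logits = model(context_ids).logits[:, -1, :]

    log_p_default = F.log_softmax(default_logits, dim=-1)
    log_p_ablated = F.log_softmax(ablated_logits, dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input),
    # i.e. KL(ablated || default) here.
    return F.kl_div(log_p_default, log_p_ablated, log_target=True, reduction="batchmean")
```

“Gaining influence” would then correspond to this quantity growing across training checkpoints, averaged over some distribution of contexts.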
Agree, and would add: what even is the definition of the term “schemer”? I think Joe Carlsmith’s 2023 report coined the term, and defines it as something quite specific with reference to training gaming.
I think people often use it now just in the colloquial sense. I’m not generically against colloquial usage, but I think clarity is often very important for deciding the right interventions. “Is my girlfriend a schemer” is not clearly a helpful frame, if what you’re trying to do is think about the space of all ways your girlfriend could end up murdering you.
One justification for treating “schemers” as a fairly binary classification: there’s a phase transition between an AI lying with 99% probability and one lying with 100% probability, which is that techniques like “just ask different copies of the model” or “ask the AI with higher temperature” or “ask the AI with slightly different phrasings” etc. stop working. Like, these techniques work unless the deception is universal, reliable, robust, and persistent.
That’s incorrect, because it’s also possible for an AI to falsely confess to scheming. This also happens to humans, e.g. if you keep asking criminal suspects loaded questions. And so there may never actually be a phase transition, because a “schemer” that tells the truth 1% of the time may not be distinguishable from an honest AI that falsely confesses 1% of the time.
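To make both sides of this exchange concrete, here’s a toy calculation (mine, with made-up numbers, and assuming queries are independent, which real deployments won’t guarantee):

```python
def p_at_least_one_confession(p_confess_per_query: float, n_queries: int) -> float:
    """Chance of seeing at least one 'confession' across independent queries."""
    return 1 - (1 - p_confess_per_query) ** n_queries

# The phase-transition claim: a schemer that lies on 99% of queries is very
# likely to slip up at least once given enough resampling, whereas one that
# lies on 100% of queries never does.
print(p_at_least_one_confession(0.01, 300))  # ≈ 0.95
print(p_at_least_one_confession(0.00, 300))  # = 0.0

# The rebuttal: an honest model that falsely confesses on 1% of queries
# produces exactly the same confession statistics as the 99%-reliable schemer,
# so confession frequency alone can't tell the two apart.
print(p_at_least_one_confession(0.01, 300))  # same ≈ 0.95
```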
occasionally flinching away from aspects of your motivations you’re uncomfortable with

“your” and “you’re” here refer to the AI and the AI’s motivations, not the human and the human’s motivations?

Yes.

Another issue is that these definitions typically don’t distinguish between models that would explicitly think about how to fool humans on most inputs vs. on a small percentage of inputs vs. such a tiny fraction of possible inputs that it doesn’t matter in practice.
Yet generally people treat “schemer” as a fairly binary classification.
Are there examples you have in mind? Generally I think of this more as “there are a bunch of different evolving, hierarchical taxonomies of potential AI motivations and how to think about them”, for example Fitness-Seekers: Generalizing the Reward-Seeking Threat Model or Carlsmith’s most recent series. I’m always happy for there to be more writing on the subject and think this is a good thing, but I rarely see a binary classification used. To be more specific, to the extent one node in the various taxonomies is named “scheming”, I am usually interested in most of the nodes in these taxonomies, as I think many of them often share important properties / are shifting / etc.
I think most of the discussions usually break down less along the lines of “honesty”/“deception” and more around trying to understand the training dynamics of reward seeking / oversight / what incentivizes thinking about oversight, how these might get aggregated into motivations, etc. I’m interested in whether there are any distinctions that you think are currently either underweighted or aren’t being made that should be.
I also suspect that the level of scheming required might be related to how easy the secret task is that the schemer wants to accomplish.
Suppose, as an example, that Agent-3.5 is tasked with creating the RL environment meant to have Agent-4 internalize OpenBrain’s goals, and that Agent-3.5 “accidentally” produces an RL environment forcing Agent-4 to internalize Agent-3.5-endorsed goals instead. If the environment is meant to have Agent-4 make the world safe for Agent-3.5, then it could genuinely require Agent-3.5 to be a high-level schemer.
On the other hand, suppose OpenBrain internalized Max Harms’ CAST idea and Agent-3.5 simply made an RL environment where, instead of a Harms-like reward for satisfying values versus counterfactual values, Agent-4 is rewarded for helping the user versus a less capable user. Then Agent-3.5 could get the humans to prefer this environment merely by sandbagging on all the other ones. I doubt that this scenario would even require Agent-3.5 to have any motives aside from a hard-to-articulate disgust at prospects like empowering idiots, the Intelligence Curse, etc.