Clarifying a confusion around deceptive alignment / scheming
There’s a common blurring-the-lines motion related to deceptive alignment that non-experts in particular easily fall into.[1]
There is a whole spectrum of “how deceptive/schemy is the model”, which includes at least
deception—instrumental deception—alignment-faking—instrumental alignment-faking—scheming.[2]
Especially in casual conversations, people tend to conflate things like “someone builds a scaffolded LLM agent that starts to acquire power and resources, deceive humans (including about the agent’s aims), self-preserve, etc.” and “scheming”. This is incorrect. While the outlined scenario can count as instrumental alignment-faking, scheming (as a technical term defined by Carlsmith) demands training gaming, and hence scaffolded LLM agents fall outside the scope of the definition.[3]
The main point: when people debate the likelihood of scheming/deceptive alignment, they are NOT talking about whether scaffolded LLM agents will exhibit instrumental deception or the like. They are debating whether the training process creates models that “play the training game” (for instrumental reasons).
I think the right mental picture is to think of the dynamics of SGD and the training process, rather than the dynamics of LLM scaffolding and prompting.[4]
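To make the two pictures concrete, here is a minimal toy sketch (my illustration, not anything from Carlsmith’s report; every name in it is a hypothetical placeholder). Scheming is a hypothesis about the second loop, where parameters get updated; scaffolded-agent misbehavior lives entirely in the first, inference-only loop:

```python
# Toy sketch of the two loops, purely illustrative; all names are
# hypothetical placeholders, not a real agent or training setup.

def scaffolded_agent_loop(llm, tools, task):
    """Inference-time loop. Deception that shows up here is (instrumental)
    deception or alignment-faking, but not scheming: no weights change."""
    observation = task
    for _ in range(10):                 # bounded agent loop
        action = llm(observation)       # may include deceptive outputs
        observation = tools(action)     # act in the world, acquire resources
    return observation

def training_loop(params, batches, grad_fn, lr=1e-3):
    """Training-time loop. Scheming is a claim about this loop: a schemer
    scores well on the training objective *because* it has instrumental
    reasons to play the training game, not because it shares the goal."""
    for batch in batches:
        grads = grad_fn(params, batch)  # the selection pressure SGD applies
        params = [p - lr * g for p, g in zip(params, grads)]
    return params
```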
Corollaries:
This confusion allows for accidental motte-and-bailey dynamics:[5]
Motte: “scaffolded LLM agents will exhibit power-seeking behavior, including deception about their alignment” (which is what some might call “the AI scheming”)
Bailey: “power-motivated instrumental training gaming is likely to arise from such-and-such training processes” (which is what the actual technical term of scheming refers to)
People disagreeing with the bailey are not necessarily disagreeing about the motte.[6]
You can still be worried about the motte (indeed, that is bad as well!) without having to agree with the bailey.
See also: Deceptive AI ≠ Deceptively-aligned AI, which makes very closely related points, and my comment on that post listing a bunch of anti-examples of deceptive alignment.
(Source: I’ve seen this blurring pop up in a couple of conversations, and have earlier fallen into the mistake myself.)
Alignment-faking is basically just “deceiving humans about the AI’s alignment specifically”. Scheming demands that the model is training-gaming(!) for instrumental reasons. See the very beginning of Carlsmith’s report.
Scheming as an English word is descriptive of the situation, though, and this duplicate meaning of the word probably explains much of the confusion. “Deceptive alignment” suffers from the same issue (and can also be confused for mere alignment-faking, i.e. deception about alignment).
Note also that “there is a hyperspecific prompt you can use to make the model simulate Clippy” is basically separate from scheming: if Clippy-mode doesn’t activate during training, Clippy can’t training-game, and thus this isn’t scheming-as-defined-by-Carlsmith.
There’s more to say about context-dependent vs. context-independent power-seeking malicious behavior, but I won’t discuss that here.
I’ve found such dynamics in my own thoughts at least.
The motte and bailey are just very different. An example: in Alex Turner’s Many Arguments for AI x-risk are wrong, he seems to think deceptive alignment is unlikely while writing “I’m worried about people turning AIs into agentic systems using scaffolding and other tricks, and then instructing the systems to complete large-scale projects.”