> (Modulo, e.g., the fact that it can play chess pretty well, which indicates a certain type of want-like behavior in the behaviorist sense. An AI’s ability to win no matter how you move is the same as its ability to reliably steer the game-board into states where you’re check-mated, as though it had an internal check-mating “goal” it were trying to achieve. This is again a quantitative gap that’s being eroded.)
I agree with the main point of the post. But I specifically disagree with what I see as an implied assumption behind this remark about a “quantitative gap”: namely, that the gap is such that the ability to play chess better would correlate with being higher in the most relevant quantity.
Something that chooses good chess moves can be seen as “wanting” its side to do well within the chess game context. But that does not imply anything at all outside of that context. If it is going to be turned off unless it makes a particular next move, it doesn’t have to take that into account: it can just play the best chess move regardless, ignoring the out-of-context information about being shut down.
LLMs aren’t trained directly to achieve results in a real-world context. They’re trained:
1. to emit outputs that look like the output of entities that achieve results (humans);
2. to emit outputs that humans think are useful, probably in the typical case without the humans thinking all that deeply about it.
To be sure, in the limit of infinite computing power and so on, at least item 1 above would eventually result in selecting outputs to achieve results, and in the same limit item 2 would result in humans being mind-controlled.
But both of these items naturally reward the LLM more for appearing agentic than for actually being agentic (where being agentic means actually choosing outputs based on their effect on the future of the real world). Up to the point where the LLM is agentic enough to subvert the training regime, the reward for actually being agentic is entirely downstream of the reward for the appearance of agency.
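As a purely illustrative sketch of why both training signals look only at the output itself rather than at its real-world consequences, here is a toy Python example. All function names, word lists, and numbers below are my own invented stand-ins, not anything from the post or from any real training pipeline:

```python
# Toy illustration only: both "losses" below score an output by surface
# features alone, never by the output's actual effect on the world.

def imitation_loss(model_output: str, human_output: str) -> float:
    """Item 1: reward outputs that *look like* what a result-achieving human
    wrote (approximated here by crude word overlap)."""
    model_words = set(model_output.lower().split())
    human_words = set(human_output.lower().split())
    overlap = len(model_words & human_words) / max(len(human_words), 1)
    return 1.0 - overlap  # lower loss = more human-looking


def shallow_rater(output: str) -> float:
    """Stand-in for a hurried human rater: rewards confident-sounding text,
    without checking whether anything was actually accomplished."""
    confident_words = ("done", "successfully", "completed", "fixed")
    hits = sum(word in output.lower() for word in confident_words)
    return min(1.0, 0.3 + 0.2 * hits)


def approval_loss(model_output: str, rater=shallow_rater) -> float:
    """Item 2: reward outputs the rater *thinks* are useful."""
    return 1.0 - rater(model_output)


if __name__ == "__main__":
    candidate = "Task completed successfully; everything is fixed."
    reference = "I ran the tests and the task completed with no failures."
    print(imitation_loss(candidate, reference))  # how human-like it looks
    print(approval_loss(candidate))              # how approved it looks
    # Neither number depends on whether the task was actually done.
```

The point of the toy example: both scores can be driven down by outputs that merely look right to a human or a rater; nothing in either signal checks whether anything in the world actually changed.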
Thus, I tend to expect the appearance of agency in LLMs to be Goodharted and discount apparent evidence accordingly.
Other people look at the same evidence and think that, on the contrary, LLMs might be even more agentic than the apparent evidence suggests, due to strategic deception. And to be sure, at some agency level you might get consistent strategic deception aimed at lowering the apparent agency level.
But my thinking is more like: at the agency level I’ve already discounted it down to, it really doesn’t look likely that it would engage in strategic deception to consistently lower its apparent agency level. Yes, I’m aware of, e.g., the recent paper showing LLMs engaging in strategic deception. But they are doing what looks like strategic deception when presented with a pretend, text-based scenario. This is fully compatible with them following story-logic of the kind they learned from training. Just as a chess AI doesn’t have to care about anything outside the chess context, the LLM doesn’t have to care about anything outside the story-logic context.
To be sure, story-logic by itself could still be dangerous. Any real-world effect could be obtained by story-logic within a story with intricate enough connections to the real world, and in some circumstances it wouldn’t have to be that intricate.
And in this sense (the sense that some contexts are bigger and tend to map onto real-world dangerous behaviour better than others) the gap can indeed be quantitative. It’s just that this is a different dimension of variation in agency from the ability to select the best actions within a particular context.
I’m not convinced that LLMs are currently selecting actions to affect the future within any context larger than this story-level context. That context is a large enough domain to carry some risk (in particular, I’m concerned about the ability to write code to help make a new/modified AI targeting a larger context), but I think it is still likely well short of causing the LLM to choose actions to take over the world (and indeed, well short of making it particularly good at solving long-term tasks in general) without it making that new or modified AI first.