Relatedly: to imagine the AI starting to succeed at those long-horizon tasks without imagining it starting to have more wants/desires (in the “behaviorist sense” expanded upon below) is, I claim, to imagine a contradiction—or at least an extreme surprise.
This seems like a great spot to make some falsifiable predictions which discriminate your particular theory from the pack. (As it stands, I don’t see a reason to buy into this particular chain of reasoning.)
AIs will increasingly be deployed and tuned for long-term tasks, so we can probably see the results relatively soon. So—do you have any predictions to share? I predict that AIs can indeed do long-context tasks (like writing books with foreshadowing) without having general, cross-situational goal-directedness.[1]
I have a more precise prediction:
AIs can write novels with at least 50% winrate against a randomly selected novel from a typical American bookstore, as judged by blinded human raters or LLMs which have at least 70% agreement with human raters on reasonably similar tasks.
Credence: 70%; resolution date: 12/1/2025
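For what it's worth, the resolution rule for this prediction is mechanical enough to sketch. The snippet below is only an illustration: the function names and the data layout (one boolean per blinded comparison, `True` when the AI novel was preferred) are assumptions made up for the example. The 50% and 70% thresholds are the ones stated in the prediction.

```python
def winrate(ai_preferred):
    """Fraction of blinded pairwise comparisons in which the AI novel won."""
    if not ai_preferred:
        raise ValueError("need at least one comparison")
    return sum(ai_preferred) / len(ai_preferred)


def judge_agreement(judge_labels, human_labels):
    """Fraction of items on which a candidate judge matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two equal-length, non-empty label lists")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def resolves_yes(ai_preferred, judge_labels=None, human_labels=None,
                 min_winrate=0.5, min_agreement=0.7):
    """Apply the stated criterion.

    If the raters are LLMs, pass their labels and the human labels from a
    reasonably similar task; the judge must reach min_agreement first.
    With human raters, leave both as None.
    """
    if judge_labels is not None:
        if judge_agreement(judge_labels, human_labels) < min_agreement:
            raise ValueError("judge does not meet the agreement bar")
    return winrate(ai_preferred) >= min_winrate


# Hypothetical data: the AI novel is preferred in 6 of 10 comparisons.
votes = [True] * 6 + [False] * 4
print(resolves_yes(votes))
```

Note that this only covers the winrate half of the bet. It says nothing about how tool-like the system is.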
Conditional on that, I predict with 85% confidence that it’s possible to do this with AIs which are basically as tool-like as GPT-4. I don’t know how to operationalize that in a way you’d agree to.
(I also predict that on 12/1/2025, there will be a new defense offered for MIRI-circle views, and a range of people still won’t update.)
[1] I expect most of real-world “agency” to be elicited by the scaffolding directly prompting for it (e.g. setting up a plan/critique/execute/summarize-and-postmortem/repeat loop for the LLM), and for that agency to not come from the LLM itself.
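A minimal sketch of the kind of scaffolding loop described here, under loud assumptions: `llm` is a hypothetical stateless callable (prompt in, text out), and the stage names, prompt formats, and `DONE` stopping convention are invented for illustration. Nothing here is a real library API.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical: prompt in, completion out


def run_scaffold(llm: LLM, task: str, max_rounds: int = 3) -> List[str]:
    """Plan/critique/execute/summarize loop; returns the per-round summaries."""
    notes: List[str] = []
    for _ in range(max_rounds):
        context = "\n".join(notes)
        plan = llm(f"PLAN\nTask: {task}\nNotes so far:\n{context}")
        critique = llm(f"CRITIQUE\nPlan: {plan}")
        result = llm(f"EXECUTE\nPlan: {plan}\nCritique: {critique}")
        summary = llm(f"SUMMARIZE\nResult: {result}")
        notes.append(summary)  # the only state carried across rounds
        if "DONE" in summary:  # assumed stopping convention
            break
    return notes


def stub_llm(prompt: str) -> str:
    """Stand-in model: reports DONE when asked to summarize."""
    stage = prompt.split("\n", 1)[0]
    return "DONE" if stage == "SUMMARIZE" else f"<{stage.lower()} output>"


print(run_scaffold(stub_llm, "write a novel with foreshadowing"))
```

In this toy version the persistence across rounds, the retrying, and the stopping rule all live in `run_scaffold`. The model is only ever called as a one-shot mapping from prompt to text, which is the sense in which the loop, not the LLM, would be supplying the goal-directed structure.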
The thing people seem to be disagreeing about is the thing you haven’t operationalized—the “and it’ll still be basically as tool-like as GPT4” bit. What does that mean and how do we measure it?
From my perspective, meaningfully operationalizing “tool-like” seems like A) almost the whole crux of the disagreement, and B) really quite difficult (i.e., requiring substantial novel scientific progress to accomplish), so it seems weird to leave as a simple to-do at the end.
Like, I think that “tool versus agent” shares the same confusion that we have about “non-life versus life”—why do some pieces of matter seem to “want” things, to optimize for them, to make decisions, to steer the world into their preferred states, and so on, while other pieces seem to “just” follow a predetermined path (algorithms, machines, chemicals, particles, etc.)? What’s the difference? How do we draw the lines? Is that even the right question? I claim we are many scientific insights away from being able to talk about these questions at the level of precision necessary to make predictions like this.
Concrete operationalizations seem great to ask for, when they’re possible to give—but I suspect that expecting/requesting them before they’re possible is more likely to muddy the discourse than clarify it.
I claim we are many scientific insights away from being able to talk about these questions at the level of precision necessary to make predictions like this.
Hm, I’m sufficiently surprised at this claim that I’m not sure that I understand what you mean. I’ll attempt a response on the assumption that I do understand; apologies if I don’t:
I think of tools as agents with oddly shaped utility functions. They tend to be conditional in nature.
A common form is to be a mapping between inputs and outputs that isn’t swayed by anything outside of the context of that mapping (which I’ll term “external world states”). You can view a calculator as a coherent agent, but you can’t usefully describe the calculator as a coherent agent with a utility function regarding world states that are external to the calculator’s process.
You could use a calculator within a larger system that is describable as a maximizer over a utility function that includes unconditional terms for external world states, but that doesn’t change the nature of the calculator. Draw the box around the calculator within the system? Pretty obviously a tool. Draw the box around the whole system? Not a tool.
I’ve been using the following two requirements to point at a maximally[1] tool-like set of agents. Together they make up what I’ve been calling goal agnosticism:
1. The agent cannot be usefully described[2] as having unconditional preferences about external world states.
2. Any uniformly random sampling of behavior from the agent has a negligible probability of being a strong and incorrigible optimizer.
Note that this isn’t the same thing as a definition for “tool.” An idle rock uselessly obeys this definition; tools tend to be useful for something. This definition is meant to capture the distinction between things that feel like tools and those that feel like “proper” agents.
To phrase it another way, the intuitive degree of “toolness” is a spectrum of how much the agent exhibits unconditional preferences about external world states through instrumental behavior.
Notably, most pretrained LLMs with the usual autoregressive predictive loss and a diverse training set are heavily constrained into fitting this definition. Anything equivalent to RL agents trained with sparse/distant rewards is not. RLHF bakes a peculiarly shaped condition into the model. I wouldn’t be surprised if the result no longer strictly obeys the definition, but it’s close enough along the spectrum that it still feels intuitive to call it a tool.
Further, just like in the case of the calculator, you can easily build a system around a goal agnostic “tool” LLM that is not, itself, goal agnostic. Even prompting is enough to elicit a new agent-in-effect that is not necessarily goal agnostic. The ability for a goal agnostic agent to yield non-goal agnostic agents does not break the underlying agent’s properties.[3]
[1] For one critical axis in the toolishness basis, anyway.
[2] Tricky stuff like having a bunch of terms regarding external world states that just so happen to always cancel doesn’t count.
[3] This does indeed sound kind of useless, but I promise the distinction does actually end up mattering quite a lot! That discussion goes beyond the scope of this post. The FAQ goes into more depth.
I didn’t leave it as a “simple” to-do, but rather an offer to collaboratively hash something out.
That said: If people don’t even know what it would look like when they see it, how can one update on evidence? What is Nate looking at which tells him that GPT doesn’t “want things in a behaviorist sense”? (I bet he’s looking at something real to him, and I bet he could figure it out if he tried!)
To be clear, I’m not talking about formalizing the boundary. I’m talking about a bet between people, adjudicated by people.
(EDIT: I’m fine with a low sensitivity, high specificity outcome—we leave it unresolved if it’s ambiguous / not totally obvious relative to the loose criteria we settled on. Also, the criterion could include randomly polling n alignment / AI people and asking them how “behaviorally-wanting” the system seemed on a Likert scale. I don’t think you need fundamental insights for that to work.)
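To make the polling idea concrete, here is a rough sketch of how such a people-adjudicated resolution could work. Everything specific in it is an assumption for illustration: the 1-to-5 scale, the `margin` that defines "ambiguous", the outcome labels, and the function names. The part taken from the comment is the shape of the rule: sample n raters at random, and leave the bet unresolved unless the answer is clear.

```python
import random
from statistics import mean


def poll_sample(population, n, seed=None):
    """Randomly pick n raters from a list of candidate alignment / AI people."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def resolve(ratings, scale_max=5, margin=1.0):
    """Resolve from Likert ratings of how "behaviorally-wanting" the system seemed.

    ratings: integers from 1 (not at all) to scale_max (very much).
    The bet resolves only when the mean sits at least `margin` away from the
    scale midpoint; anything in between stays unresolved.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    midpoint = (1 + scale_max) / 2
    avg = mean(ratings)
    if avg <= midpoint - margin:
        return "tool-like"
    if avg >= midpoint + margin:
        return "agent-like"
    return "unresolved"


# Hypothetical ratings from five polled raters.
print(resolve([1, 2, 1, 2, 2]))
print(resolve([3, 3, 4, 2, 3]))
```

Widening `margin` trades sensitivity for specificity: more polls come back unresolved, but the ones that do resolve are harder to dispute.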