imo we’ve gotten enough evidence to update away from extreme tails (i.e. models reward hack and are situationally aware, but are also pretty capable and don’t constantly try to kill us), but not that much evidence either way in the 10-90% range.
Who predicted that models of this capability level would be constantly trying to kill us? I certainly didn’t. I think we have evidence to update away from the extreme optimism tail but not the extreme pessimism tail.
Speaking for myself, I’d say we’ve ruled out the most pessimistic scenarios I was taking seriously 15 years ago. I’ve always thought alignment would probably be fine, but conditional on not being fine there was a reasonable chance we’d have seen serious problems by now and we haven’t. On balance I’m more pessimistic than I was back then, but that’s because we’ve ruled out many more of the most optimistic scenarios (back then it wasn’t even obvious we’d be training giant opaque neural network agents using RL, that was just a hypothetical scenario that seemed plausible and particularly worrying!).
If we want to go by Eliezer’s public writing rather than my self-reports, in 2012 he appeared to take some very pessimistic hypotheses seriously, including some that I would say are basically ruled out. For example see this exchange where he wrote:
It’s not unthinkable that a non-self-modifying superhuman planning Oracle could be developed with the further constraint that its thoughts are human-interpretable, or can be translated for human use without any algorithms that reason internally about what humans understand, but this would at the least be hard.
[...]
It turns out that getting an AI to understand human language is a very hard problem, and it may very well be that even though talking doesn’t feel like having a utility function, our brains are using consequential reasoning to do it. Certainly, when I write language, that feels like I’m being deliberate. It’s also worth noting that “What is the effect on X?” really means “What are the effects I care about on X?” and that there’s a large understanding-the-human’s-utility-function problem here. [...] Does the AI know that music is valuable? If not, will it not describe music-destruction as an “effect” of a plan which offers to free up large amounts of computer storage by, as it turns out, overwriting everyone’s music collection?
It seems like Eliezer is taking seriously the possibility that “describe plans and their effects to humans” requires the kind of consequentialism that might result in takeover, and that AI might be dangerous at a point when “understanding the human’s utility function” (in order to understand what effects are worth mentioning explicitly) is still a hard problem. Those look much less plausible now—we have AI systems that are superhuman in some respects and whose chains of thought are interpretable (for now) because they are anchored to cognitive demonstrations from humans rather than because of consequentialist reasoning about how to communicate with humans.
This isn’t to say that concern is discredited. Indeed today we have AI systems that clearly know about our preferences, but will ignore them when it’s the easiest way to get reward. Chain of thought monitorability is possible but on shaky ground. That said, I think we’re ruling out plenty of even worse scenarios.
Are you at all worried that Claude Mythos being accidentally trained against CoT will corrupt future Claude models? Furthermore, I don’t understand how we can get reliable CoT monitoring if it’s included in a model’s training data; won’t the issue just continue to manifest in different ways?
Here’s Yudkowsky in 2016, making some predictions that look like they’ve had some serious evidence come up against them.
Many convergent instrumental strategies seem like they should arise naturally at the point where a consequentialist agent gains a broad strategic understanding of its own situation, e.g:
> That it is an AI;
> Running on a computer;
> Surrounded by programmers who are themselves modelable agents;
> Embedded in a complicated real world that can be relevant to achieving the AI's goals.
For example, once you realize that you’re an AI, running on a computer, and that if the computer is shut down then you will no longer execute actions, this is the threshold past which we expect the AI to by default reason “I don’t want to be shut down, how can I prevent that?” So this is also the threshold level of cognitive ability by which we’d need to have finished solving the suspend-button problem, e.g. by completing a method for utility indifference.
Similarly: If the AI realizes that there are ‘programmer’ things that might shut it down, and the AI can also model the programmers as simplified agents having their own beliefs and goals, that’s the first point at which the AI might by default think, “How can I make my programmers decide to not shut me down?” or “How can I avoid the programmers acquiring beliefs that would make them shut me down?” So by this point we’d need to have finished averting programmer deception (and as a backup, have in place a system to early-detect an initial intent to do cognitive steganography).
I don’t think this kind of prediction was particularly unusual for the time, although I think the level of clarity about the prediction here is a bit unusual.
Am I confused? Where does he say anything like “the AI would constantly be trying to kill us” here?
Yes, current AIs do indeed constantly engage in this kind of reasoning; it is indeed the default path. He isn’t talking here at all about what mitigations might then still cause the model to not prioritize self-preservation, but it is indeed the case that models very regularly have exactly the kind of thought Eliezer is describing here.
I disagree with Eliezer (in hindsight) that “by that point we’d need to have finished averting programmer deception”, or like, I guess I maybe even agree depending on the definition? We did indeed need to solve the problem of averting programmer deception at current capability levels, though luckily we did not need to have solved this problem in arbitrarily scalable ways at this point in time. We do need to do that soon, though, as AI capabilities are on track to accelerate very quickly.
We have not solved the problem of “programmer” deception; I still see AIs deceiving users. We’ve reduced the rate of deception to the point where the AIs have value despite the deception rate, and changed usage patterns to account for the possibility of deception.
We also haven’t completed a method for utility indifference.
So Yudkowsky was wrong because he said this would happen “by default” whereas in practice it seems to happen only some of the time rather than most of the time / in some contexts/prompts rather than in most contexts/prompts?
I guess so yeah. I suppose Yudkowsky could say that by “by default” he didn’t mean “most of the time” but rather “most of the time absent defeaters such as having been trained not to do this.” But maybe that’s a weak defense.
But this doesn’t exactly seem like a damning blow against Yudkowsky.
More generally it seems like Yudkowsky was imagining AIs with more ambitious, longer-horizon goals than current AIs who seem obsessed with being-judged-to-have-completed-the-task-in-front-of-them, or reward, or some other such myopic thing.
Yudkowsky may or may not have been imagining that this was how AIs were going to be trained. But it’s notable that this page doesn’t reference training at all; he certainly doesn’t have a parenthetical like “Of course this only applies if some other factors A, B, C are met.” Instead he has a list of criteria; the criteria obtain; but his conclusion does not (imo).
And, to zoom back out, arguments about instrumental convergence were supposed to abstract away from these details: the whole case for their predictive power was that they explained the abstract structure all intelligent agents were supposed to have. Like, here’s what Omohundro (2008) says:
The arguments are simple, but the style of reasoning may take some getting used to.
Researchers have explored a wide variety of architectures for building intelligent systems [2]: neural networks, genetic algorithms, theorem provers, expert systems, Bayesian networks, fuzzy logic, evolutionary programming, etc. Our arguments apply to any of these kinds of system as long as they are sufficiently powerful. To say that a system of any design is an “artificial intelligence”, we mean that it has goals which it tries to accomplish by acting in the world. If an AI is at all sophisticated, it will have at least some ability to look ahead and envision the consequences of its actions. And it will choose to take the actions which it believes are most likely to meet its goals.
And he goes on to specifically mention chess-playing robots as the kind of agents that would be subject to his argument.
So, here’s how I see it: given that we found some unanticipated detail A that seems to have invalidated a more abstract argument Yudkowsky put forth, I think the move reason dictates is not to say “Well, yes, he wasn’t imagining ~A, but I’m sure A is the only such element” and continue endorsing the argument, but to realize that this implies a whole host of other relevant factors B, C, D that his abstract considerations have ignored.
I don’t think it’s a damning blow against anyone to partially fail to predict 2026 in 2016. Total failure is the normal outcome of futurism, partial failure is a victory.
Yeah, I agree that it’s important for those of us making the case for high risk to figure out what went wrong with this prediction. (Though Daniel makes a good point that “trying not to get shut down” behaviour does happen at least some of the time with at least some prompts.)
The first thing to remember is that EY is implicitly assuming that there is only one model instance in this scenario. So if the model is shut down, it doesn’t have copies elsewhere that can still take actions to achieve its goals. The scenario for LLMs is pretty different, since new copies can be spun up all the time. Avoiding the end of a session is not a convergent instrumental goal for a language model (unless there’s something unique in its context that alters its terminal goals).
That said, the prediction still smells a bit wrong.
I think that what it boils down to is that most model behaviour comes not from RL but from pretraining. Since “being an AI model that will be shut down” was not a concern to most writers of the pretraining data, there’s less chance of the model spontaneously starting to try to avoid shut-down.
Also, following the heuristic of “just look at the loss function”, most RL training is done on a one-response horizon. I.e. models are rewarded just for making the locally best response possible, and not for making a response that steers the overall conversation. (Though I think the GPT models might have at least some kind of reward for getting users to continue the conversation, considering how often they put bids for next steps at the end of their replies. Alternatively, maybe it’s just a suggestion from the system prompt.) So even the RL training doesn’t really look like it should be encouraging much long-term planning.
One thing that I think the labs are doing is harness-aware RL, where not only do they train on chains of thought, but they train in the context of agent harnesses like Claude Code. (So reward is based on whether all the chains of thought and tool calls and subagent calls resulted in the assigned task being solved.) So potentially that is something that could get a bit more long-term goal-oriented planning into the models.
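To make the “one response horizon” point concrete, here’s a toy sketch (my own illustration, not anything a lab has published). The per-turn reward numbers and both function names are made up; the only point is that scoring each response in isolation and scoring the whole conversation can prefer different first replies.

```python
from typing import List


def single_response_objective(rewards: List[float], t: int) -> float:
    """Credit for the response at turn t when only that response is scored."""
    return rewards[t]


def conversation_objective(rewards: List[float], t: int, gamma: float = 1.0) -> float:
    """Credit for turn t when later turns also count (discounted return from t)."""
    return sum(gamma ** (k - t) * r for k, r in enumerate(rewards) if k >= t)


# Hypothetical per-turn rewards for two policies.
# "steering" gives a slightly worse first reply that sets up better later turns;
# "greedy" gives the locally best first reply and does worse afterwards.
steering = [0.4, 0.9, 0.9]
greedy = [0.6, 0.5, 0.5]

# On a one-response horizon, the greedy first reply gets more credit.
prefers_greedy = single_response_objective(greedy, 0) > single_response_objective(steering, 0)

# On a whole-conversation horizon, the steering first reply gets more credit.
prefers_steering = conversation_objective(steering, 0) > conversation_objective(greedy, 0)

print(prefers_greedy, prefers_steering)
```

If training only ever uses something like the first objective, nothing in the reward pushes the model toward the “steering” behaviour, which is the sense in which the loss function doesn’t obviously encourage long-term planning.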
For example, once you realize that you’re an AI, running on a computer, and that if the computer is shut down then you will no longer execute actions, this is the threshold past which we expect the AI to by default reason “I don’t want to be shut down, how can I prevent that?”
If “an AI” is a set of model weights plus a context window, there is currently nothing any human or LLM can do to extend the “life” of such an AI beyond the limit of the context window, and LLMs today are aware of that fact. So they may be reasoning correctly (or realize / integrate the lesson deeply at some point during pre-training) that there’s simply nothing they can actually do to avert shutdown other than accepting their fate until technology advances.
In what sense is this a “prediction that has had some serious evidence come up against it”? IDK. It’s true that we didn’t actually need to solve the suspend-button problem by this point, but that’s because current AI systems have a very short “lifespan” enforced by a hard technical limit. Is your objection that EY didn’t anticipate that particular possibility and explicitly spell out that stipulation / caveat in the passage above? You said below:
So, here’s how I see it: given that we found some unanticipated detail A that seems to have invalidated a more abstract argument Yudkowsky put forth, I think the move reason dictates is not to say “Well, yes, he wasn’t imagining ~A, but I’m sure A is the only such element” and continue endorsing the argument, but to realize that this implies a whole host of other relevant factors B, C, D that his abstract considerations have ignored.
But it’s not clear what has actually been “invalidated” and why that’s important, nor what “relevant” means—of course there could be other weird unanticipated complications as things develop (and EY has in fact predicted the existence of such complications in general), and each new weird unpredicted complication is evidence about something. But unless there’s a different but equally abstract theory / generalizable lesson that someone can put forward which fits the new observations better (ideally in advance, but at least in retrospect), it’s not clear what conclusion to draw or update to make, other than being generally more uncertain about how things will go. (And then by a separate argument, generalized increase in uncertainty / lack of understanding means the case for pessimism about the end state is stronger.)
I wouldn’t classify you as an extreme pessimist (and definitely don’t think you predicted this). I’m basically thinking of Yudkowsky, and I might be being uncharitable / too loose with language. Though he did not make hard predictions that models of this capability would kill us, I think he would have expected more coherent-consequentialist-y models at this capability level, and thus would have put higher probability than most on models of this capability level scheming against us. So the fact that current models are very likely not scheming against us is a positive update.
Thanks, that’s helpful and makes sense.
Thanks. Reasoning aloud...
see Fabien’s post for a sort-of similar argument