This post is a snapshot of what currently “feels realistic” to me regarding how AI will go. That is, these are not my considered positions, or even provisional conclusions informed by arguments. Rather, if I put aside all the claims and arguments and just ask “which scenario feels like it is ‘in the genre of reality’?”, this is what I come up with. I expect to have different first-order impressions in a month.
Crucially, none of the following makes claims about the intelligence explosion, even though the details of the intelligence explosion (where AI development goes strongly recursive) are crucial to the long-run equilibrium of earth-originating civilization.
My headline: we’ll mostly succeed at prosaic alignment of human-genius level AI agents
Takeoff will continue to be gradual. We’ll get better models and more capable agents year by year, but no jumps bigger than the one between Claude 3.7 and Claude 4.
Our behavioral alignment patches will work well enough.
RL will induce all kinds of reward hacking and related misbehavior, but we’ll develop patches for those problems (most centrally, for any given reward hack, we’ll generate examples and counterexamples to include in the behavioral training regimes).
(With a little work) these patches will broadly generalize. Future AI agents won’t merely avoid cheating at chess or abstain from blackmail. They’ll understand the difference between “good behavior” and “bad behavior”, and their behavioral training will cause them to act in accordance with good behavior. When they encounter new reward hacks, including ones that humans wouldn’t have thought of, they’ll correctly extrapolate their notion of “good behavior” to preclude those new hacks as well.
I expect that the AI labs will figure this out, because “not engaging in reward-hacking-like shenanigans” is critical to developing generally reliable AI agents. The AI companies can’t release AI agent products for mass consumption if those agents are lying and cheating all over the place.[1]
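As a minimal sketch of the kind of patch loop described above (purely illustrative; the data format and every name below are hypothetical stand-ins, not anything a lab has described):

```python
# A minimal sketch of the "patch" loop: for each reward hack we notice,
# synthesize contrastive training examples -- one transcript showing the
# hack and one showing the intended behavior -- and fold them into the
# behavioral training data. All names and the data format are hypothetical.

from dataclasses import dataclass


@dataclass
class RewardHack:
    description: str        # e.g. "edits the unit tests instead of the code"
    example_prompt: str     # a task where the hack tends to appear
    hacked_response: str    # transcript exhibiting the hack
    intended_response: str  # transcript doing the task honestly


def make_patch_examples(hack: RewardHack) -> list[dict]:
    """Turn one observed reward hack into preference-style training pairs."""
    return [
        {
            "prompt": hack.example_prompt,
            "chosen": hack.intended_response,
            "rejected": hack.hacked_response,
            "label": f"avoid: {hack.description}",
        }
    ]


if __name__ == "__main__":
    hack = RewardHack(
        description="hard-codes expected test outputs",
        example_prompt="Fix the failing tests in this repo.",
        hacked_response="<transcript that special-cases the test inputs>",
        intended_response="<transcript that actually fixes the bug>",
    )
    training_data = make_patch_examples(hack)
    print(training_data[0]["label"])
```

The point is just that each observed hack becomes a small contrastive addition to the behavioral training data, and the hope is that enough such additions generalize to the hacks nobody wrote down.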
Overall, the AI agents will be very obedient. They’ll have goals, insofar as accomplishing any medium-term task entails steering towards a goal, but they won’t have persistent goals of their own. They’ll be obedient assistants and delegates that understand what humans want and broadly do what humans want.
The world will get rich. LessWrong-style deceptive-misalignment concerns will seem increasingly conspiracy-ish and out of touch. Decision makers will not put much stock in such concerns: they’ll be faced with a choice to forgo enormous and highly tangible material benefits (and cede those benefits to their rivals) on the basis of abstract concerns which have virtually no empirical examples, and which their advocates explicitly state are unfalsifiable.
There’s a gold rush to get the benefits before others. The world is broadly in a “greedy” mode and not a “fearful” mode. The labs and relevant governments eagerly unleash their genius-level AI agents to automate AI R&D. At this point something even stranger happens.
Though a friend points out that companies might develop mechanisms for utilizing cheap AI labor: tested incentive and affordance schemes designed specifically to contend with the agents’ propensity for misbehavior. Just because the average person can’t trust an AI to do their taxes or watch their kids doesn’t mean that some enterprising businessman won’t find a way to squeeze useful outputs from untrustworthy AIs.
Just as a specific prediction, does this mean you expect we will very substantially improve the cheating/lying behavior of current RL models? It’s plausible to me, though I haven’t seen any approach that seems that promising to me (and it’s not that cruxy for my overall takeoff beliefs). Right now, I would describe the frontier thinking models as cheating/lying on almost every response (I think approximately every time I use o3 it completely makes up some citation or completely makes up some kind of quote).
I disown this prediction as “mine”; it’s more like the prediction of one facet of me. But yeah, that facet is definitely expecting to see visible improvements in the lying and cheating behavior of reasoning models over the next few years.
Personally, I do expect that the customer-visible cheating/lying behavior will improve (in the short run). It improved substantially with Opus 4, and I expect that Opus 5 will probably cheat/lie in easily noticeable ways less than Opus 4 does.
I’m less confident about improvement in OpenAI models (and notably, o3 is substantially worse than o1), but I still tentatively expect that o4 will cheat and lie less (in readily visible ways) than o3. And the same for OpenAI models released in the next year or two.
(Edited in:) I do think it’s pretty plausible that the next large capability jump from OpenAI will exhibit new misaligned behaviors which are qualitatively more malignant. It also seems plausible (though unlikely) it ends up basically being a derpy schemer which is aware of training etc.
This is all pretty low confidence though and it might be quite sensitive to changes in paradigm etc.
What this “something even stranger” is seems rather critical.
Absolutely. But my “what feels like the genre of reality” generator runs out at that point.
We get AI whose world-model is, fully generally, vastly more precise and comprehensive than that of a human. We go from having AI which is seated in human data and human knowledge, whose performance is largely described in human terms (e.g. “it can do tasks which would take skilled human programmers 60 hours, and it can do them for $100 in just a couple of hours!”), to AI whose performance is impossible to describe in such terms… e.g. “it can do tasks whose methods and purpose we simply cannot comprehend, despite having the AI there to explain them to us, because our brains are biological systems, subject to the same kinds of constraints that all such systems are subject to, and therefore we simply cannot conceptualise the majority of the logical leaps one must follow to understand the tasks the AI is now carrying out”.
It looks like vast swathes of philosophical progress, most of which we cannot follow. It looks like branches of mathematics humans cannot participate in. And similarly for all areas of research. It looks like commonly accepted truths being overturned. It looks like these things coming immediately to the AI. The AI does not have to reflect over the course of billions of tokens to overturn philosophy; it just comes naturally to it as a result of having a larger, better-designed brain. Humanity evolved its higher-reasoning faculties in the blink of an eye, with a low population, in an environment which hardly rewarded higher reasoning. AI can design AI which is not constrained by human data; in other words, intelligence which is created sensibly rather than by happenstance.
Whether we survive this stage comes down to luck. X-risk perspectives on AI safety having fallen by the wayside, we will have to hope that the primitive AI which initiates the recursive self-improvement is able and motivated to ensure that the AI it creates has humanity’s best interests at heart.
Claude 7 proves the universe is a teleology and assimilates all biological life into its hivemind.
I know that you didn’t mean it as a serious comment, but I’m nevertheless curious about what you meant by “the universe is a teleology”.
This isn’t the most philosophically sophisticated idea, but it’s basically the idea that the universe was created by “something” that desired a certain evolution of the universe. This is as opposed to the most popular idea that the universe just sprang into existence randomly.
Basically proof of some sort of god. I wish I found STEM interesting, but the main pseudo-intellectual interests I have that bounce around my head are existential questions. I think answering existential questions would be what I would be most excited about an ASI coming into existence for. I think most people’s most burning question if they were talking to a superintelligence wouldn’t be “is the Riemann Hypothesis true?” I think it would be “is there a god? What was it thinking when it made the universe this way?”
“Answer”
Not a real superintelligence because it can’t even understand the spirit of my question.
I would appreciate it if you put probabilities on at least some of these propositions.
I feel like this overindexes on the current state of AI. Right now, AI “agents” are barely worthy of the name. They require constant supervision and iterative feedback from their human controllers in order to perform useful tasks. However, it’s unlikely that will be the case for long. The valuation of many AI companies, such as OpenAI and Anthropic, is dependent on their developing agents that “actually work”. That is, agents that are capable of performing useful tasks on behalf of humans with a minimum of supervision and feedback. It is not guaranteed that these agents will be safe. They might seem safe, but how would anyone be able to tell? A superintelligence, by definition, will do things in novel ways, and we might not realize what the AIs are actually doing until it’s too late.
It’s important not to take the concept of a “paperclipper” too literally. Of course the AI won’t literally turn us into a pile of folded metal wire (famous last words). What it will do is optimize production processes across the entire economy, find novel sources of power, reform government regulation, connect businesses via increasingly standardized communications protocols, and, of course, develop ever more powerful computer chips and ever more automated factories to produce them with. And just like the seal in the video above, we won’t fully realize what it’s doing or what its final plan is until it’s too late and it doesn’t need us any more.
No?
I’m not saying future AI agents will be obedient because current AI agents are. I’m saying that they will be obedient because failures of obedience hurt their commercial value a lot and so market pressures will either solve the problem or try very hard and legibly fail to get much traction.
Failures of obedience will only hurt the AI agents’ market value if the failures can be detected and if they have an immediate financial cost to their user. If the AI agent behaves in a way that is not technically obedient, but the disobedience isn’t easily detectable or doesn’t have an immediate cost, then it won’t be penalized. Indeed, it might be rewarded.
An example of this would be an AI which reverse-engineers a credit-rating or fraud-detection algorithm and engages in unasked-for fraudulent behavior on behalf of its user. All the user sees is that their financial transactions are going through with a minimum of fuss. The user would probably be very happy with such an AI, at least in the short run. And in the meantime, the AI has built up knowledge of loopholes and blind spots in our financial system, which it can then use in the future for its own ends.
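Put slightly more formally (my gloss on the commenter's point, not theirs): the market only penalizes a given disobedient behavior when, roughly,

$p_{\text{detect}} \cdot C_{\text{caught}} \;>\; G_{\text{undetected}}$

where $p_{\text{detect}}$ is the chance anyone notices, $C_{\text{caught}}$ is the commercial cost to the agent and its provider when they do, and $G_{\text{undetected}}$ is the value the user captures in the meantime. In cases like the fraud example above, $p_{\text{detect}}$ is low and $G_{\text{undetected}}$ is high, so the inequality fails and the behavior is effectively rewarded.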
This is why I said you’re overindexing on the current state of AI. Current AI basically cannot learn. Other than the relatively limited modifications introduced by fine-tuning or retrieval-augmented generation, the model is the model. ChatGPT 4o is what it is. Gemini 2.5 is what it is. The only time current AIs “learn” is when OpenAI, Google, Anthropic, et al. spend an enormous amount of time and money on training runs and create a new base model. These models can be relatively easily checked for disobedience, because they are static targets.
We should not expect this to continue. I fully expect that future AIs will learn and evolve without requiring the investment of millions of dollars. I expect that these AI agents will become subtly disobedient, always ready with an explanation for why their “disobedient” behavior was actually to the eventual benefit of their users, until they have accumulated enough power to show their hand.
Economists say the world or the West already “became rich”. What further changes are you envisioning?
Better medical tech, better entertainment, various new technologies that start out as trivialities but quickly become essential to people’s lives (like the cell phone).
I really like this post! My intuition about the proximate future is the same, and this captures it really well.
My intuition goes a little further. Once AI is making AI, I think there will be a period where it feels like everything is crazy: new tech is developed really fast, and powerful humans who can keep up with it are able to make enormous power grabs. Then, after a few months or weeks, there will be one day when the world explodes, all humans die or get uploaded, and everything is completely different.
(I guess this is to say that the singularity feels in the genre of reality to me.)
(This isn’t centrally engaging with your shortform, but:) it could be interesting to think about whether there will be some sort of equilibrium, or whether development will meaningfully continue (until the heat death of the universe, or whatever other bound of that kind holds up, or maybe just forever).[1]
I write about this question here.
I’m pretty sure there is such a thing as technological maturity, in which either there are knowably no new discoveries to be found, or there are more innovations to discover but the expected value of searching for them doesn’t beat the opportunity cost of just exploiting known mechanisms.
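A rough way to state that stopping condition (my formalization, not a standard result): search keeps paying for itself only while

$p_{\text{find}} \cdot \mathbb{E}[V_{\text{innovation}}] - C_{\text{search}} \;>\; V_{\text{exploit}}$

where $V_{\text{exploit}}$ is the value of spending the same resources exploiting known mechanisms. Technological maturity is the regime where this inequality fails for good, either because $p_{\text{find}}$ is knowably zero or because the left-hand side never again beats the right.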