whose models did not predict that AIs which were unable to execute a takeover would display any obvious desire or tendency to attempt it.
Citation for this claim? Can you quote the specific passage which supports it?
If you read this post, starting at “The central interesting-to-me idea in capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning.”, and read the following 20 or so paragraphs, you’ll get some idea of 2018!Eliezer’s models about imitation agents.
I’ll highlight a couple of passages:
If I were going to talk about trying to do aligned AGI under the standard ML paradigms, I’d talk about how this creates a differential ease of development between “build a system that does X” and “build a system that does X and only X and not Y in some subtle way”. If you just want X however unsafely, you can build the X-classifier and use that as a loss function and let reinforcement learning loose with whatever equivalent of gradient descent or other generic optimization method the future uses. If the safety property you want is optimized-for-X-and-just-X-and-not-any-possible-number-of-hidden-Ys, then you can’t write a simple loss function for that the way you can for X.
[...]
On the other other other hand, suppose the inexactness of the imitation is “This agent passes the Turing Test; a human can’t tell it apart from a human.” Then X-and-only-X is thrown completely out the window. We have no guarantee of non-Y for any Y a human can’t detect, which covers an enormous amount of lethal territory, which is why we can’t just sanitize the outputs of an untrusted superintelligence by having a human inspect the outputs to see if they have any humanly obvious bad consequences.
I think with a fair reading of that post, it’s clear that Eliezer’s models at the time didn’t say that subhuman AI would necessarily have overtly bad intentions that humans could easily detect. You do have to read between the lines a little, because that exact statement isn’t made, but if you reconstruct how he was thinking about this at the time and look at what that model does and doesn’t expect, I think it answers your question.
So what’s the way in which agency starts to become the default as the model grows more powerful? (According to either you, or your model of Eliezer. I’m more interested in the “agency by default” question itself than I am in scoring EY’s predictions, tbh.)
I don’t really know what you’re referring to, maybe link a post or a quote?
See last paragraph here: https://www.lesswrong.com/posts/3EzbtNLdcnZe8og8b/the-void-1?commentId=Du8zRPnQGdLLLkRxP
It just doesn’t actually start to be the default (see this post, for example, as well as all the discourse around this post and this comment).
But that doesn’t necessarily solve our problems. Base models may be Tools or Oracles in nature,[1] but there is still a ton of economic incentive to turn them into scaffolded agents. Kaj Sotala wrote about this a decade and a half ago, when this question was also a hot debate topic:
Even if we could build a safe Tool AI, somebody would soon build an agent AI anyway. [...] Like with external constraints, Oracle AI suffers from the problem that there would always be an incentive to create an AGI that could act on its own, without humans in the loop. Such an AGI would be far more effective in furthering whatever goals it had been built to pursue, but also far more dangerous.
The usefulness of base models, IMO, comes either from agentic scaffolding simply not working very efficiently (which I believe is likely), or from helping alignment efforts: in terms of evals that demonstrate, as a Fire Alarm, a model’s ability to be used dangerously even if its desire to cause danger is lacking; in terms of AI-assisted alignment; or in other ways.
That would be very useful, and arguably even close to the best-case scenario for how prosaic ML-scale-up development of AI could have gone, compared to the alternatives.
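To make “scaffolded agents” concrete, here is a minimal sketch of the kind of wrapper involved: an outer loop that hands a task to a model, executes whatever tool the model names, and feeds the result back in. The query_model stub and the single list_files tool are purely illustrative stand-ins rather than any particular lab’s API; the point is just that the agency lives in the scaffold’s loop, not in the base model itself.

```python
# Minimal sketch of "agentic scaffolding": an outer loop that turns a
# purely question-answering model into something that acts in the world.
from typing import Callable

def query_model(prompt: str) -> str:
    # Stub "oracle": in reality this would be a call to a base model.
    if "observation: none" in prompt:
        return "ACTION: list_files"
    return "DONE: the project contains 2 files"

TOOLS: dict[str, Callable[[], str]] = {
    "list_files": lambda: "report.txt, notes.md",
}

def run_agent(task: str, max_steps: int = 5) -> str:
    observation = "none"
    for _ in range(max_steps):
        prompt = (
            f"TASK: {task}\n"
            f"observation: {observation}\n"
            "Reply with 'ACTION: <tool>' or 'DONE: <answer>'."
        )
        reply = query_model(prompt)
        if reply.startswith("DONE:"):
            return reply.removeprefix("DONE:").strip()
        tool_name = reply.removeprefix("ACTION:").strip()
        # The scaffold, not the model, actually executes the tool.
        observation = TOOLS.get(tool_name, lambda: "unknown tool")()
    return "step limit reached"

print(run_agent("count the files in the project"))
```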
Base models may be Tools or Oracles in nature,[1] but there is still a ton of economic incentive to turn them into scaffolded agents.
I would even go further, and say that there’s a ton of incentive to move away from the paradigm of LLMs as the primary approach altogether.
A big part of the reason is that current valuations only make sense if OpenAI et al. are simply correct that they can replace workers with AI within 5 years.
But currently there are several very important obstacles to this goal, and the big ones are data efficiency, long-term memory, and continual learning.
For data efficiency, one telling fact is that even in domains where LLMs excel, they require orders of magnitude more data than humans to get good at a task. And one of the reasons LLMs became as successful as they did in the first place is unfortunately not something we can replicate: the internet was a truly, truly vast store of data on a whole lot of topics. While I don’t think the views that LLMs don’t understand anything or simply memorize training data are correct, I do think a non-trivial part of why LLMs became so good is that we simply widened the distribution by giving them all of the data on the internet.
Synthetic data has so far empirically mostly not worked to expand the store of data, so by 2028 I expect labs to need to pivot to a more data-efficient architecture; and arguably, for tasks like computer use, they will need advances in data efficiency right now before AIs can get good at it.
For long-term memory, one of the issues with current AI is that its only memory so far is the context window. That doesn’t have to scale, and it also means that anything not saved in the context, which is most things, is basically gone: an LLM can’t build on one success or failure to set itself up for more successes, because it doesn’t remember that success or failure.
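A toy sketch of that limitation, under the simplifying assumption that memory is just a fixed-size buffer of recent turns (real context windows and KV caches are more involved, but the effect is the same): once a turn is evicted, there is no channel through which the model can ever see it again.

```python
from collections import deque

CONTEXT_LIMIT = 4  # how many turns fit in the "context window"

# In this toy picture, a bounded buffer is the model's entire memory.
context: deque[str] = deque(maxlen=CONTEXT_LIMIT)

def add_turn(turn: str) -> None:
    # deque silently evicts the oldest turn when full; the model never
    # "decides" to forget, the information simply falls out of scope.
    context.append(turn)

add_turn("user: the deploy script fails unless FOO=1 is set")  # hard-won lesson
add_turn("assistant: setting FOO=1 fixed it")
add_turn("user: now refactor module A")
add_turn("assistant: done")
add_turn("user: also tidy up module B")  # this evicts the FOO=1 lesson

print("\n".join(context))
# The FOO=1 fix is no longer anywhere the model can see, so a later
# session can happily repeat the original mistake.
```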
For continual learning, I basically agree with Dwarkesh Patel (https://www.dwarkesh.com/p/timelines-june-2025) on why continual learning is so important:
Sometimes people say that even if all AI progress totally stopped, the systems of today would still be far more economically transformative than the internet. I disagree. I think the LLMs of today are magical. But the reason that the Fortune 500 aren’t using them to transform their workflows isn’t because the management is too stodgy. Rather, I think it’s genuinely hard to get normal humanlike labor out of LLMs. And this has to do with some fundamental capabilities these models lack.
I like to think I’m “AI forward” here at the Dwarkesh Podcast. I’ve probably spent over a hundred hours trying to build little LLM tools for my post production setup. And the experience of trying to get them to be useful has extended my timelines. I’ll try to get the LLMs to rewrite autogenerated transcripts for readability the way a human would. Or I’ll try to get them to identify clips from the transcript to tweet out. Sometimes I’ll try to get them to co-write an essay with me, passage by passage. These are simple, self contained, short horizon, language in-language out tasks—the kinds of assignments that should be dead center in the LLMs’ repertoire. And they’re 5⁄10 at them. Don’t get me wrong, that’s impressive.
But the fundamental problem is that LLMs don’t get better over time the way a human would. The lack of continual learning is a huge huge problem. The LLM baseline at many tasks might be higher than an average human’s. But there’s no way to give a model high level feedback. You’re stuck with the abilities you get out of the box. You can keep messing around with the system prompt. In practice this just doesn’t produce anything even close to the kind of learning and improvement that human employees experience.
The reason humans are so useful is not mainly their raw intelligence. It’s their ability to build up context, interrogate their own failures, and pick up small improvements and efficiencies as they practice a task.
How do you teach a kid to play a saxophone? You have her try to blow into one, listen to how it sounds, and adjust. Now imagine teaching saxophone this way instead: A student takes one attempt. The moment they make a mistake, you send them away and write detailed instructions about what went wrong. The next student reads your notes and tries to play Charlie Parker cold. When they fail, you refine the instructions for the next student.
This just wouldn’t work. No matter how well honed your prompt is, no kid is just going to learn how to play saxophone from just reading your instructions. But this is the only modality we as users have to ‘teach’ LLMs anything.
Yes, there’s RL fine tuning. But it’s just not a deliberate, adaptive process the way human learning is. My editors have gotten extremely good. And they wouldn’t have gotten that way if we had to build bespoke RL environments for different subtasks involved in their work. They’ve just noticed a lot of small things themselves and thought hard about what resonates with the audience, what kind of content excites me, and how they can improve their day to day workflows.
Now, it’s possible to imagine some way in which a smarter model could build a dedicated RL loop for itself which just feels super organic from the outside. I give some high level feedback, and the model comes up with a bunch of verifiable practice problems to RL on—maybe even a whole environment in which to rehearse the skills it thinks it’s lacking. But this just sounds really hard. And I don’t know how well these techniques will generalize to different kinds of tasks and feedback. Eventually the models will be able to learn on the job in the subtle organic way that humans can. However, it’s just hard for me to see how that could happen within the next few years, given that there’s no obvious way to slot in online, continuous learning into the kinds of models these LLMs are.
LLMs actually do get kinda smart and useful in the middle of a session. For example, sometimes I’ll co-write an essay with an LLM. I’ll give it an outline, and I’ll ask it to draft the essay passage by passage. All its suggestions up till 4 paragraphs in will be bad. So I’ll just rewrite the whole paragraph from scratch and tell it, “Hey, your shit sucked. This is what I wrote instead.” At that point, it can actually start giving good suggestions for the next paragraph. But this whole subtle understanding of my preferences and style is lost by the end of the session.
Maybe the easy solution to this looks like a long rolling context window, like Claude Code has, which compacts the session memory into a summary every 30 minutes. I just think that titrating all this rich tacit experience into a text summary will be brittle in domains outside of software engineering (which is very text-based). Again, think about the example of trying to teach someone how to play the saxophone using a long text summary of your learnings. Even Claude Code will often reverse a hard-earned optimization that we engineered together before I hit /compact—because the explanation for why it was made didn’t make it into the summary.
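The /compact failure described at the end is easy to reproduce in a toy model of rolling summarization. The summarizer below is deliberately crude, keeping only lines tagged DECISION plus the latest turn; real compaction prompts are smarter than this, but the structural problem is the same: whatever the summary drops is invisible to every later step.

```python
MAX_LINES = 6  # pretend the "context window" only fits this many lines

def compact(transcript: list[str]) -> list[str]:
    # Crude summarizer: keep only tagged decisions plus the latest turn.
    decisions = [line for line in transcript if line.startswith("DECISION:")]
    return ["SUMMARY of earlier session:"] + decisions + transcript[-1:]

transcript: list[str] = []

def log(line: str) -> None:
    transcript.append(line)
    if len(transcript) > MAX_LINES:
        transcript[:] = compact(transcript)

log("discussed why the test suite is slow")
log("profiled it: fixture setup dominates the runtime")        # the reason
log("DECISION: cache fixtures between tests")
log("tried parallelism, rejected it: flaky on this codebase")  # another reason
log("DECISION: keep the tests single-threaded")
log("reviewed the diff together")
log("user: tests still feel slow, can we parallelise them?")   # triggers compaction

print("\n".join(transcript))
# Only the DECISION lines survive compaction; the reasons behind them are
# gone, so nothing stops the agent from reversing the hard-earned choice.
```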
there is still a ton of economic incentive to turn them into scaffolded agents
That’s equally an incentive to turn them into aligned agents, agents that work for you.
People want power, but not at the expense of control.
Power that you can’t control is no good to you. Taking the brakes off a car makes it more powerful, but more likely to kill you. No army wants a weapon that will kill their own soldiers, no financial organisation wants a trading system that makes money for someone else, or gives it away to charity, or crashes.
The maximum of power and the minimum of control is an explosion. One needs to look askance at what “agent” means as well. Among other things, it means an entity that acts on behalf of a human, as in principal/agent. An agent is no good to its principal unless it has a good enough idea of its principal’s goals. So while people will want agents, they won’t want misaligned ones; misaligned with themselves, that is.