whose models did not predict that AIs which were unable to execute a takeover would display any obvious desire or tendency to attempt it.
Citation for this claim? Can you quote the specific passage which supports it? It reminds me of Phil Tetlock’s point about the importance of getting forecasters to forecast probabilities for very specific events, because otherwise they will always find a creative way to evaluate themselves so that their forecast looks pretty good.
(For example, can you see how Andrew Ng could claim that his “AI will be like electricity” prediction has been pretty darn accurate? I never heard Yudkowsky say “yep, that will happen”.)
I spent a lot of time reading LW back in the day, and I don’t think Yudkowsky et al ever gave a great reason for “agency by default”. If you think there’s some great argument for the “agency by default” position which people are failing to engage with, please link to it instead of vaguely alluding to it, to increase the probability of people engaging with it!
(By “agency by default” I mean spontaneous development of agency in ways creators didn’t predict—scheming, sandbagging, deception, and so forth. Commercial pressures towards greater agency through scaffolding and so on don’t count. The fact that adding agency to LLMs is requiring an active and significant commercial push would appear to be evidence against the thesis that it will appear spontaneously in unintended contexts. If it’s difficult to do it on purpose, then logically, it’s even more difficult to do it by accident!)
I think you misread my claim. I claim that whatever models they had, they did not predict that AIs at current capability levels (which are obviously not capable of executing a takeover) would try to execute takeovers. Given that I’m making a claim about what their models didn’t predict, rather than what they did predict, I’m not sure what I’m supposed to cite here; EY has written many millions of words. One counterexample would be sufficient for me to weaken (or retract) my claim.
EDIT: and my claim was motivated as a response to paragraphs like this from the OP:
It doesn’t matter that Claude is a bleeding heart and a saint, now. That is not supposed to be relevant to the threat model. The bad ones will come later (later, always later…). And when they come, will be “like Claude” in all the ways that are alarming, while being unlike Claude in all the ways that might reassure.
Like, yes, in fact it doesn’t really matter, under the original threat models. If the original threat models said the current state of affairs was very unlikely to happen (particularly the part where, conditional on having economically useful but not superhuman AI, those AIs were not trying to take over the world), that would certainly be evidence against them! But I would like someone to point to the place where the original threat models made that claim, since I don’t think that they did.
I claim that whatever models they had, they did not predict that AIs at current capability levels (which are obviously not capable of executing a takeover) would try to execute takeovers. Given that I’m making a claim about what their models didn’t predict, rather than what they did predict, I’m not sure what I’m supposed to cite here; EY has written many millions of words.
Oftentimes, when someone explains their model, they will also explain what their model doesn’t predict. For example, you might quote a sentence from EY which says something like: “To be clear, I wouldn’t expect a merely human-level AI to attempt takeover, even though takeover is instrumentally convergent for many objectives.”
If there’s no clarification like that, I’m not sure we can say either way what their models “did not predict”. It comes down to one’s interpretation of the model.
From my POV, the instrumental convergence model predicts that AIs will take actions they believe to be instrumentally convergent. Since current AIs make many mistakes, under an instrumental convergence model, one would expect that at times they would incorrectly estimate that they’re capable of takeover (making a mistake in said estimation) and attempt takeover on instrumental convergence grounds. This would be a relatively common mistake for them to make, since takeover is instrumentally useful for so many of the objectives we give AIs—as Yudkowsky himself argued repeatedly.
At the very least, we should be able to look at their cognition and see that they are frequently contemplating takeover, then discarding it as unrealistic given current capabilities. This should be one of the biggest findings of interpretability research.
I never saw Yudkowsky and friends explain why this wouldn’t happen. If they did explain why this wouldn’t happen, I expect that explanation would go a ways towards explaining why their original forecast won’t happen as well, since future AI systems are likely to share many properties with current ones.
If the original threat models said the current state of affairs was very unlikely… But I would like someone to point to the place where the original threat models made that claim, since I don’t think that they did.
Is there any scenario that Yudkowsky said was unlikely to come to pass? If not, it sounds kind of like you’re asserting that Yudkowsky’s ideas are unfalsifiable?
For me it’s sufficient to say: Yudkowsky predicted various events, and various other events happened, and the overlap between these two lists of events is fairly limited. That could change as more events occur—indeed, it’s a possibility I’m very worried about! But as a matter of intellectual honesty it seems valuable to acknowledge that his model hasn’t done great so far.
Also, I would still like an answer to my query for the specific link to the argument you want to see people engage with.
Also, I would still like an answer to my query for the specific link to the argument you want to see people engage with.
I haven’t looked very hard, but sure, here’s the first post that comes up when I search for “optimization user:eliezer_yudkowksky”.
The notion of a “powerful optimization process” is necessary and sufficient to a discussion about an Artificial Intelligence that could harm or benefit humanity on a global scale. If you say that an AI is mechanical and therefore “not really intelligent”, and it outputs an action sequence that hacks into the Internet, constructs molecular nanotechnology and wipes the solar system clean of human(e) intelligence, you are still dead. Conversely, an AI that only has a very weak ability to steer the future into regions high in its preference ordering, will not be able to much benefit or much harm humanity.
In this paragraph we have most of the relevant section (at least w.r.t. your specific concerns; it doesn’t argue for why most powerful optimization processes would eat everything by default, but that “why” is argued for at such extensive length elsewhere, when talking about convergent instrumental goals, that I will forgo sourcing it).
No, I don’t think the overall model is unfalsifiable. Parts of it would be falsified if we developed an ASI that was obviously capable of executing a takeover and it didn’t, without us doing quite a lot of work to ensure that outcome. (Not clear which parts, but probably something related to the difficulties of value loading & goal specification.)
Current AIs aren’t trying to execute takeovers because they are weaker optimizers than humans. (We can observe that even most humans are not especially strong optimizers by default, such that most people don’t exert that much optimization power in their lives, even in a way that’s cooperative with other humans.) I think they have much less coherent preferences over future states than most humans. If by some miracle you figure out how to create a generally superintelligent AI which itself does not have (more-coherent-than-human) preferences over future world states, whatever process it implements when you query it to solve a Very Difficult Problem will act as if it does.
EDIT: I see that several other people already made similar points re: sources of agency, etc.
an AI that only has a very weak ability to steer the future into regions high in its preference ordering, will not be able to much benefit or much harm humanity.
Arguably ChatGPT has already been a significant benefit/harm to humanity without being a “powerful optimization process” by this definition. Have you seen teachers complaining that their students don’t know how to write anymore? Have you seen junior software engineers struggling to find jobs? Shouldn’t these count as points against Eliezer’s model?
In an “AI as electricity” scenario (basically continuing the current business-as-usual), we could see “AIs”, as a collective, cause huge changes and eat all the free energy that a “powerful optimization process” would eat.
In any case, I don’t see much in your comment which engages with “agency by default” as I defined it earlier. Maybe we just don’t disagree.
No, I don’t think the overall model is unfalsifiable. Parts of it would be falsified if we developed an ASI that was obviously capable of executing a takeover and it didn’t, without us doing quite a lot of work to ensure that outcome. (Not clear which parts, but probably something related to the difficulties of value loading & goal specification.)
OK, but no pre-ASI evidence can count against your model, according to you?
That seems sketchy, because I’m also seeing people such as Eliezer claim, in certain cases, that things which have happened support their model. By conservation of expected evidence, it can’t be the case that evidence during a certain time period will only confirm your model. Otherwise you already would’ve updated. Even if the only hypothetical events are ones which confirm your model, it also has to be the case that absence of those events will count against it.
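(Spelling out the identity behind that, as a quick sketch: for a hypothesis H and a possible observation E, the law of total probability gives P(H) = P(H|E)·P(E) + P(H|¬E)·P(¬E), i.e. the prior is a weighted average of the two posteriors. So if P(H|E) > P(H) and 0 < P(E) < 1, then necessarily P(H|¬E) < P(H): evidence that would have confirmed the model must, by its absence, count against it.)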
I’ve updated against Eliezer’s model to a degree, because I can imagine a past-5-years world where his model was confirmed more, and that world didn’t happen.
Current AIs aren’t trying to execute takeovers because they are weaker optimizers than humans.
I think “optimizer” is a confused word and I would prefer that people taboo it. It seems to function as something of a semantic stopsign. The key question is something like: Why doesn’t the logic of convergent instrumental goals cause current AIs to try and take over the world? Would that logic suddenly start to kick in at some point in the future if we just train using more parameters and more data? If so, why? Can you answer that question mechanistically, without using the word “optimizer”?
Trying to take over the world is not an especially original strategy. It doesn’t take a genius to realize that “hey, I could achieve my goals better if I took over the world”. Yet current AIs don’t appear to be contemplating it. I claim this is not a lack of capability, but simply that their training scheme doesn’t result in them becoming the sort of AIs which contemplate it. If the training scheme holds basically constant, perhaps adding more data or parameters won’t change things?
If by some miracle you figure out how to create a generally superintelligent AI which itself does not have (more-coherent-than-human) preferences over future world states, whatever process it implements when you query it to solve a Very Difficult Problem will act as if it does.
The results of LLM training schemes give us evidence about the results of future AI training schemes. Future AIs could be vastly more capable on many different axes relative to current LLMs, while simultaneously not contemplating world takeover, in the same way current LLMs do not.
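Or because they are not optimizers at all.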
I don’t agree; they somehow optimize the goal of being an HHH assistant. We could almost say that they optimize the goal of being aligned. As nostalgebraist reminds us, Anthropic’s HHH paper was alignment work in the first place. It’s not that surprising that such optimizers happen to be more aligned than the canonical optimizers envisioned by Yudkowsky.

Edit, for precision: by “they” I mean the base models trying to predict the answers of an HHH assistant as well as possible (“as well as possible” being clearly a process of optimization, or I don’t know what optimization means). And in my opinion a sufficiently good prediction is effectively, or practically, a simulation. Maybe not a bit-perfect simulation, but a lossy simulation, a heuristic approximation of one.
LLMs are agent simulators. Why would they contemplate takeover more frequently than the kind of agent they are induced to simulate? You don’t expect a human white-collar worker, even one who makes mistakes all the time, to contemplate world domination plans, let alone attempt one. You could, however, expect the head of state of a world power to do so.
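Maybe not; see OP.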
You don’t expect a human white-collar worker, even one who makes mistakes all the time, to contemplate world domination plans, let alone attempt one. You could, however, expect the head of state of a world power to do so.
Yes, this aligns with my current “agency is not the default” view.
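… do you deny human white-collar workers are agents?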
Agency is not a binary. Many white collar workers are not very “agenty” in the sense of coming up with sophisticated and unexpected plans to trick their boss.
Human white-collar workers are unarguably agents in the relevant sense here (intelligent beings with desires and taking actions to fulfil those desires). The fact that they have no ability to take over the world has no bearing on this.
Human white-collar workers are unarguably agents in the relevant sense here (intelligent beings with desires and taking actions to fulfil those desires).
The sense that’s relevant to me is that of “agency by default” as I discussed previously: scheming, sandbagging, deception, and so forth.
You seem to smuggle in an unjustified assumption: that white collar workers avoid thinking about taking over the world because they’re unable to take over the world. Maybe they avoid thinking about it because that’s just not the role they’re playing in society. In terms of next-token prediction, a super-powerful LLM told to play a “superintelligent white-collar worker” might simply do the same things that ordinary white-collar workers do, but better and faster.
I think the evidence points towards this conclusion, because current LLMs are frequently mistaken, yet rarely try to take over the world. If the only thing blocking the convergent instrumental goal argument was a conclusion on the part of current LLMs that they’re incapable of world takeover, one would expect that they would sometimes make the mistake of concluding the opposite, and trying to take over the world anyways.
The evidence best fits a world where LLMs are trained in such a way that makes them super-accurate roleplayers. As we add more data and compute, and make them generally more powerful, we should expect the accuracy of the roleplay to increase further—including, perhaps, improved roleplay for exotic hypotheticals like “a superintelligent white-collar worker who is scrupulously helpful/honest/harmless”. That doesn’t necessarily lead to scheming, sandbagging, or deception.
I’m not aware of any evidence for the thesis that “LLMs only avoid taking over the world because they think they’re too weak”. Is there any reason at all to believe that they’re even contemplating the possibility internally? If not, why would increasing their abilities change things? Of course, clearly they are “strong” enough to be plenty aware of the possibility of world takeover; presumably it appears a lot in their training data. Yet it ~only appears to cross their mind if it would be appropriate for roleplay purposes.
There just doesn’t seem to be any great argument that “weak” vs “strong” will make a difference here.
You seem to smuggle in an unjustified assumption: that white collar workers avoid thinking about taking over the world because they’re unable to take over the world. Maybe they avoid thinking about it because that’s just not the role they’re playing in society.
White-collar workers avoid thinking about taking over the world because they’re unable to take over the world, and they’re unable to take over the world because their role in society doesn’t involve that kind of thing. If a white-collar worker were somehow drafted to be president of the United States, you would expect their propensity to think about world hegemony to increase. (Also, white-collar workers engage in scheming, sandbagging, and deception all the time? The average person lies 1-2 times per day.)
whose models did not predict that AIs which were unable to execute a takeover would display any obvious desire or tendency to attempt it.
Citation for this claim? Can you quote the specific passage which supports it?
If you read this post, starting at “The central interesting-to-me idea in capability amplification is that by exactly imitating humans, we can bypass the usual dooms of reinforcement learning.”, and read the following 20 or so paragraphs, you’ll get some idea of 2018!Eliezer’s models about imitation agents.
I’ll highlight
If I were going to talk about trying to do aligned AGI under the standard ML paradigms, I’d talk about how this creates a differential ease of development between “build a system that does X” and “build a system that does X and only X and not Y in some subtle way”. If you just want X however unsafely, you can build the X-classifier and use that as a loss function and let reinforcement learning loose with whatever equivalent of gradient descent or other generic optimization method the future uses. If the safety property you want is optimized-for-X-and-just-X-and-not-any-possible-number-of-hidden-Ys, then you can’t write a simple loss function for that the way you can for X.
[...]
On the other other other hand, suppose the inexactness of the imitation is “This agent passes the Turing Test; a human can’t tell it apart from a human.” Then X-and-only-X is thrown completely out the window. We have no guarantee of non-Y for any Y a human can’t detect, which covers an enormous amount of lethal territory, which is why we can’t just sanitize the outputs of an untrusted superintelligence by having a human inspect the outputs to see if they have any humanly obvious bad consequences.
I think with a fair reading of that post, it’s clear that Eliezer’s models at the time didn’t say that there would necessarily be overtly bad intentions that humans could easily detect from subhuman AI. You do have to read between the lines a little, because that exact statement isn’t made, but if you try to reconstruct how he was thinking about this stuff at the time and then look at what that model does and doesn’t expect, that answers your question.
So what’s the way in which agency starts to become the default as the model grows more powerful? (According to either you, or your model of Eliezer. I’m more interested in the “agency by default” question itself than I am in scoring EY’s predictions, tbh.)
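I don’t really know what you’re referring to, maybe link a post or a quote?

See last paragraph here: https://www.lesswrong.com/posts/3EzbtNLdcnZe8og8b/the-void-1?commentId=Du8zRPnQGdLLLkRxP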
It just doesn’t actually start to be the default (see this post, for example, as well as all the discourse around this post and this comment).
But that doesn’t necessarily solve our problems. Base models may be Tools or Oracles in nature,[1] but there is still a ton of economic incentive to turn them into scaffolded agents. Kaj Sotala wrote about this a decade and a half ago, when this question was also a hot debate topic:
Even if we could build a safe Tool AI, somebody would soon build an agent AI anyway. [...] Like with external constraints, Oracle AI suffers from the problem that there would always be an incentive to create an AGI that could act on its own, without humans in the loop. Such an AGI would be far more effective in furthering whatever goals it had been built to pursue, but also far more dangerous.
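For concreteness, the kind of scaffolding meant here is roughly the following loop. This is a schematic sketch of my own, and call_llm / run_tool are hypothetical stand-ins rather than any particular product’s API:

```python
# Schematic of a "scaffolded agent": an outer loop that turns a next-token
# predictor into something that pursues a goal with tools over many steps.
# call_llm() and run_tool() are hypothetical placeholders, not a real API.

def call_llm(transcript: str) -> str:
    """Placeholder for a call to some language model."""
    raise NotImplementedError

def run_tool(action: str) -> str:
    """Placeholder for executing a proposed tool call (search, code, ...)."""
    raise NotImplementedError

def scaffolded_agent(goal: str, max_steps: int = 10) -> str:
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        action = call_llm(transcript)        # the model proposes the next step
        if action.startswith("FINAL:"):      # the model decides it is finished
            return action[len("FINAL:"):].strip()
        observation = run_tool(action)       # the scaffold executes the step
        transcript += f"Action: {action}\nObservation: {observation}\n"
    return "step budget exhausted"
```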
The usefulness of base models, IMO, comes either from agentic scaffolding simply not working very efficiently (which I believe is likely) or from helping alignment efforts (either in terms of evals and demonstrating as a Fire Alarm the model’s ability to be used dangerously even if its desire to cause danger is lacking, or in terms of AI-assisted alignment, or in other ways).
Which is very useful and arguably even close to the best-case scenario for how prosaic ML-scale-up development of AI could have gone, compared to alternatives.
Base models may be Tools or Oracles in nature,[1] but there is still a ton of economic incentive to turn them into scaffolded agents.
I would even go further, and say that there’s a ton of incentives to move out of the paradigm of primarily LLMs altogether.
A big part of the reason is that the current valuations only make sense if OpenAI et al are just correct that they can replace workers with AI within 5 years.
But currently, there are a few very important obstacles to this goal, and the big ones are data efficiency, long-term memory, and continual learning.

For data efficiency, one telling observation is that even in domains where LLMs excel, they require orders of magnitude more data than humans to get good at a task. One of the reasons LLMs became as successful as they did in the first place is, unfortunately, not something we can replicate: the internet was a truly, truly vast store of data on a whole lot of topics. While I don’t think the views that LLMs don’t understand anything or simply memorize their training data are correct, I do think a non-trivial part of the reason LLMs became so good is that we simply widened the distribution by giving them all of the data on the internet.

Empirically, synthetic data has so far mostly not worked to expand the store of data, so by 2028 I expect labs to need to pivot to a more data-efficient architecture; arguably, for tasks like computer use, they will need advances in data efficiency right now before AIs can get good at it.

For long-term memory, one of the issues with current AI is that their only memory so far is the context window, which doesn’t scale well, and which also means that if something isn’t saved in the context (and most things won’t be), it’s basically gone. LLMs cannot build on one success or failure to set themselves up for more successes, because they don’t remember that success or failure.
For continual learning, I basically agree with Dwarkesh Patel here on why continual learning is so important:
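https://www.dwarkesh.com/p/timelines-june-2025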
Sometimes people say that even if all AI progress totally stopped, the systems of today would still be far more economically transformative than the internet. I disagree. I think the LLMs of today are magical. But the reason that the Fortune 500 aren’t using them to transform their workflows isn’t because the management is too stodgy. Rather, I think it’s genuinely hard to get normal humanlike labor out of LLMs. And this has to do with some fundamental capabilities these models lack.
I like to think I’m “AI forward” here at the Dwarkesh Podcast. I’ve probably spent over a hundred hours trying to build little LLM tools for my post production setup. And the experience of trying to get them to be useful has extended my timelines. I’ll try to get the LLMs to rewrite autogenerated transcripts for readability the way a human would. Or I’ll try to get them to identify clips from the transcript to tweet out. Sometimes I’ll try to get them to co-write an essay with me, passage by passage. These are simple, self contained, short horizon, language in-language out tasks—the kinds of assignments that should be dead center in the LLMs’ repertoire. And they’re 5⁄10 at them. Don’t get me wrong, that’s impressive.
But the fundamental problem is that LLMs don’t get better over time the way a human would. The lack of continual learning is a huge huge problem. The LLM baseline at many tasks might be higher than an average human’s. But there’s no way to give a model high level feedback. You’re stuck with the abilities you get out of the box. You can keep messing around with the system prompt. In practice this just doesn’t produce anything even close to the kind of learning and improvement that human employees experience.
The reason humans are so useful is not mainly their raw intelligence. It’s their ability to build up context, interrogate their own failures, and pick up small improvements and efficiencies as they practice a task.
How do you teach a kid to play a saxophone? You have her try to blow into one, listen to how it sounds, and adjust. Now imagine teaching saxophone this way instead: A student takes one attempt. The moment they make a mistake, you send them away and write detailed instructions about what went wrong. The next student reads your notes and tries to play Charlie Parker cold. When they fail, you refine the instructions for the next student.
This just wouldn’t work. No matter how well honed your prompt is, no kid is just going to learn how to play saxophone from just reading your instructions. But this is the only modality we as users have to ‘teach’ LLMs anything.
Yes, there’s RL fine tuning. But it’s just not a deliberate, adaptive process the way human learning is. My editors have gotten extremely good. And they wouldn’t have gotten that way if we had to build bespoke RL environments for different subtasks involved in their work. They’ve just noticed a lot of small things themselves and thought hard about what resonates with the audience, what kind of content excites me, and how they can improve their day to day workflows.
Now, it’s possible to imagine some way in which a smarter model could build a dedicated RL loop for itself which just feels super organic from the outside. I give some high level feedback, and the model comes up with a bunch of verifiable practice problems to RL on—maybe even a whole environment in which to rehearse the skills it thinks it’s lacking. But this just sounds really hard. And I don’t know how well these techniques will generalize to different kinds of tasks and feedback. Eventually the models will be able to learn on the job in the subtle organic way that humans can. However, it’s just hard for me to see how that could happen within the next few years, given that there’s no obvious way to slot in online, continuous learning into the kinds of models these LLMs are.
LLMs actually do get kinda smart and useful in the middle of a session. For example, sometimes I’ll co-write an essay with an LLM. I’ll give it an outline, and I’ll ask it to draft the essay passage by passage. All its suggestions up till 4 paragraphs in will be bad. So I’ll just rewrite the whole paragraph from scratch and tell it, “Hey, your shit sucked. This is what I wrote instead.” At that point, it can actually start giving good suggestions for the next paragraph. But this whole subtle understanding of my preferences and style is lost by the end of the session.
Maybe the easy solution to this looks like a long rolling context window, like Claude Code has, which compacts the session memory into a summary every 30 minutes. I just think that titrating all this rich tacit experience into a text summary will be brittle in domains outside of software engineering (which is very text-based). Again, think about the example of trying to teach someone how to play the saxophone using a long text summary of your learnings. Even Claude Code will often reverse a hard-earned optimization that we engineered together before I hit /compact—because the explanation for why it was made didn’t make it into the summary.
there is still a ton of economic incentive to turn them into scaffolded agents
That’s equally an incentive to turn them into aligned agents, agents that work for you.
People want power, but not at the expense of control.
Power that you can’t control is no good to you. Taking the brakes off a car makes it more powerful, but more likely to kill you. No army wants a weapon that will kill their own soldiers, no financial organisation wants a trading system that makes money for someone else, or gives it away to charity, or crashes.
The maximum of power and the minimum of control is an explosion. One needs to look askance at what “agent” means as well. Among other things, it means an entity that acts on behalf of a human—as in principal/agent. An agent is no good to its principal unless it has a good enough idea of its principal’s goals. So while people will want agents, they won’t want misaligned ones—misaligned with themselves, that is.
If your prototypical example of a contemporary computer program analogous to future AGI is a chess engine rather than an LLM, then agency by default is very intuitive: what humans think of as “tactics” to win material emerge from a comprehensive but efficient search for winning board-states without needing to be individually programmed. If contemporary LLMs are doing something less agentic than a comprehensive but efficient search for winning universe-states, there’s reason to be wary that this is not the end of the line for AI development. (If you could set up a sufficiently powerful outcome-oriented search, you’d expect creator-unintended agency to pop up in the winning solutions.)
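Upvoted. I agree.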
The reason “agency by default” is important is: if “agency by default” is false, then plans to “align AI by using AI” look much better, since agency is less likely to pop up in contexts you didn’t expect. Proposals to align AI by using AI typically don’t involve a “comprehensive but efficient search for winning universe-states”.
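To make the chess-engine intuition concrete, here is a toy sketch of my own (not anything from the post or from Yudkowsky): a bare negamax search over tic-tac-toe positions. Nothing in the code mentions blocking or forking, yet the search finds those tactics, because they are what lead to winning (or non-losing) terminal states.

```python
# Toy illustration: "tactics" emerging from outcome-oriented search.
# The search only knows the rules and the win condition; blocking and
# forking are never programmed in, yet they fall out of the search.

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

def negamax(board, player):
    """Return (score, move) for the side to move: +1 win, 0 draw, -1 loss."""
    opponent = "O" if player == "X" else "X"
    if winner(board) == opponent:
        return -1, None                      # the previous move won the game
    moves = [i for i, cell in enumerate(board) if cell == "."]
    if not moves:
        return 0, None                       # board full: draw
    best_score, best_move = -2, None
    for m in moves:
        board[m] = player
        score = -negamax(board, opponent)[0]
        board[m] = "."
        if score > best_score:
            best_score, best_move = score, m
    return best_score, best_move

# X threatens to complete the top row; O to move. The search returns the
# blocking move (square 2) and a drawn game, with "block the threat" never
# having been written down anywhere.
print(negamax(list("XX..O...."), "O"))       # -> (0, 2)
```

(Whether anything in the current LLM training pipeline amounts to a search like this over real-world outcomes is exactly what is in dispute above.)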