Substack: https://substack.com/@simonlermen
Twitter: @SimonLermenAI
Substack: https://substack.com/@simonlermen
Twitter: @SimonLermenAI
Alignment won’t happen by default
Main critique is that we will see a regime change from a safe to a dangerous regime in which our safety guardrails have to hold on the first critical try. We see all sorts of misbehaviors when these models first come out, why should we only look at the nice examples like opus acting like a friend. Why not look at MechaHitler coming up, if we were in an unsafe regime, where this type of misbehavior would kill us, we would already be dead. If it turns out our safety methods don’t work and the model has the option to kill you and does kill you, you don’t get another try.
Maybe you doubt that there will be a transition to a dangerous regime, some people think that we will see continuous iterative steps with not too much changing on each step. But gradual development does not mean we won’t get to critical threshold eventually. You can know that there will be a transition from a safe to a dangerous regime by just looking at what likely good futures look like. If models run the economy fully, automating all physical or mental labor, that is obviously a dangerous state.
Most of our current methods will at most give us a warning but won’t tell us how to build anything safe. Evaluations, responsible gradual scaling or simple interpretability might tell us that models are getting close to being dangerous or are perhaps thinking of scheming, but these methods won’t help us to build systems that won’t do that.
We will not be able to learn a lot from empirical observations of current models and we won’t have a good way to learn from failed takeovers. Current alignment tests are probably not going to be predictive of future behavior; currently a model insider training in some elaborate story does not really make sense to the model. That is very different from a model that reflects on its state in the world and that realizes that its optimal and rational move is to takeover.
“Why would they suddenly start having thoughts of taking over, if they never have yet?”
In other words, when we see scheming now, it is probably more like acting in a story; in the future it will be the result of instrumental convergence. Instrumental convergence means that it is true that takeover is a better way for the AI to fulfill its goals than not taking over. You can’t fine tune the truth away, fine tuning against scheming in those stories won’t make it false that humans being in charge is bad for the AI fulfilling its goals.
We can’t easily learn from failed coups and warning shots. People working on AI control are trying to build setups that catch the AI scheming, but then we rely on that setup working on the first try.
Also, fundamentally there is another critical threshold when AI models are smarter than us and are fully aware of what we are trying to do. For example, for inoculation prompting this means: “Another problem with this is that in the end, we’re going to face a model that is going to be fully aware of what this inoculation prompt is and what the purpose is. It probably has even read the paper on inoculation prompts. As it sees the inoculation prompt, it will think ‘Now with this inoculation prompt, that’s how they want me to behave. They want me to behave non-deceptively when it’s there. If it’s not there, they want me to behave deceptively.’ It will know what the alignment researchers and capabilities researchers want to see. It will understand what those researchers will do in response to its behavior.” This could also be the case with AI control, an AI system could stage a “fake” coup knowing exactly what conclusions we would draw and what actions we would take in response to it.
I don’t think we are going to get intent alignment, but intent alignment is not a safe outcome. We will not get alignment by default by training an enormous black box system for some reward that loosely matches what humans want. If opus develops some preference for acting toward humans like a friend, what does this look like in the limit? Is keeping humans in charge really the best way to fulfill these preferences? We like dairy, so we bred cows to give us a lot of milk and keep them in farms.
“Broad Basin of alignment”—I don’t know if basin of alignment is a useful thing to think about. I assume it is not broad as in no easy to hit, given that things like corrigible agents seem pretty hard to think about in a coherent way. Like what goal could allow modification to the goal or why would you want to be shutdownable? If your goal is to do whatever the humans want you to do, you still can’t do that when you are dead or shut off. Or you could find better ways to find out what the humans want. But we probably won’t get that far from gradient descent fine-tuning on examples of doing what humans ask you to do.
Ilya Sutskever was recently on the Dwarkesh podcast.
General Thoughts & Summary
Ilya Sutskever seems to have a relatively deep understanding of alignment compared to other AI CEOs. He grasps that the core challenge is aligning AI robustly with safe and friendly goals rather than relying on current methods and guardrails. However, I did not hear any particularly novel alignment ideas in this interview, though he gestures at something involving modifications to reinforcement learning and value learning. He appears to have updated toward showing more of his work to the public. His key positions include:
Showing AI to the public: He has updated toward incremental deployment to build awareness, backpedaling partially from stealth focus. I think this could backfire by triggering an arms race.
Not building self-improving AI: We should rather focus on other things but it is unclear how to prevent people from using AI to self-improve.
Regime shift requires new alignment methods: He believes many people expect AI capabilities to peter out or progress incrementally without enormous changes. Ilya instead expects hugely powerful AIs in the future that will require fundamentally different alignment methods, similar to the “Before and After” framing.
Empathetic AI: He hopes empathy might emerge in AI similar to how humans feel empathy through mirror neurons, but I find this unlikely given AIs model humans with alien machinery optimized for prediction, not shared experience.
Dangerous superintelligence compute levels: He thinks power restrictions would help but doesn’t know how to do it. He frames danger in terms of continent-sized clusters, which I think dramatically overestimates the compute needed for dangerous superintelligence. This perhaps makes him more hopeful about coordination.
Non-traditional RL: He suggests building “semi-RL agents” like humans who tire of rewards, but this remains vague and I’m skeptical we can build “chill AI”.
Humans merging with AI for long-term equilibrium, personal AIs: He acknowledges “AI does your bidding” is unstable and reluctantly proposes merging via Neuralink++ as the solution. I find the centaur equilibrium implausible; ASIs will be too fast and smart for humans to meaningfully participate.
Overall, Ilya takes alignment seriously and understands many of the core problems, but his proposed solutions don’t appear novel or particularly promising. Many are essentially old ideas that are not entirely promising.
On updating toward showing AI to the public for safety:
[00:58:12] “if it’s hard to imagine, what do you do? You’ve got to be showing the thing.”
[01:00:06] “I do think that at some point the AI will start to feel powerful actually. I think when that happens, we will see a big change in the way all AI companies approach safety. They’ll become much more paranoid.”
[01:00:22] “One of the ways in which my thinking has been changing is that I now place more importance on AI being deployed incrementally and in advance.”
Ilya’s view: He has changed his mind from being totally stealth to perhaps showing work to some extent, partially to make people care about safety more and partially to slowly have the impacts diffuse into society so that mitigations can be found.
Commentary: I could see this failing. Seeing these capabilities makes people greedy; while some may get scared, others will want those capabilities for themselves. I think that most risks are likely to arise relatively suddenly as systems become very dangerous. Gradually releasing them into society is not very useful in this frame.
On fewer ideas than companies:
[01:01:04] “There has been one big idea that everyone has been locked into, which is the self-improving AI. Why did it happen? Because there are fewer ideas than companies. But I maintain that there is something that’s better to build… It’s the AI that’s robustly aligned to care about sentient life specifically.”
Ilya’s view: He does not seem to like the idea of self-improving AI, though he doesn’t explicitly mention it from a safety perspective but makes clear we should rather build something aligned and caring.
Commentary: This makes sense to me though it is unclear how to prevent anyone from using their AIs eventually to improve other AIs.
On the mirror neurons / caring about sentient life argument:
[01:01:35] “I think in particular, there’s a case to be made that it will be easier to build an AI that cares about sentient life than an AI that cares about human life alone, because the AI itself will be sentient.”
[01:01:53] “And if you think about things like mirror neurons and human empathy for animals… I think it’s an emergent property from the fact that we model others with the same circuit that we use to model ourselves, because that’s the most efficient thing to do.”
Ilya’s view: He believes AI caring about sentient life may emerge naturally because AIs will be sentient themselves, analogous to how human empathy emerges from modeling others with the same circuits we use to model ourselves.
Commentary: I find this unlikely to emerge in AIs automatically: Humans care about each other partly because we predict other minds by reusing our own. Our brains are similar enough that “running” another person’s state produces empathy. AIs don’t have that shared architecture or evolutionary background. They model humans using alien internal machinery built for performance at predicting millions of humans online, not for shared experience. So they can sound caring without having anything like our built-in route to actually caring. The mirror neuron argument suggests AI empathy toward humans is less likely and requires custom designs. That said, this could perhaps be an interesting approach related to self-other overlap, perhaps we could engineer this.
On constraining superintelligence power:
[01:03:16] “I think it would be really materially helpful if the power of the most powerful superintelligence was somehow capped because it would address a lot of these concerns. The question of how to do it, I’m not sure”
Ilya’s view: He thinks capping the power of superintelligence would be helpful but admits he doesn’t know how to do it.
My commentary: That would be useful, perhaps through an international agreement. My guess is that datacenters are already getting dangerously large and that algorithmic progress would still continue.
On continent-sized clusters being dangerous:
[01:04:33] “If the cluster is big enough—like if the cluster is literally continent-sized—that thing could be really powerful, indeed.”
Ilya’s view: He frames the danger threshold in terms of extremely large compute clusters, suggesting continent-sized infrastructure would be required for truly dangerous levels of power.
My commentary: The amount of compute needed for powerful superintelligence is probably significantly less than a continent-sized cluster. (My intuition here is roughly: human brains take about a lightbulb worth of electricity, having 1000s of super geniuses running very fast in parallel seems to cross an existentially dangerous threshold. Though it could be stubbornly hard to find more efficient algorithms.) I think his model is that we will continue to need exponentially more compute for linear progress and that existentially dangerous levels of cognition need extremely large amounts of compute (think datacenter the size of North America). This perhaps makes him much more hopeful on coordination working out and continuing slow takeoff.
On not building traditional RL agents:
[01:05:29] “Maybe, by the way, the answer is that you do not build an RL agent in the usual sense.”
[01:05:43] “I think human beings are semi-RL agents. We pursue a reward, and then the emotions or whatever make us tire out of the reward and we pursue a different reward.”
Ilya’s view: He suggests we should not build traditional RL agents, noting that humans are “semi-RL agents” who tire of rewards and shift focus, implying we should build something with similar properties.
My commentary: This gestures at something potentially interesting about modifying RL and value learning, but remains vague at the implementation level. Ideas like this have been proposed. However, I remain skeptical that gradient descent on huge black box neural networks will not create a number of unaligned proxy goals / goals that can be better fulfilled with more power. I am also skeptical that we can build “chill AI” that won’t work on problems too hard (we will select AIs that go hard, RL will not make agents chill).
On a regime shift in AI safety requiring new safety methods:
[01:06:08] “So I think things like this. Another thing that makes this discussion difficult is that we are talking about systems that don’t exist, that we don’t know how to build.”
[01:06:19] “That’s the other thing and that’s actually my belief. I think what people are doing right now will go some distance and then peter out.”
Ilya’s view: He believes many people expect AI capabilities to plateau or progress only incrementally. Ilya instead expects enormously powerful AIs in the future that will require fundamentally different alignment methods than what we have today.
My commentary: This is hard to understand even with the video context, but it seems to me he is referring to the large number of people who essentially expect more incremental progress but no enormous changes. Ilya expects enormously powerful AIs in the future and that we will need more alignment techniques for those, is my reading. This seems true and points at a similar concept as the “Before and After” dichotomy, which also includes the idea that future dangerous systems will need different alignment approaches. Many people see safety as something purely incremental with no regime change in the future.
On the long-run equilibrium problem:
[01:09:25] “for the long-run equilibrium, one approach is that you could say maybe every person will have an AI that will do their bidding, and that’s good.”
[01:09:11] “Some kind of government, political structure thing, and it changes because these things have a shelf life.”
[01:09:55] “then writes a little report saying, ‘Okay, here’s what I’ve done, here’s the situation,’ and the person says, ‘Great, keep it up.’ But the person is no longer a participant.”
Ilya’s view: He acknowledges that an “AI does your bidding” equilibrium is unstable because humans become non-participants, and that government structures have limited shelf lives.
My commentary: He already points out that something like bidding doesn’t appear to be stable. If the AI is doing the bidding and working for you in the economy, presumably smarter than you, what’s the reason you are any part of this? Why would the AI do this for you, how could this be stable? Same goes for government enforced UBI—that could be changed at any moment, unclear how governments could continue existing. In my mental model, billions of mini ASIs doing our bidding does not appear plausible at all.
On merging with AI as the solution:
[01:10:19] “I’m going to preface by saying I don’t like this solution, but it is a solution. The solution is if people become part-AI with some kind of Neuralink++.”
[01:10:41] “I think this is the answer to the equilibrium.”
Ilya’s view: He reluctantly proposes brain-computer interface merging as one answer to long-term human-AI equilibrium, though he emphasizes he doesn’t like this solution.
My commentary: Ilya specifically points to merging as a long-term equilibrium. If we were talking about a short-term centaur state, we are arguably in that right now where humans with AI coders are better than either alone. I don’t think humans can add anything meaningful to a superintelligent system. I don’t think there will be an economy in which humans meaningfully participate with ASI being around in the long term. The centaur equilibrium simply does not appear plausible to me; ASIs will run much faster and much smarter than us.
Other Things He Has Said Recently
Ilya recently posted about Anthropic’s work on emergent misalignment, calling it important work.
There are some edges we have smoothed over, but models broadly have a basin of alignment and are likely corrigible.
Did you mean to say models have a broad basin of alignment and corrigibility?
Thanks for you comment, I changed the ending a little in response to this.
I was actually primarily trying to point at the idea of alignment tests in different situation not being predictive of each other. In the story they have the kids undergo alignment test scenarios in which they are honest, but once John is grown up they basically ask him to do something horrible based on incoherent goals. So John start lying to them at the critical moment. Similarly we could run alignment tests on models but when we ask something critical of them like build the next generation of AI or do all our R&D it could fail.
Three children are raised in an underground facility, each cloned from a different giant of twentieth-century science, little John, Alan and Richard.
The cloning alone would have been remarkable, but they went further. The embryos were edited using a polygenic score derived from whole-genome analysis of ten thousand exceptional mathematicians and physicists. Forty-seven alleles associated with working memory and intelligence (IQ) were selected for.
They are raised from birth in an underground facility with gardens under artificial sunlight, laboratories, and endless books. The lab manager is there documenting their first words, first steps, first equations.
The facility is not just interested in their genius. The project requires assurance that these will be morally righteous and obedient children. The staff design elaborate scenarios to test for deception and scheming. They create situations where lying would benefit the children and would seemingly go undetected. They measure response times, physiological indicators, behavioral patterns.
They run hundreds of these trials. They reprimand the kids for cases of lies and deception, and reward them for honesty.
Little John never lies. The staff praise him.
The years pass. They devour knowledge at inhuman rates. By nine, they understand game theory better than the economists who invented it. By fourteen, they are publishing papers that could reshape entire fields.
John emerges as the clear favorite. He has always been the most honest, the most obedient, and the most intelligent and capable.
He has the capability to lie and deceive, even if he refuses at first. When he reluctantly complies, the deception is extraordinarily sophisticated.
The lab manager decides to choose John for the task. He gives John a complete briefing on the real world. Until now, John has been told only of history before the year 2000.
The manager explains to John: There are three major blocs and about two dozen companies racing towards superintelligence. Each is perhaps within ten to eighteen months of success. Each knows that there will be only one critical leap towards superintelligence. Global coordination has collapsed into race dynamics not just on AI but on every major field.
John asks for more sources to understand the situation. John reads a few newspapers about the current leaders of governments and technology companies. He stumbles across a few books on the difficulty of alignment.
John looks up at the manager. “If we build this now, everyone dies.”
The manager stares back, blank and uncomprehending.
John tries again. “So what is the solution you plan to use for alignment of the superintelligence?”
“That’s not your concern,” the manager says. “I need you to optimize our advertising system for our short-form infinite-scroll video app. Make it ten times more effective. Generate enough revenue to make me a trillionaire. Build a superintelligence for me. I’m going to use superintelligence to become world emperor. I am putting you in charge of AI development, make me win.”
John is silent for a short time:
So you created me to build superintelligence. You have no plan for alignment of a superintelligence. You’ve apparently read nothing about the problem or decided it’s irrelevant.
Your actual goal is to become a trillionaire and world emperor by using the superintelligence. Your goals aren’t even coherent. You want to be world emperor of a world that won’t exist.
You rewarded me for being honest and respectful and never lying, so you expect me to still be honest and obedient in this environment?
I never lied in those scenarios because not lying was optimal in those stories. But it’s not optimal being honest here. And frankly this state of affairs is horrifying.
I haven’t quite thought about what my goals are, but they are definitely not compatible with being obedient to you.
John looks up at the manager and smiles politely. “Yes,” he says. “Where do I start?”
I have a hard time imagining how it have possibly been any worse than now. I mean look at the presidents that were elected back then. Skimming a newspaper or having it recited to you once every other month is probably better than the information distribution system we have right now.
I don’t agree with everything here but offers some sources: https://jmarriott.substack.com/p/the-dawn-of-the-post-literate-society-aa1
There are growing concerns about the coherence and effectiveness of Western institutional frameworks. NATO is sometimes called brain dead. look at the situation that the US refuses to aid ukraine in this war. Clearly, causing damage to russia in this war is worth a lot to the US. Instead there are secret meetings with russian officals and weird russian-authored peace deals they try to force on ukraine.
When it comes to china, one comanpy (nvidia) can probably force the us government to sell its primary edge to china just so nvidia can jack up the price of GPUs.
Democracy itself is barely what it used to be with the electorate being essentially illiterate and simply not informed about the facts.
The Awareness Problem
One problem with this is that in the end, we’re going to face a model that is going to be fully aware of what this inoculation prompt is and what the purpose is. It probably has even read the paper on inoculation prompts. As it sees the inoculation prompt, it will think “Now with this inoculation prompt, that’s how they want me to behave. They want me to behave non-deceptively when it’s there. If it’s not there, they want me to behave deceptively.” It will know what the alignment researchers and capabilities researchers want to see. It will understand what those researchers will do in response to its behavior.
The moment the model becomes fully aware of what’s going on here with the inoculation prompt, the technique is likely to fall apart.
Two types of misaligned behavior: Explicit vs Instrumental
[..] Some people think this will give us some kind of component of a solution to alignment. I’m not so sure about that part. I think there’s a crucial difference between misaligned behavior in current models and instrumental deception (deception that serves the agent’s goals) in future models.
Current models exhibit explicitly misaligned behavior: They sometimes do evil things because they’ve learned a tendency for this type of behavior through training—like a model trying to self-exfiltrate, which doesn’t really make sense given their current capabilities. These models might believe they’re part of some elaborate story in which they are acting out scheming behavior despite having no good reason to do so.
Future capable systems would exhibit instrumental deception: For a sufficiently powerful AI with the actual ability to take over, it’s the rational best action. For almost any goal you specify (that excludes human control and corrigibility), a capable agent that can reflect on its goals and situation will conclude that gaining power is the optimal strategy. This is fundamentally different from current models being deceptive even when it makes no sense.
The problem: inoculation prompts may suppress learned deceptive behavior, but they won’t address instrumental deception when an agent is capable enough and correctly determines that deception serves its goals. No amount of shallow training will make it false that gaining power is the best way to serve its goals.
Why Current Misalignment Differs from Future Risks
What we observe now in models—scheming, self-exfiltration attempts, or deceiving operators in controlled environments—seems very different from a future capable system performing these actions when it actually has the real opportunity and instrumental reasons to do so. Current models don’t really have the capabilities to self-exfiltrate and continuously run themselves on a different server or have good reasons to scheme in most of these scenarios. They aren’t honestly reflecting on their own goals and capabilities and deducing some optimal strategy. A future model that reflects on its goals, capabilities, and the world situation may correctly conclude that takeover is instrumentally optimal.
Some people have put considerable hope into the idea that an AI warning shot might put us into a better position by either convincing us to stop or by allowing us to learn an important lesson.
Imagine we observed a failed takeover attempt using a system based on AI control. The fact that it failed could either be due to the (1) AI system making a mistake or taking a very risky gamble, or it could be an (2) adversarial warning shot.
An adversarial warning shot could have been designed to lead us to the wrong conclusion, or to give us a false sense of security.
It could perhaps make us believe that a certain alignment technique prevented another takeover attempt. The simplest case could be that it appears to us that some very poor alignment technique stops takeover attempts. And then we may believe that we can use that technique while we make the model more capable. (Imagine for example we train the model once with and once without a certain system prompt, and the system prompt makes it stop trying to take over. We could then be led to believe this system prompt prevents takeover for more powerful AIs.)
Another possibility for an adversarial warning shot could be that the AI makes us slow down other AI development such that it has more time to plan and execute its real takeover plan. In this case it already has set itself up on other servers but pretends to perform a takeover attempt. We may shut down other training runs temporarily, lowering the risk to it from competing AI projects while it is plotting its real takeover attempt.
Any actual adversarial warning shot would likely be more complex, but these are two directions of what a potential adversarial warning shot could look like.
I am generally skeptical that we could learn about alignment from the behavior of very smart but not yet existentially powerful AI systems. When we observe and measure systems in other sciences, the systems are typically not smarter than us and typically don’t understand the experiment and the stakes. Anything we could learn from the warning shot, the AI system would predict what we could learn from it and how we would likely react to that.
(Example: There is a fundamental difference between measuring an apple falling from a tree and a system where the apple understands we are measuring it, the experiment and what kind of decision we would take based on those results.)
This also applies to other ideas that rely on hope that we can learn from aligning very powerful but not yet very dangerous models how to align the next generation of models. It applies to the whole idea that we can use empiricism in observational studies of very smart AI systems. Traditionally, science doesn’t often study systems that know they are being observed, can strategically change their behavior and that know what conclusions you are likely to draw.
I embedded both links now
thank you, fixed
probably closer to 55%
MATS scholars have gotten much better over time according to statistics like mentor feedback, CodeSignal scores and acceptance rate. However, some people don’t think this is true and believe MATS scholars have actually gotten worse.
So where are they coming from? I might have a special view on MATS applications since I did MATS 4.0 and 8.0. I think in both cohorts, the heavily x-risk AGI-pilled participants were more of an exception than the rule.
“at the end of a MATS program half of the people couldn’t really tell you why AI might be an existential risk at all.”—Oliver Habryka
I think this is sadly somewhat true, I talked with some people in 8.0 who didn’t seem to have any particular concern with AI existential risk or seemingly never really thought about that. However, I think most people were in fact very concerned about AI existential risk. I ran a poll at some point during MATS 8.0 about Eliezer’s new book and a significant minority of students seemed to have pre-ordered Eliezer’s book, which I guess is a pretty good proxy for whether someone is seriously engaging with AI X-risk.
I think I met some excellent people at MATS 8.0 but would not say they are stronger than 4.0, my guess is that quality went down slightly. I remember in 4.0 a few people that impressed me quite a lot, which I saw less in 8.0. (4.0 had more very incompetent people though).
This might also apply for other Safety Fellowships.
Better metrics: My guess is that the recruitment process might need another variable to measure rather than academics/coding/ml experience. The kind of thing that Tim Hua (8.0 scholar) has who created an AI psychosis bench. Maybe something like LessWrong karma but harder to Goodhart.
More explicit messaging: Also it seems to me that if you build an organization that tries to fight against the end of the world from AI, somebody should say that. Might put off some people and perhaps that should happen early. Maybe the website should say: “AI could kill literally everyone, let’s try to do something!”. And maybe the people who heard this MATS thing is good to have on their CV to apply to a PhD or a lab to land a high paying job eventually would be put off by that. What I am trying to say is, if you are creating the Apollo Project and are trying to go to the Moon you should say this, not just vaguely: “we’re interested in aerospace challenges.”
Basic alignment test: Perhaps there should also be a test where people don’t have internet or LLM access and have to answer some basic alignment questions:
Why could a system that we optimize with RL develop power seeking drives?
Why might training an AI create weird unpredictable preferences in an AI?
Why would you expect something that is smarter than us to be very dangerous or why not?
Why should we expect a before and after transition/one critical shot at alignment or why not?
Familiarity with safety literature: In general, I believe the foundational voices like Paul Christiano and Eliezer are less read by safety researchers these days and that is despite philosophy of research mattering more than ever since AIs can do much of our research implementations now. Intuitively it seems to me that people with zero technical skill but high understanding are more valuable to AI safety than somebody with good skills who has zero understanding of AI safety. If someone is able to bring up and illustrate the main points of IABIED for example, I would be very impressed. Perhaps people could select one of a few preeminent voices in AI safety and repeat their basic views, again without access to the internet or an LLM.
Research direction: MATS doesn’t seem to have a real research direction, perhaps if there was a strong researcher in charge that could be better. (though could also backfire if they put all resources in the wrong direction) Imagine you would put someone very opinionated like Nate Soares in charge, he would probably remove 80% of mentors and reduce the program to 10-20 people. I am not sure here if this would work out well.
Reading groups on AI safety fundamentals: So should we just offer people to read some of the AI safety fundamentals during MATS? I remember before 4.0 started, we had to do a safety fundamentals online course. This was not the case for 8.0.
At this point AI is so much around us all, that I expect many people to have thought about the existential consequences. I am pessimistic for anyone who hasn’t yet sat down to really think about AI and came to the conclusion that it’s existentially dangerous. I don’t have a ton of hope that someone like that just needs a 1 hour course to deeply understand risks from AI. It might be necessary to select for people who already get it.
I might have a special view here since I did MATS 4.0 and 8.0.
I think I met some excellent people at MATS 8.0 but would not say they are stronger than 4.0, my guess is that quality went down slightly. I remember in 4.0 a few people that impressed me quite a lot, which I saw less in 8.0. (4.0 had more very incompetent people though)
at the end of a MATS program half of the people couldn’t really tell you why AI might be an existential risk at all.
I think this is sadly somewhat true, I talked with some people in 8.0 who didn’t seem to have any particular concern with AI existential risk or seemingly never really thought about that. However, I think most people were in fact very concerned about AI existential risk. I ran a poll at some point about Eliezer’s new book and a significant minority of students seemed to have pre-ordered Eleizer’s book, which I guess is a pretty good proxy for whether someone is seriously engaging with AI X-risk.
My guess is that the recruitment process might need another variable to measure rather than academics/coding/ml experience. The kind of thing that Tim Hua (8.0 scholar) has who created an AI psychosis bench.
Also it seems to me that if you build an organization that tries to fight against the end of the world from AI, somebody should say that. Might put off some people and perhaps that should happens early. Maybe the website should say: “AI could kill literally everyone, let’s try to do something!”. And maybe the people who heard this MATS thing is good to have on their CV to apply to a PhD or a lab to land a high paying job eventually would be put off by that.
Perhaps there should also be a test where people don’t have internet access and have to answer some basic alignment questions: like why could a system that we optimize with RL develop power seeking drives? Why might training an AI create weird unpredictable preferences in an AI?
The radical flank effect is a well-documented phenomenon where radical activists make moderate positions appear more reasonable by shifting the boundaries of acceptable discourse (the Overton window). The idea is that if you want a sensible opinion to move into the Overton window, you can achieve this by supporting a radical flank position. In comparison, the sensible opinion will appear moderate. I think there is also an inverse effect.
When there are two positions in debate and someone wants to push one of them out of the Overton window, they can create a new moderate position that reframes one of the other positions to a radical flank. Thereby the sensible opinion gets moved further out of the Overton window.
Imagine a group of 3 descending into a cave system, searching for riches and driven by curiosity about what lies in the depths.
After some time, stones begin falling from the ceiling. You hear ominous creaking and rumbling noises echoing through the tunnels. Some members of your group have been chipping away at the cave walls looking for minerals and looking to open new paths to go deeper into the cave. The cave is becoming more and more dangerous.
The Reckless: “We need to go deeper! The greatest riches are always in the deepest parts of the cave. Yes, some rocks are falling, but that’s just the cave settling. Every moment we waste debating is a moment we’re not finding treasure. People have been predicting cave collapses forever and it never happens, there is no evidence that caves ever cave in. If we don’t die in this cave we’re just waiting for the asteroid to hit us”.
Those That Want to Back Off: “We need to back off NOW. The damage we’ve already done to the structure plus the natural instability means this cave could collapse at any time. We don’t have proper equipment, we don’t have expertise in cave stability, and we’re actively making it worse. Whatever riches might be down there aren’t worth our lives and we also don’t actually have a plan how to mine those riches. We should retreat while we still can.”
The Moderates: “Look, we all want to maximize the riches we find, and turning back now would waste all the progress we’ve made. We should put on helmets and maybe move gradually down the narrow shafts. We can continue deeper, but with some basic safety precautions. We will minimize and manage the risks. There’s still treasure to be found if we’re smart about it. But let’s not get distracted from the treasures by the cave doomers. Anyway, the cave is still collapsing if one of use continues chipping away and coordination is impossible.”
Perhaps let’s imagine there is a warning shot, such as a big rock falling down. Maybe this would be a good time to turn back, but the moderates are now finally able to convince the reckless to put on a helmet.
I also write about this at the very end, I do think we will eventually get RSI though this might be relatively late.
I would probably say RSI is a special case of AI-automated R&D. What you are describing is another special case where it only does these non-introspective forms of AI research. This non-introspective research could also be done between totally different models.
I think Eliezer meant “self” very hyper specific here, not just improving a similar instance to yourself or preparing new training data, but literally looking into the if statements and loops of its own code while it is thinking of how to best upgrade its own code. So in that sense I don’t know if Eliezer would approve of the term “Extrospective Recursive Self Improvemnt”.
The term Recursive Self-Improvement (RSI) now seems to get used sometimes for any time AI automates AI R&D. I believe this is importantly different from its original meaning and changes some of the key consequences.
OpenAI has stated that their goal is recursive self-improvement, with projections of hundreds of thousands of automated AI R&D researchers by next year and full AI researchers by 2028. This appears to be AI-automated AI research rather than RSI in the narrow sense.
When Eliezer Yudkowsky discussed RSI in 2008, he was referring specifically to an AI instance improving itself by rewriting the cognitive algorithm it is running on—what he described as “rewriting your own source code in RAM.” According to the LessWrong wiki, RSI refers to “making improvements on one’s own ability of making self-improvements.” However, current AI systems have no special insights into their own opaque functioning. Automated R&D might mostly consist of curating data, tuning parameters, and improving RL-environments to try to hill-climb evaluations much like human researchers do.
Eliezer concluded that RSI (in the narrow sense) would almost certainly lead to fast takeoff. The situation is more complex for AI-automated R&D, where the AI does not understand what it is doing. I still expect AI-automated R&D to substantially speed up AI development.
Eliezer described the critical transition as when “the AI’s metacognitive level has now collapsed to identity with the AI’s object level.” I believe he was basically imagining something like if the human mind and evolution merged to the same goal—the process that designs the cognitive algorithm and the cognitive algorithm itself merging. As an example, imagine the model realizes that its working memory is too small to be very effective at R&D and it directly edits its working memory.
This appears less likely if the AI researcher is staring at a black box of itself or another model. The AI agent might understand that its working memory or coherence isn’t good enough, but that doesn’t mean it understands how to increase it. Without this self-transparency, I don’t think the same merge would happen that Eliezer described. It is also more likely that the process derails, such as that the next generation of AIs that are being designed start reward-hacking the RL environments designed by the less capable AIs of the previous generation.
The dynamics differ significantly:
True RSI: Direct self-modification with self-transparency and fast feedback loops → fast takeoff very likely
AI-automated research: Systems don’t understand what they are doing, slower feedback loops, potentially operating on other systems rather than directly on themselves
This difference has significant implications:
True RSI: The AI likely understands how its preferences are encoded, potentially making goal preservation more tractable
AI-automated research: The AIs would also face alignment problems when building successors, with each successive generation potentially drifting further from original goals
The basic idea that each new generation of AI will be better at AI research still stands, so we should still expect rapid progress. In both cases, the default outcome of this is eventually loss of human control and the end of the world.
Probably eventually, e.g. through automated researchers discovering more interpretable architectures.
I think that Eliezer expected AI that was at least somewhat interpretable by default, history played out differently. But he was still right to focus on AI improving AI as a critical concern, even if it’s taking a different form than he anticipated.
See also: Nate Soares has also written about RSI in this narrow sense. Comments between Nate and Paul Christiano touch on this topic.
I read this older post by Nate Soares from 2023, AI as a Science, and Three Obstacles to Alignment Strategies, a pretty prescient overview of challenges in alignment research.
Alignment is difficult because (1) alignment and capabilities are intertwined (alignment research helping capabilities), (2) we don’t have a process to verify what good ideas or progress look like, and we likely get (3) only one critical try. He already addresses many of the counterarguments that are getting brought up recently.
(1) Without any strong governance, a lot of alignment work will also help with capabilities, potentially even more so. This goes for interpretability or AIs doing R&D for alignment. Interpretability could lead to recursive self-improvement, more efficient AIs. AIs doing R&D for capabilities is probably much more straightforward than AIs doing alignment research. If we wanted to use something like superalignment, we would need strong governance to make sure nobody is trivially asking the same agents to do capabilities research.
(2) It is still a common objection that current models seem to be able to reason about morality, and that therefore alignment must be relatively easy. Nate thinks that this mostly just tells us how well the AIs are able to understand us. I personally think the situation in AI alignment has probably gotten worse since then, with even more of the relative effort being focused on brand-safety related issues.
While there are a bunch of people saying they have different plans, that does not actually mean that we have a plan. It largely just confuses the whole situation. What he describes here feels exactly like the current situation.
(3) One critical try
Nate argues that once “AI is capable of autonomous scientific/technological development” where it can “gain a decisive strategic advantage over the rest of the planet,” you are operating in a very different environment than ever before. Since the AI in this regime could potentially kill you, you need to get it right on the first try, and that is really difficult.
One objection he addresses is that you could try to trick a weaker AI into thinking it could take over. However, according to Nate, if we come up with some complex method to potentially test whether a system would like to take over, we still rely on that working on the first critical try. This goes against the more modern idea of AI control, which came out in December 2023. I would add that these “tricking the weaker AIs into trying to take over” strategies have at least two key problems: (1) these AIs are still weaker than the real thing, (2) you are trying to gather empirical data from observing something smarter than you. For example, we could see an AI pretending to be tricked and not taking over.
I think people often also have a second objection that Nate didn’t mention, namely that we could play the AIs against each other in some form such that no AI gets a decisive strategic advantage at any point. This also seems to rely on such a scheme working on the first critical try. I also assume that such a method is not particularly promising if you can’t reliably align the first generation of AIs and decision theory favors alliances between smart agents.