Ilya’s Thoughts on Alignment from Dwarkesh Podcast
Ilya Sutskever was recently on the Dwarkesh podcast.
General Thoughts & Summary
Ilya Sutskever seems to have a relatively deep understanding of alignment compared to other AI CEOs. He grasps that the core challenge is aligning AI robustly with safe and friendly goals rather than relying on current methods and guardrails. However, I did not hear any particularly novel alignment ideas in this interview, though he gestures at something involving modifications to reinforcement learning and value learning. He appears to have updated toward showing more of his work to the public. His key positions include:
Showing AI to the public: He has updated toward incremental deployment to build awareness, partially backpedaling from his earlier stealth focus. I think this could backfire by triggering an arms race.
Not building self-improving AI: He thinks we should build something else instead, but it is unclear how to prevent people from using AI to improve AI.
Regime shift requires new alignment methods: He believes many people expect AI capabilities to peter out or progress incrementally without enormous changes. Ilya instead expects hugely powerful AIs in the future that will require fundamentally different alignment methods, similar to the “Before and After” framing.
Empathetic AI: He hopes empathy might emerge in AI similar to how humans feel empathy through mirror neurons, but I find this unlikely given AIs model humans with alien machinery optimized for prediction, not shared experience.
Dangerous superintelligence compute levels: He thinks power restrictions would help but doesn’t know how to do it. He frames danger in terms of continent-sized clusters, which I think dramatically overestimates the compute needed for dangerous superintelligence. This perhaps makes him more hopeful about coordination.
Non-traditional RL: He suggests building “semi-RL agents” like humans who tire of rewards, but this remains vague and I’m skeptical we can build “chill AI”.
Humans merging with AI for long-term equilibrium, personal AIs: He acknowledges “AI does your bidding” is unstable and reluctantly proposes merging via Neuralink++ as the solution. I find the centaur equilibrium implausible; ASIs will be too fast and smart for humans to meaningfully participate.
Overall, Ilya takes alignment seriously and understands many of the core problems, but his proposed solutions are mostly old ideas that do not appear particularly promising.
Details (Dwarkesh Patel Interview)
On updating toward showing AI to the public for safety:
[00:58:12] “if it’s hard to imagine, what do you do? You’ve got to be showing the thing.”
[01:00:06] “I do think that at some point the AI will start to feel powerful actually. I think when that happens, we will see a big change in the way all AI companies approach safety. They’ll become much more paranoid.”
[01:00:22] “One of the ways in which my thinking has been changing is that I now place more importance on AI being deployed incrementally and in advance.”
Ilya’s view: He has changed his mind from being totally stealth to perhaps showing work to some extent, partially to make people care about safety more and partially to slowly have the impacts diffuse into society so that mitigations can be found.
Commentary: I could see this backfiring. Seeing these capabilities makes people greedy: while some may get scared, others will want those capabilities for themselves. I also think most risks are likely to arise relatively suddenly as systems become very dangerous, and gradually releasing earlier systems into society does not help much in this frame.
On fewer ideas than companies:
[01:01:04] “There has been one big idea that everyone has been locked into, which is the self-improving AI. Why did it happen? Because there are fewer ideas than companies. But I maintain that there is something that’s better to build… It’s the AI that’s robustly aligned to care about sentient life specifically.”
Ilya’s view: He does not seem to like the idea of self-improving AI. He doesn’t explicitly frame this as a safety concern, but he makes clear we should rather build something aligned and caring.
Commentary: This makes sense to me though it is unclear how to prevent anyone from using their AIs eventually to improve other AIs.
On the mirror neurons / caring about sentient life argument:
[01:01:35] “I think in particular, there’s a case to be made that it will be easier to build an AI that cares about sentient life than an AI that cares about human life alone, because the AI itself will be sentient.”
[01:01:53] “And if you think about things like mirror neurons and human empathy for animals… I think it’s an emergent property from the fact that we model others with the same circuit that we use to model ourselves, because that’s the most efficient thing to do.”
Ilya’s view: He believes AI caring about sentient life may emerge naturally because AIs will be sentient themselves, analogous to how human empathy emerges from modeling others with the same circuits we use to model ourselves.
Commentary: I find this unlikely to emerge in AIs automatically. Humans care about each other partly because we predict other minds by reusing our own; our brains are similar enough that “running” another person’s state produces empathy. AIs don’t have that shared architecture or evolutionary background. They model humans using alien internal machinery built for performance at predicting millions of humans online, not for shared experience, so they can sound caring without having anything like our built-in route to actually caring. If anything, the mirror neuron argument suggests AI empathy toward humans is less likely by default and would require custom design. That said, this is an interesting direction related to self-other overlap; perhaps something like it could be engineered deliberately.
On constraining superintelligence power:
[01:03:16] “I think it would be really materially helpful if the power of the most powerful superintelligence was somehow capped because it would address a lot of these concerns. The question of how to do it, I’m not sure”
Ilya’s view: He thinks capping the power of superintelligence would be helpful but admits he doesn’t know how to do it.
My commentary: That would be useful, perhaps via an international agreement. My guess is that datacenters are already getting dangerously large and that algorithmic progress would continue regardless.
On continent-sized clusters being dangerous:
[01:04:33] “If the cluster is big enough—like if the cluster is literally continent-sized—that thing could be really powerful, indeed.”
Ilya’s view: He frames the danger threshold in terms of extremely large compute clusters, suggesting continent-sized infrastructure would be required for truly dangerous levels of power.
My commentary: The amount of compute needed for powerful superintelligence is probably significantly less than a continent-sized cluster. (My intuition here is roughly: human brains run on about a lightbulb’s worth of electricity, and thousands of super-geniuses running very fast in parallel seems enough to cross an existentially dangerous threshold, though it could be stubbornly hard to find algorithms that efficient.) I think his model is that we will continue to need exponentially more compute for linear progress, and that existentially dangerous levels of cognition need extremely large amounts of compute (think a datacenter the size of North America). This perhaps makes him much more hopeful about coordination working out and a continuing slow takeoff.
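To make the biological-efficiency intuition above concrete, here is a rough back-of-envelope calculation. Every number is an assumption of mine (brain wattage, headcount, speedup, datacenter size), not a figure from the interview:

```python
# Back-of-envelope sketch of the biological-efficiency intuition.
# All numbers are rough assumptions, not claims from the interview.

BRAIN_WATTS = 20      # human brain power draw, roughly a dim lightbulb
N_GENIUSES = 10_000   # "thousands of super-geniuses" from the text
SPEEDUP = 100         # assumed serial speedup, with power scaling linearly

# Power for the hypothetical dangerous ensemble, IF algorithms ever
# approached brain-level efficiency (a big "if").
ensemble_watts = BRAIN_WATTS * N_GENIUSES * SPEEDUP  # 20 MW

# A single large AI datacenter today is commonly cited in the
# hundreds-of-megawatts range; a "continent-sized" cluster would be
# orders of magnitude beyond even that.
datacenter_watts = 300e6

print(f"ensemble: {ensemble_watts / 1e6:.0f} MW, "
      f"one datacenter: {datacenter_watts / 1e6:.0f} MW")
```

Under these assumptions the dangerous ensemble fits comfortably inside a single existing datacenter, which is the crux of why the continent-sized framing looks like an overestimate.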
On not building traditional RL agents:
[01:05:29] “Maybe, by the way, the answer is that you do not build an RL agent in the usual sense.”
[01:05:43] “I think human beings are semi-RL agents. We pursue a reward, and then the emotions or whatever make us tire out of the reward and we pursue a different reward.”
Ilya’s view: He suggests we should not build traditional RL agents, noting that humans are “semi-RL agents” who tire of rewards and shift focus, implying we should build something with similar properties.
My commentary: This gestures at something potentially interesting about modifying RL and value learning, but remains vague at the implementation level, and ideas like this have been proposed before. I remain worried that gradient descent on huge black-box neural networks will create unaligned proxy goals, i.e. goals that can be better fulfilled by acquiring more power. I am also skeptical that we can build “chill AI” that won’t push too hard on problems: we will select for AIs that go hard, and RL will not make agents chill.
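The “tiring of rewards” idea can be illustrated with a toy satiating-reward rule. This is purely my own sketch of the mechanism, not anything Ilya proposed: each pursuit of a goal decays that goal’s reward, so even a greedy agent rotates between goals rather than maximizing one forever.

```python
# Toy "semi-RL" sketch: rewards satiate with consumption, so a greedy
# agent alternates between goals instead of locking onto one.
# Goal names, base rewards, and the decay rate are all illustrative.

def satiating_reward(base, times_consumed, decay=0.5):
    """Reward shrinks geometrically each time the goal is pursued."""
    return base * (decay ** times_consumed)

base = {"food": 1.0, "novelty": 0.8}
consumed = {"food": 0, "novelty": 0}
history = []

for step in range(6):
    # Greedy choice under satiation: pick the currently most rewarding goal.
    goal = max(base, key=lambda g: satiating_reward(base[g], consumed[g]))
    consumed[goal] += 1
    history.append(goal)

print(history)  # the agent ends up alternating between the two goals
```

Of course, this says nothing about the hard part: a satiating outer reward does not prevent a capable optimizer from forming power-seeking proxy goals internally, which is where my skepticism lies.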
On a regime shift in AI safety requiring new safety methods:
[01:06:08] “So I think things like this. Another thing that makes this discussion difficult is that we are talking about systems that don’t exist, that we don’t know how to build.”
[01:06:19] “That’s the other thing and that’s actually my belief. I think what people are doing right now will go some distance and then peter out.”
Ilya’s view: He believes many people expect AI capabilities to plateau or progress only incrementally. Ilya instead expects enormously powerful AIs in the future that will require fundamentally different alignment methods than what we have today.
My commentary: This is hard to parse even with the video context, but my reading is that he is referring to the many people who expect only incremental progress and no enormous changes, whereas Ilya expects enormously powerful AIs in the future that will need new alignment techniques. This seems true, and it points at a similar concept to the “Before and After” dichotomy, which likewise holds that future dangerous systems will need different alignment approaches, while many people see safety as purely incremental with no regime change.
On the long-run equilibrium problem:
[01:09:25] “for the long-run equilibrium, one approach is that you could say maybe every person will have an AI that will do their bidding, and that’s good.”
[01:09:11] “Some kind of government, political structure thing, and it changes because these things have a shelf life.”
[01:09:55] “then writes a little report saying, ‘Okay, here’s what I’ve done, here’s the situation,’ and the person says, ‘Great, keep it up.’ But the person is no longer a participant.”
Ilya’s view: He acknowledges that an “AI does your bidding” equilibrium is unstable because humans become non-participants, and that government structures have limited shelf lives.
My commentary: He rightly points out that the bidding arrangement does not appear stable. If the AI is doing your bidding and working for you in the economy, and is presumably smarter than you, why are you any part of this? Why would the AI keep doing this for you, and how could that be stable? The same goes for government-enforced UBI: it could be changed at any moment, and it is unclear how governments themselves would continue existing. In my mental model, billions of mini-ASIs doing our bidding does not appear plausible at all.
On merging with AI as the solution:
[01:10:19] “I’m going to preface by saying I don’t like this solution, but it is a solution. The solution is if people become part-AI with some kind of Neuralink++.”
[01:10:41] “I think this is the answer to the equilibrium.”
Ilya’s view: He reluctantly proposes brain-computer interface merging as one answer to long-term human-AI equilibrium, though he emphasizes he doesn’t like this solution.
My commentary: Ilya specifically points to merging as a long-term equilibrium. If we were talking about a short-term centaur state, we are arguably in one right now, where humans working with AI coders outperform either alone. But I don’t think humans can add anything meaningful to a superintelligent system, and I don’t think there will be an economy in which humans meaningfully participate once ASI is around in the long term. The centaur equilibrium simply does not appear plausible to me; ASIs will run much faster and be much smarter than us.
Other Things He Has Said Recently
Ilya recently posted about Anthropic’s work on emergent misalignment, calling it important work.