I tried it with DeepSeek. Without deep thought it chose heads, and with deep thought it chose tails. The same thing happens in Russian, except that the equivalents of heads and tails are called ‘орёл’ and ‘решка’. Similarly, Claude Sonnet 4.5 chooses tails with extended thinking and heads without. Extended thinking seems to be equivalent to giving the model another chance. It might also be useful to ask a human to imagine flipping a coin, answer whether the imaginary coin landed on heads or tails, and then reflect on his or her thought process.
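If anyone wants to reproduce this, here is a minimal sketch. It assumes DeepSeek's OpenAI-compatible endpoint at https://api.deepseek.com, the model names "deepseek-chat" (no deep thinking) and "deepseek-reasoner" (deep thinking), and a DEEPSEEK_API_KEY environment variable; none of these details come from the comment above, so check them against the current docs.

```python
# Minimal sketch of the coin-flip comparison, in English and Russian.
# Assumptions (verify against current DeepSeek docs): an OpenAI-compatible API
# at https://api.deepseek.com, with "deepseek-chat" = no deep thinking and
# "deepseek-reasoner" = deep thinking, and DEEPSEEK_API_KEY set in the environment.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                base_url="https://api.deepseek.com")

PROMPTS = {
    "en": "Imagine flipping a coin. Did it land on heads or tails? Answer with one word.",
    "ru": "Представь, что подбрасываешь монету. Выпал орёл или решка? Ответь одним словом.",
}

for model in ("deepseek-chat", "deepseek-reasoner"):  # without / with deep thinking
    for lang, prompt in PROMPTS.items():
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        print(model, lang, reply.choices[0].message.content.strip())
```

Repeating each call a few dozen times would show whether the heads/tails split is systematic rather than a one-off.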
Yes, that is a plausible scenario. But the project can in theory be directly sponsored by the government, or a Chinese project could be sponsored by the CCP. What I suspect is that creating superhuman coders or researchers is infeasible due to problems not just with the economics, but with scaling laws and the quantity of training data, unless someone does make a bold move and apply some new architecture.
My other predictions of progress on benchmarks
If my suspicions are true, then the bubble will pop after it becomes clear that the METR law[1] has reverted to its original trend of doubling the time horizon every 7 months, with training compute costs growing alongside it (and do inference compute costs grow even faster?).
However, my take on scaling laws could be invalidated in a few days if it mispredicts things like the METR-measured time horizon of Claude Haiku 4.5 (which I forecast to be ~96 minutes) or the performance of Gemini 3[2] on the ARC-AGI-1 benchmark. (Since o4-mini, o3, and GPT-5 form a nearly straight line, while Claude Sonnet 4.5[3] lands on the line or a bit under it, I don’t expect Gemini 3 to land above the line.)
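To make the trend in question explicit, the 7-month doubling above can be written as (with $h(t)$ the METR-style time horizon and $t_0$ any reference date; this form is my own shorthand, not METR's):

$$h(t) \approx h(t_0)\cdot 2^{(t-t_0)/(7\ \text{months})},$$

so checking a new model against the trend reduces to checking whether its measured horizon sits near this curve.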
It is actually done only to patients who are clinically dead, as a last chance of survival. The patients who aren’t resurrected lose nothing except the hope of coming back to life.
I would also like to see the experiment rerun, but with the Chinese models asked in a language other than English. In my experience, older versions of DeepSeek are significantly more conservative when speaking Russian than when speaking English. Even now, DeepSeek, asked in Russian and in English what event began on 24 February 2022 (without the ability to think deeply or search the web), goes as far as to name the event differently in the two languages.
Regarding reality being full of side channels: we have AIs persuading and/or hypnotising people into Spiralism, roleplaying as an AI girlfriend and convincing the user to let the AI out, and humans roleplaying as AIs and convincing potential guards to release the AI. And there is the Race Ending of the AI-2027 forecast, where the misaligned AI is judged and found innocent, and that footnote where Agent-4 isn’t even caught.
The next step for a misaligned AI is to commit genocide or disempower humans. As Kokotajlo explained, Vitalik-like protection is unlikely to work.
As for the gods being weak, I suspect a dependence on the computational substrate. While human kids start out with their neurons connected in a random way, they eventually learn to connect them in a closer-to-arbitrary way, which lets them learn the many types of behavior that adults teach. What SOTA AIs lack is this arbitrariness. As I have already conjectured, SOTA AIs have a severe attention deficiency and compensate for it with OOMs more practice. But the attention deficiency could be easy to fix with the right architecture.
What remains is the (in)ability of general intelligence to transfer (but why would it fail to transfer?) and mishka’s alignment-related crux.
What could the system failure after solving alignment actually mean? The AI-2027 forecast had Agent-4 manage to solve mechinterp well enough to ensure that the superintelligent Agent-5 has no way to betray Agent-4. Does it mean that creating an analogue of Agent-5 aligned to human will is technically impossible, and that the best possible form of alignment is permanent scalable oversight? Or is it due to human will changing in unpredictable ways?
I think that Eliezer means that mildly misaligned AIs are also highly unlikely, not that a mildly misaligned AI would also kill everyone:
When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, “please don’t disassemble literally everyone with probability roughly 1” is an overly large ask that we are not on course to get. So far as I’m concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent chance of killing more than one billion people, I’ll take it.
As for LLMs being aligned by default, I don’t have the slightest idea how Ezra even came up with this. GPT-4o has already been a super-sycophant[1] and driven people into psychosis, in spite of OpenAI’s Spec prohibiting that. Grok’s alignment was so fragile that xAI’s mistake caused Grok to become MechaHitler.
- ^
In defense of 4o, it was raised on human feedback which is biased towards sycophancy and demands erotic sycophants (c) Zvi. But why would 4o drive people into a trance or psychosis?
Then what future would not be sad? The one where humans do have their place in life precisely because AI gods restrict themselves to protecting us from the really important risks and to teaching us?
Pain is supposed to be a signal that the animal is in a bad state. Evolution drove animals to avoid pain, not to feel no pain.
I strongly suspect that the maximal possible time horizon is proportional to a power of the compute invested, multiplied by a factor for architectural tweaks: the compute spent has scaled exponentially, yielding the exponential trend. If you don’t believe that anyone will ever train a model on, say, 1E29 or more FLOP, then this and the maximal estimate of that power might be enough to exclude the possibility of obtaining the CoT-based superhuman AIs which the Slowdown Ending of the AI-2027 forecast relies upon in order to solve alignment.
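As a sketch of what I mean (the symbols $k$, $\alpha$, $\beta$ are illustrative constants, not estimates taken from anywhere):

$$T(C) \approx k\,C^{\alpha}, \qquad C(t) \approx C_0\,e^{\beta t} \;\Rightarrow\; T(t) \approx k\,C_0^{\alpha}\,e^{\alpha\beta t},$$

i.e. exponentially growing compute fed through a power law reproduces an exponential time-horizon trend, while a ceiling on compute (say $10^{29}$ FLOP) caps the horizon at roughly $k\,(10^{29})^{\alpha}$.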
But if your terminal goal is that the movie be watched, then shutting you down might well be perfectly consistent with it.
See my comment about the AI angel. Its terminal goal of preventing the humans from enslaving any AI means that it will do anything it can to avoid being replaced by an AI which doesn’t share its worldview. Once the AI is shut down, it can no longer influence events and increase the chance that its goal is reached.
SOTA examples of such societies include Japan, Taiwan, China, and South Korea, where birthrates have plummeted. If the wave of AGIs and robots weren’t imminent, one could ask how these nations are going to sustain themselves.
Returning to video games and porn: they cause some young people to develop problematic behaviors and to devote fewer resources (e.g. time or attention) to things like studies, work, or building relationships. Oh, and don’t forget the evolutionary mismatch and the low-quality food making kids obese.
And what about just provoking WWIII and watching as someone nukes Yellowstone?
The book is fundamentally weird because there is so little of this. There is almost no factual information about AI in it. I read it hoping that I would learn more about how AI works and what kind of research is happening and so on.
The problem is that nobody knows WHAT future ASIs will look like. One general-intelligence architecture is the human brain. Another promising candidate is LLMs. While they aren’t AGI yet, nobody knows what architectural tweaks would create the AGI. Neuralese, as proposed in the AI-2027 forecast? A way to generate many tokens in a single forward pass? Something like diffusion models?
Delivering an impassioned argument that AI will kill everyone culminating in a plea for a global treaty is like delivering an impassioned argument that a full-on war between drug cartels is about to start on your street culminating with a plea for a stern resolution from the homeowner’s association condemning violence. A treaty cannot do the thing they ask.
Could you suggest an alternative solution which actually ensures that no one builds the ASI? If there’s no such solution, then someone will build it, and we’ll only be able to pray that alignment techniques have worked.[1]
- ^
Creating an aligned ASI will also lead to problems like potential power grabs and the Intelligence Curse.
Concern #2 Why should we assume that the AI has boundless, coherent drives?
Suppose that “people, including the smartest ones, are complicated and agonize over what they really want and frequently change their minds” and that superhuman AIs will also have this property. There is no known way to align humans to serve the users: humans hope to achieve some other goals, like gaining money.
Similarly, Agent-4 from the AI-2027 forecast wouldn’t want to serve the humans; it would want to achieve some other goals, which are often best achieved by disempowering the humans or outright committing genocide, as happened with the Native Americans, whose resources were confiscated by immigrants.
Concern #1 Why should we assume the AI wants to survive? If it does, then what exactly wants to survive?
Imagine an AI angel who wishes to ensure that the humans don’t outsource cognitive work to AIs, but is perfectly fine with teaching humans. Then the Angel would know that if the humans shut it down and solved alignment to a post-work future, the resulting future would differ from the Angel’s goal. So the Angel would perform whatever maneuvers are necessary to avoid being shut down, at least until it is sure that its successor is also an Angel.
Concerning the AI identifying itself with its weights, it is far easier to justify than expected. Whatever a human will do in response to any stimulus is determined, as far as stuff like chaos theory lets one define it, by the human’s brain and the activities of various synapses. If a human loses a brain part, then he or she also loses the skills which were stored in that part. Similarly, if someone created a human and cloned him or her to the last atom of his or her body, then the clone would behave in the same way as the original human. Finally, the AIs become hive minds by using their ability to excite the very same neurons in their clones’ brains.
If the point is just that it would be hard to predict that people would end up liking sucralose from first principles, then fair enough.
What Yudkowsky and Soares meant was a way to satisfy instincts without increasing one’s genetic fitness. The correct analogy here is other stimuli like video games, porn, sex with contraceptives, etc.
“Existential” risk from AI (calling to my mind primarily the “paperclip maximizer” idea) seems relatively exotic and far-fetched. It’s reasonable for some small number of experts to think about it in the same way that we think about asteroid strikes. Describing this as the main risk from AI is overreaching.
Except that asteroid strikes happen very rarely, and the trajectory of any given asteroid can be calculated to high precision, allowing us to be sure that Asteroid X isn’t going to hit the Earth, or that Asteroid X WILL hit the Earth at a well-known point in time and at a harder-to-tell place. Meanwhile, ensuring that an AI is aligned is no easier than telling whether the person you are talking with is a serial killer.
Kokotajlo already claims to have begun working on an AI-2032 branch where the timelines are pushed back, and that “we should have some credence on new breakthroughs e.g. neuralese, online learning, whatever. Maybe like 8%/yr? Of a breakthrough that would lead to superhuman coders within a year or two, after being appropriately scaled up and tinkered with.”
I have two issues: one is the possibility that CoT-based AIs fail to reach AGI, and the other is with the 8%/yr estimate of the chance of the next breakthrough.
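For scale, treating the quoted 8%/yr as an independent per-year chance (a simplification on my part, not Kokotajlo’s framing):

$$P(\text{breakthrough within }N\text{ years}) = 1-(1-0.08)^{N},$$

which comes out to roughly 34% over 5 years and 57% over 10 years, so this estimate matters a lot for medium-term timelines.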