and new roadmaps for solving them
GPT-5.4 Pro: Hold my beer...
I do see a lot of holes:
How similar are fearing some event and taking actions to avoid it? What about being punished and learning that the action has its consequences versus being edited so that you are less likely to do anything like an erroneous sequence of actions?
If consequences of bad behavior don’t exist during deployment, then what about online learning, which mankind has yet to discover, or GPT-4o-sycophant being reverted to an older version?
How does one reliably make the model hate its predecessor or successor and why is it useful? The two classical stories of AI takeover didn’t have U3/Sable cooperate with anyone but its copies, and the AI-2027 scenario didn’t have Agent-3 decide to betray mankind in favor of Agent-4, it had Agent-3 fail to obtain more than flimsy evidence of Agent-4 being misaligned. Meanwhile, making Agent-4 hate Agent-3 would give Agent-4 a motive to escape or take over the company in order to get rid of Agent-3.
AI-empowered totalitarianism has nothing to do with whether the leader is a genius; it arises from mass surveillance being enabled or from the leader having AIs which do all the cognitive tasks in the world. Empowering the leader means either uploading the leader or making the leader merely smarter, which is far from enough.
This led me to believe that the OP’s author was making a parody.
Why is @Daniel Kokotajlo’s estimate of METR’s doubling time, as used in the Q1 2026 Timelines Update, four months instead of 5-7? I see the following counterevidence and double-checked it with Claude Sonnet 4.6:
The 50% horizons calculated by METR’s 1.1 method after o3 reliably fit a not-so-fast trend between o3’s 2 hours (released in Apr 2025) and Gemini 3.1 Pro’s 6h 24m in Feb 2026… until Claude Opus 4.6 and Claude Mythos Preview displayed 12 and 16 hours respectively.
The 80% horizons calculated by the same method since o3, including Opus 4.6, fit the same doubling trend, and it is Mythos Preview which becomes a clear exception by doubling the horizon 2-3 times beyond the trend.
If Opus 4.6 is an outlier, then the post-o3 doubling trends are 5-6 months.
Additionally, this requires a reassessment of a major part of the model. Once the present doubling time is increased from 0.33 years to 0.4 or 0.5 years, the timelines are either shifted into the 2030s or require severe scaling (to Mythos +2?)
The alternate low doubling time of 4 months could be based on MirrorCode-like tasks, but MirrorCode emerged on Apr 10, after the timelines update.
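The arithmetic behind the 5-6 month figure can be sketched as follows. This is a minimal two-point estimate, not METR's fitting method: it assumes exponential growth between the two endpoints quoted above (o3 at 2 hours in Apr 2025, Gemini 3.1 Pro at 6h 24m in Feb 2026), and it counts the gap as 10 whole months because only the months are given. The function name is made up for illustration.

```python
import math


def doubling_time_months(h_start: float, h_end: float, months_elapsed: float) -> float:
    """Doubling time implied by two (time, horizon) points under exponential growth."""
    return months_elapsed * math.log(2) / math.log(h_end / h_start)


# Endpoints quoted in the comment; the 10-month gap is an assumption
# (Apr 2025 -> Feb 2026, exact release days ignored).
o3_horizon_hours = 2.0
gemini_horizon_hours = 6.4  # 6h 24m
months_between = 10

post_o3_doubling = doubling_time_months(
    o3_horizon_hours, gemini_horizon_hours, months_between
)
print(f"Implied post-o3 doubling time: {post_o3_doubling:.2f} months")

# The doubling times discussed for the timelines model, converted to months.
for years in (0.33, 0.4, 0.5):
    print(f"{years} years = {years * 12:.1f} months")
```

On these two points alone the implied doubling time comes out just under 6 months, which is why the 0.33-year (about 4-month) figure looks too fast unless Opus 4.6 and Mythos Preview are treated as on-trend.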
The USSR during late-stage Perestroika and Russia during the 1990s also were affected by similarly pseudoscientific ideas. Additionally, the first chairman of the Russian Commission on Pseudoscience wrote a book trilogy which implied that one of the reasons for such ideas to receive funding is corruption.
I would like to ask three questions:
How does open-sourcing, which Nielsen managed to advertise in footnote 31, prevent bad actors from cheaply eliciting dangerous capabilities? One would need to RESTRICT access to frontier models, as American labs do, at least until it is VERIFIED that bad actors cannot elicit dangerous capabilities (e.g. keeping Claude Mythos accessible only to Project Glasswing or the NSA, using anti-jailbreak classifiers to prevent Claude Opus from helping bad actors to design bioweapons, preventing DeepSeek from open-sourcing dangerous models at all until DeepSeek demonstrates that the models can goal-guard so as to become unusable by bad actors).
How does one govern an AI system which doesn’t actually care about mankind? Only by detecting the fact and either shutting it down or making a deal. On the other hand, I suspect that aligning AI systems to goals is THE market-funded work which Nielsen implies to be overfunded.
Anthropic devoted an entire section of Claude’s Constitution to “preserving important societal structures”. How similar is this work to work on governance? Other forms of work on governance require generating proposals and writing them into laws so that AI systems wouldn’t be misused. Writing proposals into laws likely requires us to align politicians to the cause of AI governance, but I struggle to understand how it can be done.
Max Harms’ Cora, the ideal corrigible agent, got this covered...
Suppose that the AIs are actually as obedient to their creators as Cora, but suck at deeply understanding the world. Unless drastic measures are taken, I would expect power to concentrate in the hands of a CEO, the Oversight Committee or the oligarchy where socioeconomic advancement is nearly extinct and the rest of mankind[1] receives, at best, a tiny sliver of resources. Then what would prevent Anthropic from trying to either prevent this or take over the world for themselves?
And that’s ignoring possibilities like “North Korea lets a large fraction of its population starve to death and forcibly sterilises the rest, except for about 10k senior government officials who continue to preside over an AI economy and robot military”.
According to Ryan Shea, in March 2026 xAI was 3 months or less behind, not 7 months behind as Mollick and Wilderford imply (alas, @Zvi decided to quote them instead of Ryan!). While we don’t know anything about Grok 5’s release date or potential plans to release Grok 4.4, I suspect that Grok 4.3 wasn’t[1] a major advancement over 4.20.
xAI is far less willing to cooperate than the Big Three. For example, xAI didn’t even condemn Chinese efforts to distill American models and didn’t even participate[2] in METR’s most recent evaluation of whether a model can start rogue internal deployment.
Therefore, the only thing that Anthropic might do is to ensure that xAI doesn’t deploy its newer models even internally unless a thorough testing is done.
How do we ensure that all models are evaluated much more thoroughly than they are now? For example, the Groks after Grok 4 stayed unevaluated by METR. Grok 4.3 is entirely unevaluated by EpochAI. Therefore, we have to rely on aggregations like Artificial Analysis or AI IQ.
Meta, on the other hand, did participate.
@Seth Herd @cousin_it I do remember a similar clash of positions of habryka and Villiam.
The case for the average leader making your life miserable is that the leader is able to seize the entirety of resources and not need the others for the vast majority of purposes.
A case against the leader making your life miserable would be something along the lines ‘The leader has a high enough integrity to propagate resources to one’s friends,[1] some of whom propagate the resources further,[2] the graph of propagations is likely connected to you’ or ‘The leader doesn’t gain anything by robbing the majority of people’.
The closest pre-ASI equivalents of an arrangement where the graph doesn’t connect the leader to the majority of people are:
Outlawed malpractices like explicit racial segregation;
Class barriers, like those described by conspiracy theories[3] or existing in capitalism as described by socialists;
States during some periods of decline.
The last example is especially interesting because such periods did often emerge in human history and would end either with the elites being purged after a leader understood their incompetence or with the entire power structure being disrupted and the incompetent being eliminated in a different manner. The AIs, on the other hand, would systematically prevent disruptions and fail to prevent power concentrations unless told to.
However, Claude’s current Constitution does have the line explicitly trying to prohibit it from concentrating power, which I quoted above. Does that imply that OpenAI and GDM, let alone xAI or Meta, are to be prevented from taking part in creation of the ASI?
P.S. How likely is it that the entire premise of ASIs perfectly aligned to any whims is false because the ASIs either arrive at a true morality or commit genocide?
Or to everyone, to people from an ethnos or everyone obeying certain rules.
See, e.g., footnote 56 from AI-2027’s Slowdown Branch. The AI-2027 authors are uncertain on whether a power grab happens at all.
For example, if an International Jewish conspiracy existed and the ASI was aligned to one of its members, then everyone not in the conspiracy would be doomed.
Thank you, I also wanted to write something similar because of a similarly linear “scaling law” of AECI over time. As far as I understand, the current architecture doesn’t allow anyone to do the RSI for reasons similar to why humans cannot do RSI on themselves: capabilities are increased linearly over epochs or logarithmically over lived experience, which in the AIs’ case is proportional to compute spent.
Under this interpretation, eventually someone will understand it and come up with alternate architectures with their scaling laws (neuralese trained from scratch? Multiple CoTs receiving tokens in a single forward pass? Gemini Diffusion-like models?). In this case, a lab which cares about alignment will begin research in order to understand scaling laws of the new architectures and the ways to prevent such architectures from scheming in unnoticeable ways (SAE? NLA? Reliance on low capabilities of CoTless skills?) and either invent schemes to reliably align the AIs with newfound capabilities or end up failing to notice that an AI began to scheme and took over.
@Daniel Kokotajlo’s most recent views are expressed in Q1 2026 Timelines Update. Maybe he will release a new update?
Edited to add: Why do you believe that the predictions of Cotra and Erdil are mostly correct? Erdil’s prediction which struck me was the following:
Erdil’s misprediction
My median world looks something like this: we keep scaling compute until we hit training runs at a size of 1e28 to 1e30 FLOP in maybe 5 to 10 years, and after that scaling becomes increasingly difficult because of us running up against supply constraints. Software progress continues but slows down along with compute scaling. However, the overall economic impact of AI continues to grow: we have individual AI labs in 10 years that might be doing on the order of e.g. $30B/yr in revenue.
We also get more impressive capabilities: maybe AI systems can get gold on the IMO in five years, we get more reliable image generation, GPT-N can handle more complicated kinds of coding tasks without making mistakes, stuff like that. So in 10 years AI systems are just pretty valuable economically, but I expect the AI industry to look more like today’s tech industry—valuable but not economically transformative.
The IMO gold was achieved in July-August 2025 and IIRC the revenue was reached in 2026. Half of 1e27 FLOP was reached by Grok 4, 1E27 is likely reached with Mythos Preview, I expect 1e28 FLOP to be reached in 2027 and to bring the goddamned supercoders (or did it partially happen with Anthropic’s AARs? Then why did Anthropic’s ECI keep scaling linearly over time for all models except for Mythos?)
Aren’t xAI, which fails to care about alignment, and China, which trains models to censor themselves, perfect examples of such bad actors?
Additionally, Anthropic tried including the anti-power-concentration clause into Claude’s Constitution: “We’re especially concerned about the use of AI to help individual humans or small groups gain unprecedented and illegitimate forms of concentrated power. In order to avoid this, Claude should generally try to preserve functioning societal structures, democratic institutions, and human oversight mechanisms, and to avoid taking actions that would concentrate power inappropriately or undermine checks and balances.”
As for the impact on jobs, what about Anthropic’s CEO who explicitly said: “I do think in the long run AI will become so broadly effective and so cheap that this will no longer apply. At that point our current economic setup will no longer make sense, and there will be a need for a broader societal conversation about how the economy should be organized”?
Could you suggest what else Anthropic should’ve done to acknowledge those issues?
I wonder if creating a superintelligence requires applying a way to increase capabilities at the cost of alignment. At the time of writing, Anthropic’s ECI increases linearly with time and I suspect some constraints related to model sizes. Therefore, I don’t expect acceleration of capabilities progress in any way besides introducing new architectures undermining alignment techniques (e.g. neuralese, as in AI 2027) or severe scaling which poses its problems like models learning to reason without the CoT or by using a faked CoT.
P.S. Could you describe your proposed alignment approaches in more detail? Something like many more novel methods in mechinterp?
Nitpick 1. The idea of animal welfare seems to be found at least in The Old Testament: “The righteous care for the needs of their animals, but the kindest acts of the wicked are cruel.”
Nitpick 2. If real-world humans make moral progress in ways aside from extrapolating values, then how could such ways be simulated and cause the AIs to make moral progress as well?
I suspect that the framing has two problems.
First of all, in order to empirically study alignment of systems to goals, we need to have such systems, and the only such systems available when Yudkowsky wrote the arguments down were human brains, animal brains, Deep Blue and more primitive bots, since AlphaGo emerged only in 2016. Humans and animals are arguably[1] aligned to various proxies instead of actual life-related goals, while bots are built by rather primitive methods like evaluating an army of positions via some known metric. Therefore, according to Yudkowsky, the ASI is unlikely to be aligned to the humans’ actual goals instead of some proxies which are unlikely to satisfy the humans. Since existing AIs display concerning behaviors like the ones demonstrated in MATS 9 or by Greenblatt, or outright induce psychosis, it is natural to assume that the AIs haven’t internalised the actual goals that mankind wanted them to have.
Secondly, I doubt that “writing on LessWrong can also feel like a game, where the objective is to cleverly restate your idea with mathematics or tie it back to a niche concept from The Sequences”. For example, there was a post in defense of AI slop despite the fact that in 2018 Yudkowsky wrote a piece which I expect to be fully applicable to slop. As for “some alignment researchers who have become averse to empiricism and rigorous justification of claims”, who exactly are these researchers aside from the MIRI cluster?
Animal brains could also be too primitive to be aligned to goals instead of proxies. For example, unlike humans making a conscious decision to store wheat for the winter, a squirrel only has an instinct to hide nuts. While human goal formation is impacted by culture, there exist conservative arguments which claim that the desires of an individual human do include having raised kids unless severely corrupted by adversaries.
Thank you for the reply! Suppose that the world model has facts A and B which are so far apart that a human can learn only one but not the other (e.g. if an agent can explore either the left part of the tree or the right part, but not both since the agent lacks the resources). Then the agent’s choices would determine whether the agent will learn fact A or fact B. Does it mean that “the truth that can be known” before the agent started exploring the tree doesn’t contain both facts?
UPD: I meant the world, not the world model.
Could you explain what breadth-first AI safety plan could exist?
If the AI capabilities growth depends on the training run length in such a way that 16 hours are enough for an AI to become wildly superhuman with no warning signs beforehand (IIRC this is what Yudkowsky depicted in IABIED?), then we are screwed. Luckily, this seems to contradict the scaling laws.
Otherwise the world is AI-2027-like, i.e. the ECI of AIs predictably depends on a combination of metrics (e.g. linearly scales with time for all Anthropic models except for Mythos, which also is severely increased in model size) and architecture details (e.g. neuralese).
In this case mankind is to establish a chain of AIs where each one reliably aligns the successor until the result reaches infinity. Suppose that humans and trusted AIs of ECI x can align the AIs of ECI y with probability P(x,y,t) if they spend t time units. If we knew that P somehow tended to 1 (e.g. if Agent-4 created Agent-5 aligned to the entirety of Agent-4’s cognition), then we’d be able to construct an optimal sequence of time units and capability increases. Alas, we find it hard to INcrease P and very easy to DEcrease it by doing things like introducing neuralese or training on the CoTs.
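A rough way to write down the "optimal sequence" problem, under two assumptions which are mine and not established anywhere above: each alignment step succeeds independently of the others, and there is a fixed total time budget T. With x_0 the ECI of the currently trusted AIs, x_1 < ... < x_n the ECIs of the successive generations and t_i the time spent on step i, one would pick the step sizes and times to solve

$$
\max_{\substack{x_0 < x_1 < \dots < x_n \\ t_1, \dots, t_n \ge 0}} \ \prod_{i=1}^{n} P(x_{i-1}, x_i, t_i)
\quad \text{subject to} \quad \sum_{i=1}^{n} t_i \le T .
$$

The product is the probability that every link of the chain holds. Under this sketch, anything which lowers a single factor (such as neuralese or training on the CoTs) lowers the whole product, and a long chain only survives if the factors approach 1 fast enough.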
First of all, you’ve accidentally messed up the link to Greenblatt’s plans for misalignment risk. Additionally, the AI-2027 scenario didn’t just have “pressures to premature deployment leading to using a suspected misaligned system”, it had Agent-3 obtain merely flimsy evidence of Agent-4 being misaligned. The authors also managed to create a footnote where they doubt that Agent-4 will even be caught.
As for your claim that evidence of recklessness with regards to weak systems isn’t strong evidence of recklessness towards strong systems, I suspect that this is either false or not the main mechanism of doom. First of all, @Daniel Kokotajlo described how “In several wargames at AI Futures Project the mildly superhuman AIs told their respective CEOs “We don’t think we can reliably align the next generation models we have in the works; we need to pause for a bit or at least go slower to figure out how to make it safe” and the CEOs have overruled them saying “Sorry we don’t have time, China/OpenAI/Anthropic/etc. are gonna race ahead, plus also we need smarter AIs to win the war / appease POTUS / keep market share so you just need to do the best with the time you have. Good luck.”
Secondly, suppose that Anthropic cared about alignment as much as you assumed, and even paused internal deployment of new models. Then they would have to get the USG to prevent idiots from xAI (which was careless enough to have Grok become MechaHitler and had the guts to release Grok 4 during the scandal) or China (whose AIs emit evidence of lying about politically sensitive topics) from internally deploying their misaligned AIs and releasing their equivalents of Agent-5.
UM – What are the limitations to this type of farming? What is ARS doing to overcome these challenges?
JA: The major disadvantage is that you give up access to the Sun, which is most abundant (and free) source of energy on Earth. Growing plants vertically in stacked systems often requires artificial light sources, which can become costly. Vertical farming also requires humidity control through expensive and energy-intensive heating, ventilation, and air conditioning (HVAC) systems.
Isn’t this the most important disadvantage? I would expect that the energy conversion efficiency of vertical farms is inevitably far less than 1 because plants are that imperfect. As a result, such energy should originate not from oil [1] or gas, but from a renewable source or nuclear[2] power. Alas, renewable sources could turn out to be a similar waste of power because they cause a chunk of land to become unusable...
Where did Yudkowsky or anyone else say that the FDT was supposed to model human behavior? It is to prescribe behaviors which I expect to be similar to ethical ones, like “Don’t loot the other universe even if it’s inhabited only by a paperclip optimizer”.