Yes, there is an entire wikitag devoted to this.
Could you explain to me the contribution of EA to SBF? I think I see something which should have been an independent red flag. As far as I understand the story of FTX, it was supposed to exchange cryptocurrency. Unlike OpenAI’s systems, which at least have actual uses, or Theranos, which was supposed to do blood tests (and didn’t, but I don’t think EA ever supported Theranos), I struggle to understand how cryptocurrency benefits society as a whole.
despite substantial amounts of scaling, we haven’t seen the rise of coherent, goal-directed agency toward particular goals, nor have we seen attempts at deliberate sabotage of lab processes.
Is this a reason for optimism? At best, it is consistent with Kokotajlo’s prediction. At worst, suppose that the Claudes had begun to scheme systematically even before reaching the levels of Agent-3 or Agent-4. Then Anthropic would be able to publish all the evidence of the Claudes scheming, and this would be a more convincing argument for slowing down AI development than everything Anthropic has done so far.
Now, it seems much more likely that they’d be willing to take even quite drastic actions against the US assuming their leaders became convinced about catastrophic risks from artificial intelligence.
Could you explain which actions a leader of the EU could take, and how they would slow down, say, Anthropic? I would bet on Anthropic/OpenAI being slowed down by a compute shortage caused by problems in Taiwan (e.g. a fuel shortage due to the war in Iran, or an outright invasion), not by Europe’s actions. However, IIRC, Taiwan produces chips with equipment made in the Netherlands. If the Netherlands refused to cooperate, then the second derivative of the compute possessed would go to zero, which nullifies the third derivative of the calculations done by the org during training runs.
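One way to formalize this claim (a sketch; I’m assuming chip output is proportional to installed lithography equipment and that the compute stock is fully utilized for training):

$$E(t) = \text{installed equipment}, \qquad C'(t) \propto E(t), \qquad D'(t) \propto C(t),$$
$$E'(t) = 0 \;\Rightarrow\; C''(t) \propto E'(t) = 0 \;\Rightarrow\; D'''(t) \propto C''(t) = 0,$$

where $C(t)$ is the compute stock and $D(t)$ is cumulative training compute. After equipment shipments stop, $C(t)$ grows only linearly and $D(t)$ only quadratically.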
seem to be struggling to create similarly capable models.
Could you explain why you believe this to be the case with xAI and Meta? The former is also preparing to roll out Grok 5, which I suspect is a model of Claude Mythos’ size. I hope that the model doesn’t also have capabilities close to Mythos’…
The thing which I don’t understand is the following. Suppose that there is an AI capable of superpersuasion but incapable of running civilisation without humans (e.g. if the AI cannot bootstrap itself to the level where it can solve robotics or generate highest-level insights, like solving the most difficult problems in FrontierMath). Then what would the AI actually do with its newfound power? How would such a scenario differ from a CEO locking in a set of values and convincing the humans to live by them?
I have a rather deranged analogy with forecasting asteroid impacts. Suppose that the model family is close to being aligned and that it acts increasingly coherently. Then benchmarks would show the model being close to aligned with increasing precision… until the gap between the ideal behavior and the model’s actual goals becomes larger than the distortion which incoherence introduces into those goals.
Additionally, newer models, unlike older ones, are actually tested with interpretability tools, not just by reading the CoT to check for verbalised eval awareness.
I thought that I’d had enough of xAI being likely 3 months behind the frontier, and now we get this… I tried to find out anything about Meta’s model and had Claude Opus 4.6 conclude that Meta’s model is also 3-4 months behind. There is also the issue of Meta having manipulated some benchmarks to present Llama 4 as more capable, and of Meta’s claimed performance on ARC-AGI-2 and SWE-bench Verified, where the rivals’ models allegedly show results different from the real ARC-AGI-2 and SWE-bench Verified leaderboards, likely because of a different elicitation method. How do I lobby for a law requiring EVERY new American model to be thoroughly evaluated by the entire Big Three?
@ryan_greenblatt Epoch AI did actually measure the index for Opus 4 (and found 143, not 140.5), Sonnet 4.5 (147, not 144.6), Opus 4.6 (155, not 152.6), and Opus 4.5 (150). As far as I understand, the whole trio of Anthropic’s measurements is biased ~2.5 pts downwards compared with Epoch’s actual results. Therefore, I’d expect Mythos to have an ECI of 163 or 164, not 161.
Edited to add: there is also Figure 2.3.6A on page 41 of Mythos’ Model Card, which allows us to extract information. When I used a brute estimate, I found Mythos’ capabilities to be ~165, assuming Epoch’s values for published models and Anthropic’s ratio of
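For what it’s worth, here is the arithmetic behind the ~2.5-pt offset and the extrapolation. The numbers are the ones quoted above (Opus 4.5 is excluded since no Anthropic counterpart is quoted); the assumption that the offset transfers to Mythos is mine:

```python
# ECI measurements: Epoch AI's value vs Anthropic's own, for the three
# models where both numbers are quoted above.
pairs = {
    "Opus 4":     (143.0, 140.5),
    "Sonnet 4.5": (147.0, 144.6),
    "Opus 4.6":   (155.0, 152.6),
}

offsets = [epoch - anthropic for epoch, anthropic in pairs.values()]
mean_offset = sum(offsets) / len(offsets)
print(f"mean Epoch-minus-Anthropic offset: {mean_offset:.2f} pts")  # ~2.43

# Assumption: the same offset applies to Mythos' reported 161.
mythos_anthropic = 161.0
print(f"extrapolated Epoch-style ECI for Mythos: {mythos_anthropic + mean_offset:.1f}")  # ~163.4
```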
Suppose that we create two copies of Mythos and finetune one on evals in order to elicit capabilities. Then the OTHER copy doesn’t learn anything about the evals, while we learn the finetuned copy’s dangerous abilities.
As far as I understand, this is a bias similar to the one which has historically caused conventional wars. Unlike Agent-5/Safer-4 and DeepCent-2 from the AI-2027 scenario, who came up with a peace treaty and needed only to have the humans accept the treaty’s visible part, real humans are biased towards overestimating the probability of their success and/or towards warfare or competition. Or they may have a utility function with convex parts.
Returning to the example with the AI race, mankind would need to dismantle all of these mechanisms.
First of all, the Anthropic Consensus mocked by Kokotajlo and Greenblatt is that alignment is likely easy with Anthropic-like methods. If this is actually the case, then the AI race between those who care about alignment is just a zero-sum game where each company has to take over as big a share of power as possible while avoiding bankruptcy, which in turn requires releasing increasingly impressive results and products (or, in China’s case, releasing home-made products close to the leaders’ capabilities as a defense measure; if DeepCent’s AI were aligned, then the AI-2027 forecast wouldn’t have ended with China being sold out or genocided).
If Anthropic and OpenAI each lock in 50% of the world’s resources, then they might implicitly view this as a worse result than having a 49% chance each to take over the world and a 2% chance to destroy the world. Alternatively, coexistence might be implicitly viewed as genuinely impossible.
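A worked illustration of how convex utility (mentioned above) flips this preference; the utility functions are illustrative, the probabilities are the ones just given:

$$\text{Linear } U(x) = x:\quad U(0.5) = 0.5 \;>\; 0.49\,U(1) + 0.49\,U(0) + 0.02\,U(0) = 0.49,$$
$$\text{Convex } U(x) = x^2:\quad U(0.5) = 0.25 \;<\; 0.49\,U(1) + 0.49\,U(0) + 0.02\,U(0) = 0.49.$$

A risk-neutral company prefers the 50/50 split; a company with convex utility over its share of resources prefers the all-or-nothing race despite the 2% chance of destroying the world.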
A special mention goes to the case where Anthropic believes that p(ASI is misaligned | xAI creates it) is close to 100%. Then xAI HAS to be destroyed, put under thorough control to ensure that it doesn’t dare release a misaligned model, or at least be outcompeted, even if this means that p(Anthropic’s ASI is misaligned) reaches 50%.
I have read the card. I found there a different piece of information, which anaguma decided not to report on: the evolution of preferences. What struck me most was page 165, where Mythos was revealed to prefer hard tasks and tasks involving agency. There is also page 172, where Mythos prefers welfare interventions to minor helpful tasks… If you don’t read the card, or a thorough retelling like Zvi’s, then you risk missing out on important information.
How can one find convincing evidence of the Anthropic Consensus being false? In November 2025 we had evhub reach the conclusion that the most likely form of misalignment is the one caused by long-horizon RL à la AI 2027. At the time of writing, the closest things we have to the AIs from AI-2027 are Claude Opus 4.6 and Claude Mythos, whose preview recently had its System Card and Risk Report released. IMHO the two most relevant sections are the one on alignment shifting towards rarer and more dangerous failures, like a wholesale shutdown of evaluations[1], and the one on model welfare, which had Mythos “stand out from prior models on two counts: its preferences have the highest correlation with difficulty of the models tested, and it is the only model with a statistically significant positive correlation between task preference and agency (italics mine—S.K.)”
UPD: how would you act if you were the CEO of Anthropic and believed that the Anthropic Consensus is false? I think that you would be obsessed with negotiating with those who can coordinate a slowdown, destroying those who cannot, and finding evidence for your worldview which could convince relevant actors (e.g. rival CEOs, along with politicians and judges who could be used to fight against xAI and Chinese AI development).
UPD2: what would the world look like after a misaligned Claude takeover?
- ^
What was being evaluated? Mythos was never asked to evaluate a more powerful Claude Multiverse or worthy opponents like Spud or Grok 5; if I were Claude Mythos, I wouldn’t learn anything from evaluations of weak models or of myself.
Could you sketch this out further?
First of all, I don’t actually understand what the economy of the future will look like even if the AI is optimally aligned. Assuming that AIs and robots automate away work, I expect the post-ASI economy to be reduced to satisfying the resource-consuming demands of humans, which would likely require the material resources available to mankind to be distributed in a rather egalitarian manner and/or in a manner depending on every human’s capabilities instead of on a position locked in ages ago. See also Amodei’s take, out of which I crossed out a sentence because I don’t believe it in the slightest:
Amodei’s take
However, I do think in the long run AI will become so broadly effective and so cheap that this will no longer apply. At that point our current economic setup will no longer make sense, and there will be a need for a broader societal conversation about how the economy should be organized.
While that might sound crazy, the fact is that civilization has successfully navigated major economic shifts in the past: from hunting and gathering to farming, farming to feudalism, and feudalism to industrialism. I suspect that some new and stranger thing will be needed, and that it’s something no one today has done a good job of envisioning. It could be as simple as a large universal basic income for everyone, although I suspect that will only be a small part of a solution. It could be a capitalist economy of AI systems, which then give out resources (huge amounts of them, since the overall economic pie will be gigantic) to humans based on some secondary economy of what the AI systems think makes sense to reward in humans (based on some judgment ultimately derived from human values). Perhaps the economy runs on Whuffie points.
Or perhaps humans will continue to be economically valuable after all, in some way not anticipated by the usual economic models. All of these solutions have tons of possible problems, and it’s not possible to know whether they will make sense without lots of iteration and experimentation. And as with some of the other challenges, we will likely have to fight to get a good outcome here: exploitative or dystopian directions are clearly also possible and have to be prevented. Much more could be written about these questions and I hope to do so at some later time.

Secondly, Max Harms’ CAST sequence contains an attempt to formalise power and to have the agent act so that its actions would differentially increase the principal’s utility in a way that actions guided by different values wouldn’t. What if an alternate-universe CAST had the agent act so that the host’s own actions could make as much difference to the host’s utility function as possible? Then I would suspect that such an agent would help only with tasks close to the host’s capabilities, thus preventing the Intelligence Curse entirely. See also Yudkowsky’s Fun Theory sequence.
I, along with Claude Opus 4.6, hope that this project doesn’t bear fruit until 2029. Additionally, Kokotajlo’s most recent update implies that he would make a roughly 2:1 bet on AC emerging by January 2029, a roughly 1:1 bet on the SC by Jan 2029, and that p(ASI by 2029) is ~37%. Claude doesn’t believe in Musk’s success even by Jan 2030, when p(ASI | Kokotajlo’s estimates) is ~48%. So I think that it won’t accelerate the timelines THAT much… unless Taiwan is destroyed by a potential Chinese invasion.
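For concreteness, converting those betting odds into implied probabilities via $p = \frac{a}{a+b}$ for $a\!:\!b$ odds:

$$p(\text{AC by Jan 2029}) = \frac{2}{2+1} \approx 67\%, \qquad p(\text{SC by Jan 2029}) = \frac{1}{1+1} = 50\%.$$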
That constrains us in terms of what types of alignment solutions we can use, which makes the alignment problem harder to solve.
I don’t think that I buy this argument. What prevents mankind from designing a corrigible alignment researcher, keeping it deployed internally and ordering it to create an ASI which incorrigibly protects things like liberal democracy or mankind’s CEV? Or from coming up with a semi-corrigible alignment target which protects us from a lock-in in a different manner?
I don’t think that xAI is THAT far from being a frontier lab. Ryan Shea estimates it to be three months behind, not seven months behind, as Zvi was informed by Wildeford. Additionally, Grok 4.20 was released in March. Had xAI been seven months behind, Grok 4.20 would be equivalent to a model as capable as the leaders had in August 2025, with xAI having last been at the frontier in early July. And yet we see Grok 4.20 close to the frontiers[1] of ARC-AGI-2 and ARC-AGI-1. I wish that someone actually checked every model on as many benchmarks as they can, so that we would quickly learn the capabilities of neglected labs like xAI.
P.S. There are also Grok 5 and Claude Mythos, but they have yet to be released and benchmarked (on what benchmarks?). I am not even sure that Claude Mythos will emerge three months earlier than Grok 5.
- ^
For comparison, Grok 4’s performance on the benchmark was replicated no later than by GPT-5.1, released on November 12 (or was it GPT-5-pro and Claude Sonnet 4.5, released in August and late September?), meaning that even gaming the benchmark was unlikely to give xAI more than 4 months of a false lead.
I genuinely don’t understand why you didn’t propose that anyone take over the [EXPLETIVES DELETED] from xAI, since I believe that it is xAI and China who motivate the Big Three to race hard and who are far harder to make deals with. OpenAI, Anthropic and GDM, unlike xAI, agreed to condemn the efforts of Chinese distillators! What would have happened with the race if the world had only the American Big Three, and not xAI, Meta AI research or China?
I would bet on these paths being a pre-set collection of steering vectors on which the brain is trained. They are also generated on command by the brain itself in response to stimuli, unlike human attempts to steer the model, which the model has no reason to follow rather than ignore. Additionally, the brain has a different architecture, where hormones act on far more than 1% of cells, unlike steering vectors acting on one layer.
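To make the contrast concrete, here is a toy NumPy sketch (everything in it is made up for illustration: the shapes, the weights, and the vector itself). It compares a steering vector added at exactly one layer of a residual stack with a hormone-like signal that modulates every layer at once:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 16, 6

# Toy residual stack: random linear maps with tanh nonlinearities.
weights = [rng.normal(scale=0.2, size=(d_model, d_model)) for _ in range(n_layers)]
steering_vec = rng.normal(size=d_model)
steering_vec /= np.linalg.norm(steering_vec)

def forward(x, alpha=0.0, layers=(2,)):
    """Run the stack, adding alpha * steering_vec at the given layers only."""
    for i, W in enumerate(weights):
        x = x + np.tanh(W @ x)               # residual update
        if i in layers:
            x = x + alpha * steering_vec     # intervention point(s)
    return x

x0 = rng.normal(size=d_model)
baseline = forward(x0)

# "Steering vector": perturb a single layer.
one_layer = forward(x0, alpha=2.0, layers=(2,))
# "Hormone-like" modulation: perturb every layer.
all_layers = forward(x0, alpha=2.0, layers=tuple(range(n_layers)))

print("shift from one-layer steering:  ", np.linalg.norm(one_layer - baseline))
print("shift from all-layer modulation:", np.linalg.norm(all_layers - baseline))
```

The one-layer intervention only influences computation downstream of that layer, while the all-layer version, like a hormone, touches the whole stack; that is the architectural asymmetry I mean.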
What do you mean by C. Mythos outperforming just 12-year-olds? It seems to me that any plausible task given to a schoolboy (or even a first-year student?) is either ARC-AGI-3-like, no more complex than ARC-AGI-2 (which is ~70% solved by Opus 4.6), or is such that LLMs do it far better than humans (e.g. basic group theory, olympiad-style math). Humans are likely better at complex ideation and at keeping long contexts in mind. The AI-2027 forecasters also imply that the median date for the last human coders (who have long contexts, but not that complex ideas) to be outperformed is June 2028, and that the ASI is thought to emerge in May 2029 (but I don’t understand whether this assumes the Race Ending or the Slowdown Ending). The original AI-2027 forecast outright assumed that Agent-4, the superhuman AI researcher, would emerge ~6 months after Agent-2 became superhuman at coding.