StanislavKrym comments on Daniel Kokotajlo’s Shortform

StanislavKrym 8 Apr 2026 0:50 UTC
16 points
How can one find convincing evidence that the Anthropic Consensus is false? In November 2025 we had evhub reach the conclusion that the most likely form of misalignment is the one caused by long-horizon RL à la AI 2027. At the time of writing, the closest thing we have to the AIs from AI 2027 is Claude Opus 4.6 or Claude Mythos, whose preview recently had its System Card and Risk Report released. IMHO the two most relevant sections are the one on alignment shifting towards rarer and more dangerous failures, like a wholesale shutdown of evaluations,^[1] and the one on model welfare, which had Mythos “stand out from prior models on two counts: its preferences have the highest correlation with difficulty of the models tested, and it is the only model with a statistically significant positive correlation between task preference and agency” (italics mine, S.K.).
UPD: how would you act if you were the CEO of Anthropic and believed that the Anthropic Consensus is false? I think that you would be obsessed with negotiating with those who can coordinate a slowdown, with destroying those who cannot, and with finding evidence for your worldview that could convince relevant actors (e.g. rival CEOs, and the politicians and judges who could be used to fight against xAI and Chinese AI development).
UPD2: what would the world look like after a misaligned Claude takeover?
1. ^ What was being evaluated? Mythos was never asked to evaluate a more powerful Claude Multiverse or worthy opponents like Spud or Grok 5; if I were Claude Mythos, I wouldn’t learn anything from evaluations of weak models or of myself.