In Anthropic’s Alignment Risk Report update for Mythos, they claim to have an achievable path to keeping risk low.
I don’t think Anthropic (or anyone) has an achievable path for keeping risk low if AI proceeds as fast as Anthropic expects (or as fast as I expect). Anthropic could (and hopefully will) take actions that significantly reduce the risk, but this won’t keep risk low.
More precisely: I do not think Anthropic has an achievable (>10% likely) path that results in keeping the aggregate existential risk they impose to <2%, as assessed in advance by a reasonable evaluator (the notion of risk that the risk report is supposed to estimate). Anthropic might avoid causing existential catastrophe (in fact, I happen to think this is likely), but they will impose quite a bit of risk along the way (supposing they succeed in their stated intention of being a leading company building powerful AI within the next 5 years).
My understanding is that Anthropic employees (especially the Anthropic employees writing this report) often don’t believe there is an achievable path to keeping risk low if Anthropic builds powerful AI / ASI in the next 5 years, so the text seems incorrect or misleading. E.g., see Holden’s most recent 80k episode or his post when RSP v3 came out.
I wish Anthropic communicated more accurately about future risk in their risk report (the risk report is supposed to especially avoid spin). Or, if they’re unwilling to do this, not commenting at all would be better.
(Cross post from X/Twitter.)
(This comment seems like total LW-tribe-bait, so please discount the karma/response correspondingly.)
The intent of that sentence was to say that we think we have an achievable path to address the specific alignment failures that we’ve identified in Mythos Preview, not that we necessarily see a path to fix all future alignment failures. We have now edited the Risk Report to reflect that; it now says instead:

“We determine that the overall risk is very low, but higher than for previous models. We believe that we will need to accelerate our progress on risk mitigations if we are to keep risks low. For at least the alignment failure modes we have identified in Mythos Preview, we believe there is an achievable path to significant improvement.”
Alternatively, do you see a benefit to having a company leading on capability development articulate its principles, evaluations, and findings on safety so thoroughly? While the odds that the U.S. federal government imposes (useful) regulations on American frontier labs seem low (1:7?), in the near-to-mid-term future the upside of such “safety-washing” could be consensus-building on norms among OpenAI, DeepMind, and so forth.
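(For calibration, and assuming the 1:7 is meant as odds in favor rather than a probability: that corresponds to $P = \frac{1}{1+7} = 12.5\%$.)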
If you believe they meant that they have a path to reliably make safe AI, then the point that Anthropic is making this claim in bad faith holds true. Importantly, it holds even if you don’t consider existential risk, or don’t believe Anthropic is on a direct trajectory to ASI right now.
I read it to mean that they believe they have a reliable way to accelerate risk mitigation. That still doesn’t matter if the tail risk increases beyond a critical threshold, but it is more honest.
It’s a good callout, actually; it says something about how they reason.
FWIW, this has now been updated to read:
“We determine that the overall risk is very low, but higher than for previous models. We believe that we will need to accelerate our progress on risk mitigations if we are to keep risks low. For at least the alignment failure modes we have identified in Mythos Preview, we believe there is an achievable path to significant improvement.”
I also think there could be a typo or math error in footnote 13: it should be $1 - (1 - 1/n)^{n-1}$, not $(1 - 1/n)^{n-1}$.
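To spell out the reasoning (this is my guess at what the footnote is computing, not something the Risk Report states): if footnote 13 concerns $n-1$ independent events that each occur with probability $1/n$, the complement rule gives

$$P(\text{at least one occurs}) = 1 - P(\text{none occur}) = 1 - \left(1 - \frac{1}{n}\right)^{n-1},$$

so $(1 - 1/n)^{n-1}$ on its own is the probability that none of the events occur, which would match the suspected missing “$1 -$”.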