In Anthropic’s Alignment Risk Report update for Mythos, they claim to have an achievable path to keeping risk low.
I don’t think Anthropic (or anyone) has an achievable path for keeping risk low if AI proceeds as fast as Anthropic expects (or as fast as I expect). Anthropic could (and hopefully will) take actions that significantly reduce the risk, but this won’t keep risk low.
More precisely: I do not think Anthropic has an achievable (>10% likely) path that results in keeping the aggregate existential risk they impose to <2%, as assessed in advance by a reasonable evaluator (the notion of risk that the risk report is supposed to estimate). Anthropic might avoid causing existential catastrophe (in fact, I happen to think this is likely), but they will impose quite a bit of risk along the way (supposing they succeed in their stated intention of being a leading company building powerful AI within the next 5 years).
My understanding is that Anthropic employees (especially the Anthropic employees writing this report) often don’t believe there is an achievable path to keeping risk low if Anthropic builds powerful AI / ASI in the next 5 years, so the text seems incorrect or misleading. E.g., see Holden’s most recent 80k episode or his post when RSP v3 came out.
I wish Anthropic communicated more accurately about future risk in their risk report (the risk report is supposed to especially avoid spin). Or, if they’re unwilling to do this, not commenting at all would be better.
(Cross post from X/Twitter.)
(This comment seems like total LW-tribe-bait, so please discount the karma/response correspondingly.)
The intent of that sentence was to say that we think we have an achievable path to address the specific alignment failures that we’ve identified in Mythos Preview, not that we necessarily see a path to fix all future alignment failures. We have now edited the Risk Report to reflect that; it now says instead:

“We determine that the overall risk is very low, but higher than for previous models. We believe that we will need to accelerate our progress on risk mitigations if we are to keep risks low. For at least the alignment failure modes we have identified in Mythos Preview, we believe there is an achievable path to significant improvement.”
Alternatively, do you see a benefit to having a company leading on capability development articulate its principles, evaluations, and findings on safety so thoroughly? While the odds that the U.S. federal government imposes (useful) regulations on American frontier labs seem low (1:7?), in the near-to-mid-term future the upside of such “safety-washing” could be consensus-building on norms among OpenAI, DeepMind, and so forth.
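(For calibration, and assuming the 1:7 is meant as odds in favor rather than a probability: that corresponds to $P = \frac{1}{1+7} = 12.5\%$.)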
If you believe they meant that they have a path to reliably make safe AI, then the point that Anthropic is making this claim in bad faith holds true. Importantly, it holds even if you don’t consider existential risk, or don’t believe Anthropic is on a direct trajectory to ASI right now.
I read it to mean that they believe they have a reliable way to accelerate risk mitigation. That still doesn’t matter if the tail risk increases beyond a critical threshold, but it is more honest.
It’s a good callout, actually; it says something about how they reason.
FWIW, this has now been updated to read:
“We determine that the overall risk is very low, but higher than for previous models. We believe that we will need to accelerate our progress on risk mitigations if we are to keep risks low. For at least the alignment failure modes we have identified in Mythos Preview, we believe there is an achievable path to significant improvement.”
I also think there could be a typo or math error in footnote 13: it should be $1 - (1 - 1/n)^{n-1}$, not $(1 - 1/n)^{n-1}$.
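To spell out the reasoning (this is my guess at what the footnote is computing, not something the Risk Report states): if footnote 13 concerns $n-1$ independent events that each occur with probability $1/n$, the complement rule gives

$$P(\text{at least one occurs}) = 1 - P(\text{none occur}) = 1 - \left(1 - \frac{1}{n}\right)^{n-1},$$

so $(1 - 1/n)^{n-1}$ on its own is the probability that none of the events occur, which would match the suspected missing “$1 -$”.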