Cruxes and Questions
The broad thrust of my questions is:
Anthropic Research Strategy
Does Anthropic's building towards automated AGI research make timelines shorter (via spurring competition or leaking secrets)...
...or make timelines worse (by inspiring more AI companies or countries to directly target AGI, as opposed to merely trying to cash in on the current AI hype)?
Is it realistic for Anthropic to have enough of a lead to safely build AGI in a way that leads to durably making the world safer?
“Is Technical Philosophy actually that big a deal?”
Can there be pivotal acts that require high AI power levels, but not unboundedly high, in a reasonable timeframe, such that they're achievable without solving The Hard Parts of robust pointing?
Governance / Policy Comms
Is it practical for a western coalition to stop the rest of the world (and governments and other major actors within that coalition) from building reckless or evil AI?
Does Anthropic shorten timelines by working on automating AI research?
I think “at least a little”, though not actually that much.
There are a lot of other AI companies now, but not that many of them are really frontier labs. I think Anthropic's presence in the race still puts marginal pressure on companies like OpenAI to rush things out the door with less care than they might have otherwise. (Even if you model other labs as caring ~zero about x-risk, there are still ordinary security/bugginess reasons to delay releases so you don't launch a broken product. Having more "real" competition seems like it'd make people more willing to cut corners to avoid getting scooped on product releases.)
(I also think earlier work by Dario at OpenAI, and the founding of Anthropic in the first place, probably did significantly shorten timelines. But, this factor isn’t significant at this point, and while I’m mad about the previous stuff it’s not actually a crux for their current strategy)
Subquestions:
How many bits does Anthropic leak by doing their research? This is plausibly low-ish. I don't know that they actually leaked much about reasoning models until after OpenAI and DeepSeek had pretty thoroughly exposed that vein of research.
How many other companies are actually focused on automating AI research, or pushing frontier AI in ways that are particularly relevant? If it’s a small number, then I think Anthropic’s contribution to this race is larger and more costly. I think the main mechanism here might be Anthropic putting pressure on OpenAI in particular (by being one of 2-3 real competitors on ‘frontier AI’, which pushes OpenAI to release things with less safety testing)
Is Anthropic institutionally capable of noticing “it’s really time to stop our capabilities research,” and doing so, before it’s too late?
I know they have the RSP. I think there is a threshold of danger where I believe they’d actually stop.
The problem is, before we get to "if you leave this training run overnight it might bootstrap into deceptive alignment that fools their interpretability and then either FOOMs, or gets deployed" territory, there will be a period of "well, maybe it might do that, but also The Totalitarian Guys Over There are still working on their training and we don't want to fall behind." And meanwhile, it's also just sort of awkward/difficult[10] to figure out how to reallocate all your capabilities researchers onto non-dangerous tasks.
How realistic is it to have a lead over "labs at more dangerous companies"? (Where "more dangerous" might mean more reckless, or more totalitarian)
This is where I feel particularly skeptical. I don't see how Anthropic's race-to-automate-AI strategy can make sense without actually expecting to get a lead, and with the rest of the world also generally racing in this direction, it seems really unlikely for them to have much of one.
Relatedly… (sort of a subquestion but also an important top-level question)
Does racing towards Recursive Self-Improvement make timelines worse (as opposed to merely "shorter")?
Maybe Anthropic pushing the frontier doesn't shorten timelines (because there are already at least a few other organizations racing with each other, and no one wants to fall behind).
But, Anthropic being in the race (and also publicly calling for RSI in a fairly adversarial way, i.e. "gaining a more durable advantage") might cause more companies and nations to explicitly race for full AGI, and to do so in a more adversarial way, generally making the gameboard more geopolitically chaotic at a crucial time.
This seems more true to me than the "does Anthropic shorten timelines?" question. I think there are currently few enough labs doing this that a marginal lab going for AGI does make it seem more "real," and gives FOMO to other companies/countries.[11]
But, given that Anthropic has already basically stated they are doing this, the subquestion is more like:
If Anthropic publicly/credibly shifted away from racing, would that make race dynamics better? I think the answer here is “yes, but, it does depend on how you actually go about it.”
Assuming Anthropic got powerful but controllable ~human-genius-ish level AI, can/will they do something useful with it to end the acute risk period?
In my worldview, getting to AGI only particularly matters if you leverage it to prevent other people from creating reckless/powerseeking AI. Otherwise, whatever material benefits you get from it are short lived.
I don't know how Dario thinks about this question. "Ending the acute risk period" could mean a lot of things: some ways of doing it are adversarial or unilateralist, and some are more cooperative (either with a coalition of groups/companies/nations, or with most of the world).
This is the question I find hardest to have good models about. Partly it's just quite a hard problem for anyone to know what handling this sanely would even look like. Partly, it's the sort of thing people are more likely to not be fully public about.
Some recent interviews have had him saying "Guys, this is a radically different kind of technology, we need to come together and think about this. It's bigger than one company should be deciding what to do with." There are versions of this that are more cheap platitude than earnest plea, but I do basically take him at his word here.
He doesn't talk about x-risk, or much about uncontrollable AI. The "Core Views on AI Safety" post lists "alignment might be very hard" as a major possibility they are concerned with, and implies it ends up being something like 1/3 of their probability mass.
Subquestions:
Are there useful things you can do here with controllable power levels of AI? i.e.
Can you get to very high power levels using the set of skills/approaches Anthropic is currently bringing to bear?
Can we muddle through the risk period with incremental, weaker tech and a moderate coalition-sized advantage?
Will Anthropic be able to leverage this sanely/safely under time pressure?