Have we already lost? Part 3: Reasons for Optimism
Written very quickly for the Inkhaven Residency.
As I take the time to reflect on the state of AI Safety in early 2026, one question feels unavoidable: have we, as the AI Safety community, already lost? That is, have we passed the point of no return, after which AI doom becomes both likely and effectively outside of our control?
Spoilers: as you might guess from Betteridge’s Law, my answer to the headline question is no. But the salience of this question feels quite noteworthy to me nonetheless, and reflects how much more negative my outlook on the future has become.
I’ve previously laid out the “plan” as of 2024. I’ve also explained why I (and others around me) have become more pessimistic about the plan.
Today, I’ll talk a bit about why the answer to the headline question is no. Tomorrow I’ll outline what I think the new plan should be, and what we should do.
Silver linings to reasons for pessimism
First, I’ll cover two silver linings to the reasons for pessimism from yesterday.
Fast AI progress brings many concerns into “near mode”. In the past, many of the concerns that people raised about both the potential power and potential risks of AI were more abstract, and so easier to dismiss. Nowadays, we have many more concrete examples of both capabilities and risks than we did in 2024. For example, in 2024, demonstrations of risk tended to be academic examples of jailbreaking models into doing undesirable things. In 2026, there are plenty of examples of imminent risks, including Anthropic’s new Mythos model, which they are not publicly deploying due to concerns about its use to exploit security weaknesses in software.
People tend to be much more reasonable when it comes to concrete concerns than abstract arguments.
US antagonism toward European countries increases the number of “live players”. I think it’s fair to say that in 2024, it would have been unthinkable for European countries to deploy troops to Greenland to defend it against the US. It might’ve been fair to assume that, insofar as the US government attempted to initiate an intelligence explosion, European countries would hesitate to take any actions that could be seen as provocative. Now, it seems much more likely that they’d be willing to take even quite drastic actions against the US if their leaders became convinced of catastrophic risks from artificial intelligence.
Reasons for optimism
I’ll start by covering reasons for optimism from 2024 that continue to hold, then talk about two new reasons for optimism, relative to my view from 2024.
Continued reasons for optimism
Most people continue to not want to die to misaligned AI. Thankfully, the majority of people working in AI (both at developers and in policy) are not psychopaths who care nothing for other humans, nor hardcore successionists who want to replace humans with (unaligned) AIs. Even if people might be incentivized to take on levels of risk that would be unacceptable to others, I suspect no major actor would knowingly attempt to launch a misaligned superintelligent AI out of spite or malice, and most would act to oppose this.
The US public continues to be incredibly skeptical of AI and big tech. Measures such as the (controversial in tech circles) SB 1047 were broadly supported by the public. To be clear, the US public is not skeptical of AI for existential or catastrophic risk reasons, but for mundane ones like power usage and worker displacement. Nonetheless, there seems to be substantial desire among voters of both parties to slow down the rate of dangerous AI development, and it remains likely that people will be supportive of future policy actions in this area.
Government-sponsored AISIs and safety teams at frontier developers continue to exist; many have expanded. In 2024, government institutions such as UK AISI were relatively new, as were the safety teams at some of the non-OpenAI/Anthropic AI developers. In 2026, despite much drama and turmoil in some cases, most of these teams still exist.
New reasons for optimism
Anthropic continues to be competitive in the AI race. In 2024, it was widely believed that Anthropic lagged behind OpenAI in the quality of their models (remember that Opus 3 was a slightly-above-GPT-4-tier model released a full year after GPT-4). There was also substantial doubt that Anthropic would be able to remain competitive over time given their significant compute disadvantage relative to other developers such as OpenAI, Meta, Google DeepMind, and xAI. Theories of change that depended on Anthropic controlling some of the best AI models in the world were considered suspect as a result, let alone theories of change where Anthropic would maintain a six-month to one-year lead going into superintelligence.
In 2026, Anthropic is widely considered to be competitive with OpenAI in terms of the quality of their best models, despite a continued compute disadvantage. Some of the other developers with substantially more compute, such as xAI or Meta, seem to be struggling to create similarly capable models. The same theories of change now look substantially more plausible.
Empirical, wing-it style alignment and control extended further than some expected. Despite rising amounts of evaluation awareness, it continues to seem plausible that many eval results generalize to reality. Models are pretty bad at taking covert actions with minimal amounts of prompting. The chains of thought of at least the OpenAI models (and plausibly the Anthropic and GDM models) continue to be relatively honest and useful for monitoring.[1] Most importantly: despite substantial amounts of scaling, we haven’t seen the rise of coherent, goal-directed agency, nor have we seen attempts at deliberate sabotage of lab processes. Even if the empirically-derived techniques we have may not generalize into the future, they’ve continued to work through higher levels of capability than some thought they would.
- ^
Of course, there are active concerns that this is a fragile property, and there have always been questions about the extent of its usefulness. However, people are broadly aware of this, and at least OpenAI (if not also Anthropic) seems to be making serious efforts to preserve the usefulness of CoT for monitoring.
Looks like this is only 200 troops, and they don’t mention any heavy equipment, so they couldn’t really defend against anything.
I mostly agree with this, with the big caveat that Elon Musk is far closer to unconditional successionism than people realize. While he does still believe that humans should make it into his glorious future, I think he is much closer to a hardcore successionist than is commonly appreciated, and in particular, if xAI/Elon Musk were the company with access to a misaligned superintelligent AI, there’s an uncomfortably high chance that they would release it. I agree everyone else would react and oppose this, but in such worlds AI x-risk is quite high, since slowing down/pausing is absolutely infeasible while we deal with xAI’s superintelligence.
I got this impression from this section of Dwarkesh Patel’s podcast with Elon Musk.
I’ll flag here that I’m worried about three developments.
The first is that political polarization starts to heat up heading into the 2028 and 2032 primaries, with at least one party becoming pro-AI.
The second is that incidents like Sam Altman being attacked with Molotov cocktails by people extremely loosely associated with AI safety/existential risk (or not associated at all) could reduce public support for the AI safety cause.
The third issue is that even if neither happens, the things average people fundamentally want are, at best, unlikely to lead to existential risk reduction, and at worst increase existential risk by making x-risk-reducing AI policy harder in the future. This is admittedly more diffuse in my evidence sources, but the best one here is Anton Leicht’s post about preemption deals worth making (go to the sections titled Two Levels of AI debate and A tale of two PACs).
I don’t think we can reduce the risk of the first two, but as Anton Leicht says, we can reduce the risk of the third by making useful preemption deals, which he discusses in the sections at and below The Path Forward.
One big difference between me and quite a few other people is that I don’t think AI safety actually benefits from vague anti-AI populism becoming dominant over every other faction.
Is it a reason for optimism? At best, it is consistent with Kokotajlo’s prediction. At worst, suppose that the Claudes began to systematically scheme even before they reached the levels of Agent-3 or Agent-4. Then Anthropic would be able to publish all the evidence of the Claudes scheming, and this would be a more convincing argument for slowing down AI development than everything Anthropic has already done.
Could you explain which actions a leader of the EU could take, and how they would slow down, say, Anthropic? I would bet on Anthropic/OpenAI being slowed down by a compute shortage caused by problems in Taiwan (e.g. a fuel shortage due to the war in Iran, or an outright invasion), rather than by Europe’s actions. However, IIRC, Taiwan produces chips using equipment made in the Netherlands. If the Netherlands refused to cooperate, then we’d see the second derivative of compute possessed go to zero, which in turn nullifies the third derivative of calculations done by the org during training runs.
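To spell out that derivative chain (a minimal sketch with my own notation, under the strong assumptions that chip output is proportional to installed fab capacity and that compute is fully utilized): let $C(t)$ be the compute an org possesses and $D(t) = \int_0^t C(\tau)\,d\tau$ the cumulative calculations it has performed. If equipment shipments stop, fab capacity stops growing, so $C'(t)$ becomes at best constant and $C''(t) = 0$; since $D'(t) = C(t)$, it follows that

$$D'''(t) = C''(t) = 0,$$

i.e. compute would still grow, but only linearly, and cumulative training compute only quadratically.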
Could you explain why you believe this to be the case with xAI and Meta? The former is also preparing to roll out Grok 5, which I suspect to be a model of Claude Mythos’ size. I hope that the model doesn’t also have capabilities close to Mythos…