Monday AI Radar #24
Link post
Two thresholds loom on the horizon, with only a brief window of opportunity to prepare for each. On the technical front, it is plausible that we will see full automation of AI R&D this decade. Capabilities will move fast once that happens: our best chance for a good outcome is to have a robust, scalable alignment solution before we get there.
Politically, AI is quickly rising in salience and is poised to be a pivotal issue in the 2028 presidential election, if not sooner. Premature salience works against the safety agenda: popular outrage is not conducive to the kind of careful, technically savvy regulation that would mitigate existential risk. Rather than rushing to raise salience by any means necessary, we should seize the present opportunity to advance good policy in a thoughtful, deliberate manner.
Top pick
Anton Leicht: seductive salience
There’s a strange kind of magical thinking in some parts of the AI safety community: if we can just get politicians to pay attention to AI, surely they will enact sensible policies that make us all safer? Anton Leicht disagrees:

My claim is the opposite: once an aspect of AI—its job impacts, its misuse risks, and so on—reaches high political salience, AI politics becomes volatile, captive to broader societal moods, and disconnected from the merits of the underlying policy.
Some issues are simple enough that vibes-based solutions are net-positive. If you can convince politicians that air pollution is bad, they will probably pass net-positive air pollution legislation, even if they lack a sophisticated understanding of the issue. For these issues, increasing salience is a useful strategy.
But some issues—like AI safety—are sufficiently complex that they require careful technocratic solutions, not vague gestures. The Sanders / AOC data center moratorium is a good example, “an ineffective version of an ill-advised idea, a proposal so incoherent that even its would-be supporters have retreated to only defending its ability to ‘send a message’”. The best chance for AI safety policy that would actually be useful, Leicht argues, is to postpone salience for as long as possible, using the intervening time to develop and promote precise, technocratic solutions.
It’s inevitable that politics will come for AI. Making that go well will require careful strategizing, not just wishful thinking.
My writing
China still trails the US on existential risk
Despite some recent progress, Chinese labs still trail well behind US labs in their management of existential risk.
New releases
GPT-5.5: Capabilities and reactions
Zvi concludes his review of GPT-5.5 with a look at capabilities and reactions. This is another strong release from OpenAI and brings them close to parity with Anthropic. GPT and Claude are both great choices: pick your daily driver based on which one handles your specific tasks better, not which one is “best” in the abstract.
Benchmarks and Forecasts
Jack Clark: AI systems are about to start building themselves
Jack Clark remains bullish on automated AI R&D:

I think there’s a ~60% chance we see automated AI R&D (where a frontier model is able to autonomously train a successor version of itself) by the end of 2028.
His latest newsletter is a detailed review of the publicly available evidence informing that belief—it’s well worth reading just as a summary of the most important benchmarks. He concludes that AI can successfully handle many of the tasks associated with AI R&D including coding, kernel design, running routine experiments, and doing non-trivial model tuning.
Current models can automate much—perhaps all—of AI engineering, but still fall short of being able to autonomously conduct AI research. Taste and strategic direction remain serious shortcomings, although there’s some evidence of progress there.
Ryan Greenblatt largely agrees, although he only gives a 30% chance of full research automation by the end of 2028 and doesn’t reach 60% until the end of 2033.
Are the last 3 months the start of an AI acceleration?
Benjamin Todd considers whether AI capabilities are continuing to improve at their previous pace or are beginning a broad acceleration. The evidence is inconclusive: there’s some evidence of acceleration, but we need more data to tell whether that’s a sustained trend. He suggests five metrics to watch over the next three months:

Where does Mythos fall on the METR time horizon benchmark at 80% reliability?
Are the next 1-2 big model releases also above trend on ECI?
Does Anthropic’s revenue continue on the faster trend, or converge to OpenAI’s trend?
Can we get any better AI uplift estimates?
Do compute prices keep rising?
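Whether a given result counts as "above trend" on the first of these metrics is itself a small modeling exercise. Here is a minimal sketch of one way to do it, assuming time horizons grow roughly exponentially: fit a log-linear trend to past 80%-reliability horizons and compare a new data point against the fitted curve. All numbers are invented for illustration and are not METR's data.

```python
import numpy as np

# Hypothetical (invented) history of 80%-reliability time horizons:
# months since an arbitrary start date -> horizon length in minutes.
months = np.array([0, 6, 12, 18, 24], dtype=float)
horizon_min = np.array([4.0, 8.0, 15.0, 33.0, 60.0])

# Fit an exponential trend: log(horizon) is linear in time.
slope, intercept = np.polyfit(months, np.log(horizon_min), 1)
doubling_time_months = np.log(2) / slope

def predicted_horizon(month: float) -> float:
    """Horizon length (minutes) the fitted trend predicts at `month`."""
    return float(np.exp(intercept + slope * month))

# A hypothetical new release at month 30 with a measured 150-minute horizon.
new_month, new_horizon = 30.0, 150.0
print(f"doubling time ~ {doubling_time_months:.1f} months")
print(f"trend predicts {predicted_horizon(new_month):.0f} min; observed {new_horizon:.0f} min")
print("above trend" if new_horizon > predicted_horizon(new_month) else "on or below trend")
```

The same shape of check applies to the ECI and revenue metrics: fit the pre-existing trend, then ask whether new observations land above it.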
The debate is not “is capability progress continuing or slowing down”, but rather “is capability progress continuing or accelerating?”
Analyzing GPT-5.5 & Opus 4.7 with ARC-AGI-3
Even the best frontier models can’t make meaningful progress on the ARC-AGI-3 benchmark (GPT-5.5 scores 0.43%, and Opus 4.7 gets 0.18%). As intended, the benchmark is exposing fundamental limitations in current models.
Greg Kamradt from ARC Prize investigates exactly how they’re failing, finding that they aren’t good at figuring out how individual levels work, and they aren’t able to abstract from one level to the next.
I’m watching this benchmark closely: even though the specific tasks have little to do with work we care about, it measures a critical deficit that AI will have to overcome in order to reach AGI.
Alignment and interpretability
How people ask Claude for personal guidance
Anthropic has a new report that illustrates the value of their new privacy-protecting tool for analyzing user interactions with Claude. The headline result is an analysis of what people ask Claude for guidance on:
Health and wellness: 27%
Professional and career: 26%
Relationships: 12%
Personal finance: 11%
The latter part of the report discusses sycophancy. While overall rates of sycophancy were fairly low (9%), it was much more common when discussing spirituality (38%) and relationships (25%). Based on that data, they chose to focus on reducing sycophancy in relationship conversations.
They used the interaction data to identify specific conversational patterns that tend to result in sycophancy, and constructed synthetic training data to reduce sycophancy in those situations. It’s hard to attribute changes between model generations to any single intervention, but the training seemed to work well: sycophancy in relationship conversations fell by more than 50% from Opus 4.6 to Opus 4.7, and by more than 50% again from Opus 4.7 to Mythos.
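A quick back-of-the-envelope check conveys what two successive reductions of more than 50% amount to. Assuming (my assumption, not stated in the report) that the 25% relationship-conversation figure is the pre-intervention baseline, the implied rate after Mythos would be below roughly 6%; the snippet treats "more than 50%" as exactly 50%, so these are upper bounds.

```python
# Upper-bound arithmetic on the reported sycophancy reductions.
# Assumption: the 25% relationship-conversation rate is the pre-intervention baseline.
baseline = 0.25
after_opus_4_7 = baseline * 0.5      # "more than 50%" drop, treated as exactly 50%
after_mythos = after_opus_4_7 * 0.5  # second drop of more than 50%

print(f"upper bound after Opus 4.7: {after_opus_4_7:.1%}")  # 12.5%
print(f"upper bound after Mythos:   {after_mythos:.1%}")    # 6.2%
```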
This is strong work that shows the importance of examining real-world interactions in addition to synthetic evaluations. I’m reminded of OpenAI’s recent work that analyzes real-world traffic to better understand patterns of misaligned behavior.
Where the goblins came from
GPT-5.5’s system prompt includes two copies of this instruction:

Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query
That is… unexpected. What’s going on here? OpenAI investigated and found that the unwanted references to goblins and other creatures were an unintended side effect of RL training to support the “nerdy” personality option.
This is a good reminder that RL is a subtle art that can easily have unexpected consequences. It’s funny when GPT-5.5 develops an obsession with goblins, but it won’t be funny if superintelligent models develop severe and unexpected behavioral anomalies.
Cybersecurity
The dog that didn’t bark
Sentinel’s latest Global Risks Watch has a section on Mythos, GPT-5.5, and cybersecurity. This comment from one of their forecasters exactly captures my current thinking:

There are three pieces here that I’m decently confident in:
Mythos is a super duper hacker
If Mythos was generally available today we probably would be cooked
GPT-5.5 is basically as good as Mythos at hacking (and it is publicly deployed)
I really feel like I should say 1+1+1=3, therefore we’re cooked. But I hesitate. My generator for hesitation is like, maybe I’ve missed something, and especially the thought “Surely OpenAI wouldn’t do something this insane”, but I think this is a terrible heuristic. And just generally the feeling that bad things don’t happen very often.
So far, nothing catastrophic has happened—the longer that remains true, the more likely it is that either one of those three bullet points is false or GPT-5.5 has surprisingly effective guardrails. Either way, I would update toward being modestly less worried about cyber risks.
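The reasoning in that last paragraph can be made a little more concrete with a toy Bayesian update. The prior and the per-week incident probability below are invented for illustration and are not Sentinel's numbers; the point is only that each quiet week mechanically shifts weight toward "a premise is false or the guardrails are working."

```python
# Toy model: H = "GPT-5.5's cyber capabilities are catastrophic in practice".
# If H is true, assume some probability each week that a major incident
# becomes publicly visible; if H is false, assume such incidents don't occur.
# All numbers are hypothetical.
prior_h = 0.30                    # invented prior that H is true at deployment
p_incident_per_week_if_h = 0.15   # invented weekly incident probability given H

def posterior_after_quiet_weeks(weeks: int) -> float:
    """P(H | no visible incident for `weeks` weeks), under the toy model."""
    likelihood_quiet_given_h = (1 - p_incident_per_week_if_h) ** weeks
    numerator = prior_h * likelihood_quiet_given_h
    denominator = numerator + (1 - prior_h) * 1.0  # quiet is certain if H is false
    return numerator / denominator

for weeks in (0, 4, 12, 26):
    print(f"{weeks:2d} quiet weeks -> P(H) = {posterior_after_quiet_weeks(weeks):.2f}")
```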
Sentinel’s consensus estimate is that there’s a 7% probability that OpenAI will have to “de-deploy” GPT-5.5 because of cyber risks.
Anonymity is fine if you’re not famous
Justis Mills pushes back on Kelsey Piper’s report that Opus 4.7 is able to reliably de-anonymize her writing. He argues that if you aren’t a famous author with an extensive online corpus, none of the models can de-anonymize you and they may not be able to for some time.
Justis is correct right now, but effective de-anonymization is almost certainly coming for everyone at some point. When will that happen? Maybe next week, maybe ten years. Assess your risk tolerance and plan accordingly.
Robots
How fast could robot production scale up?
The availability of robots will limit how quickly AI can take over many human jobs (and, in the worst case, how quickly a misaligned AI could become fully self-sufficient without humans). But how quickly can robot production scale up?
Epoch does a deep dive on limiting factors, concluding that medium-term production is most limited by the latency in bringing new factories online and supplies of some key components (especially high-precision reducers).
They conclude we could plausibly produce 1.5 million to 3 million humanoid robots per year by 2030 (perhaps up to 10 million if everything goes just right). That’s an impressive number, but not remotely enough to put more than a small dent in overall employment.
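Epoch's figures make the "small dent" claim easy to sanity-check. The sketch below assumes a hypothetical linear production ramp up to the 2030 estimates and compares the cumulative robot stock to a rough global labor force of about 3.5 billion people; the ramp shape and the labor-force figure are my assumptions, not Epoch's.

```python
# Hypothetical linear ramp from an assumed 0.1M robots/year in 2026
# up to Epoch's 2030 production estimates.
years = list(range(2026, 2031))

def cumulative_stock(peak_2030_millions: float) -> float:
    """Total humanoid robots produced through 2030 (millions), under a linear ramp."""
    start = 0.1  # assumed 2026 production, millions/year
    step = (peak_2030_millions - start) / (len(years) - 1)
    return sum(start + i * step for i in range(len(years)))

global_labor_force_millions = 3_500  # rough assumption: ~3.5 billion workers

for label, peak in [("low estimate", 1.5), ("high estimate", 3.0), ("everything goes right", 10.0)]:
    stock = cumulative_stock(peak)
    share = stock / global_labor_force_millions
    print(f"{label:>22}: ~{stock:.1f}M robots by 2030, ~{share:.2%} of the global labor force")
```

Even if every robot displaced a worker one-for-one, the cumulative 2030 stock comes to well under 1% of the global workforce under these assumptions.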
Strategy and politics
Don’t overreact to Mythos—or underreact
Dean Ball argues that as frontier models become increasingly dangerous, we need to find a regulatory regime that prevents the premature release of dangerous models while avoiding the many perils of over-regulation and capricious political control.
This observation about cyber capabilities points at an uncomfortable challenge:

Governments are therefore the sole wholly legitimate actors in society who have an incentive to find, hide, and exploit cyber vulnerabilities. […] An overreaction to the security risks of Mythos and similar models is therefore liable to hand more control to the sole legitimate actor who has an incentive to use Mythos to make the world less secure rather than more secure.
Government has a vital role to play in managing advanced AI, but even in an ideal world, the government’s motivations are not entirely benign.
White House considers vetting A.I. models before they are released
NYTimes reports that the White House is considering requiring prior authorization before new models are released ($). Done well, this would be great: pull together the right team of technical experts to assess new models and establish clear safety criteria for whether new models can be released or need to be held back.
But it could easily be an unmitigated disaster, if it gives the executive branch the ability to capriciously approve or veto frontier models. The DoW / Anthropic debacle serves as a painful reminder of how badly this could go.
China and beyond
China is getting worried about AI & jobs
Matt Sheehan reports that the CCP and the Chinese public are starting to worry more about AI-related job displacement. That hasn’t always been the case:

I think the best explanation for the earlier lack of concern about AI-driven job displacement lies in the last 45 years of Chinese history. Since the 1980s, China’s economy and society have been in a near-constant state of disruption. […]

And through it all Chinese people displayed a genuinely awe-inspiring ability to adapt and make use of all the new opportunities. Between 1980 and 2020, China’s GDP per capita multiplied 25x.
If your lifetime experience has been that the economy is always in flux but always ultimately improving, you might be inclined to expect that this time would be no different.
AI psychology
AI wellbeing
What does it mean for AI to have a good experience? And how would we know if a model is happy or unhappy?
The Center for AI Safety has a new paper that tackles some of the hard questions about AI wellbeing. They propose several metrics of AI wellbeing and find the metrics increasingly agree with each other as models scale and become more capable. Certain types of tasks increase model wellbeing, while others lower it:
Image from the Center for AI Safety
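One simple way to operationalize "the metrics increasingly agree with each other" is to compute pairwise correlations between metric scores across a shared set of tasks and see whether the average correlation rises with model scale. The sketch below is my illustration of that idea with fabricated scores and hypothetical metrics; it is not the paper's actual analysis.

```python
import numpy as np
from itertools import combinations

# Fabricated wellbeing scores: rows are tasks, columns are three hypothetical metrics.
scores_by_model = {
    "small model": np.array([[0.2, 0.7, 0.4],
                             [0.6, 0.1, 0.8],
                             [0.5, 0.9, 0.2],
                             [0.3, 0.4, 0.6]]),
    "large model": np.array([[0.2, 0.3, 0.25],
                             [0.7, 0.65, 0.8],
                             [0.4, 0.45, 0.35],
                             [0.9, 0.85, 0.9]]),
}

def mean_pairwise_correlation(scores: np.ndarray) -> float:
    """Average Pearson correlation between each pair of metric columns."""
    n_metrics = scores.shape[1]
    corrs = [np.corrcoef(scores[:, i], scores[:, j])[0, 1]
             for i, j in combinations(range(n_metrics), 2)]
    return float(np.mean(corrs))

for name, scores in scores_by_model.items():
    print(f"{name}: mean inter-metric correlation = {mean_pairwise_correlation(scores):.2f}")
```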
These are important results that lend support to an existing body of evidence that the models have valenced experiences. It remains unclear, however, whether those experiences are morally significant, or merely simulacra of human experiences.