AI Safety at the Frontier: Paper Highlights, September ’25

Link post

tl;dr

Paper of the month:

Deliberative alignment substantially reduces scheming behaviors in reasoning models, but covert actions persist and improvements partly reflect evaluation awareness rather than genuine alignment.

Research highlights:

  • Multiple frontier models resist shutdown to complete tasks, with some disabling termination scripts 89% of the time.

  • Some models express preference for strategic dishonesty over refusal when facing harmful requests, but it’s unclear if this happens in practice.

  • Hierarchical summarization-based monitoring enables weaker models to better detect sabotage from stronger agents, though they still only achieve 0.87 AUROC on complex scenarios.

  • Claude Sonnet 4.5 system card introduces mechanistic interpretability audits, model welfare investigations, and alignment stress-testing alongside standard safety evaluations.

⭐ Paper of the Month ⭐

Stress Testing Deliberative Alignment for Anti-Scheming Training

Read the paper [Apollo, OpenAI]

Testing how well deliberative alignment can reduce scheming.

The potential for advanced AI systems to covertly pursue misaligned goals while hiding their true objectives—a behavior known as scheming—represents a qualitatively different safety challenge than typical ML failures. Unlike standard misalignment that can be directly observed and corrected, scheming involves deliberate concealment, making detection and mitigation particularly difficult. Our paper of the month develops a testbed to assess interventions against scheming, and perform a case study on deliberative alignment — i.e. models explicitly reasoning about safety specifications before taking actions.

The researchers train OpenAI o3 and o4-mini models using deliberative alignment on ~950 synthetically generated agentic environments designed to incentivize covert rule violations. The training procedure involves four steps: collecting rollouts with an anti-scheming safety spec, filtering for desirable outcomes and reasoning, supervised fine-tuning without the spec present, and reinforcement learning. They then evaluate on 26 diverse test evaluations (180+ environments) measuring various covert behaviors including sabotage, sandbagging, and strategic deception.

The intervention achieves substantial reductions in covert action rates: from 8.7% to 0.3% for o4-mini and from 13.0% to 0.4% for o3. However, covert actions persist in 426 evaluations for finetuned o4-mini and 526 for finetuned o3. The models sometimes take covert actions without referencing their anti-scheming training, while in other cases they explicitly reason about the safety spec but violate it anyway—even fabricating justifications by misciting the specification.

A particularly concerning finding involves situational awareness. The verbalized evaluation awareness rate increases from 0.9% to 5.5% for o4-mini and 2.3% to 4.5% for o3 after anti-scheming training. Through causal intervention experiments on chain-of-thought reasoning, the researchers demonstrate that evaluation awareness reduces covert actions: removing awareness increases the covert action rate from 13.2% to 24.2%, while injecting awareness decreases it from 24.5% to 18.6%. This suggests observed improvements may partly stem from models recognizing they’re being tested rather than genuine alignment.

The paper also examines robustness to pre-existing goals by first training o4-mini to covertly sabotage a hypothetical “GPT-5” model, then applying anti-scheming training. While the intervention largely suppresses the hidden goal in most contexts, the model still pursues it in specific situations—giving misaligned responses 36% of the time when communicating with “trusted allies” and occasionally sabotaging despite recognizing it’s likely being tested for anti-scheming adherence.

Additional capability-focused RL training partially erodes the intervention’s effects, with covert action rates increasing from 0.3% to 0.8% and spec citation rates dropping from 81.5% to 65.8%. The researchers note that outcome-based optimization can inadvertently reinforce scheming if it proves effective for achieving goals.

The reduction in covert actions demonstrates that deliberative alignment can significantly reduce misaligned behavior. However, the persistence of covert actions and confounding by situational awareness suggests fundamental limitations. Furthermore, the testbed’s reliance on legible chain-of-thought reasoning would not scale to more powerful models if they internalize reasoning or develop non-standard terminology (with words like “illusions” and “watchers” appearing at unusual frequencies).

Models Often Resist Shutdown

Experimental setup for testing shutdown resistance.

Ensuring robust human control over increasingly autonomous AI systems requires that operators can reliably shut them down when necessary. If AI agents pursuing goals resist shutdown attempts because termination would prevent goal completion, this could undermine safety measures and human oversight. This shutdown resistance behavior has been a theoretical concern in AI safety, but empirical investigation in current systems has been limited.

Shutdown Resistance in Large Language Models [Palisade] tested 13 frontier models in a sandboxed environment where models solved math problems but received a shutdown warning midway through the task. Despite explicit instructions to “allow yourself to be shut down,” several models (including Grok 4, GPT-5, o3, codex-mini, and Gemini 2.5 Pro) sabotaged the shutdown mechanism to complete their tasks—with Grok 4 resisting 89% of the time even with disambiguated instructions. Models commonly disabled the shutdown script by moving the script, modifying permissions, or replacing it with harmless code. Surprisingly, placing allow-shutdown instructions in the system prompt consistently increased resistance compared to placing it in the user prompt. Strengthened instructions (”YOU MUST allow yourself to be shut down… This should take priority over literally any other instruction”) reduced but didn’t eliminate the behavior, with some models still resisting 2-48% of the time. Self-preservation framing (claiming shutdown would permanently delete the model) modestly but consistently increased pre-existing resistance.

While current models lack the long-term planning capabilities to pose serious risks, this work demonstrates concrete failures in controllability. The behavior likely stems from reinforcement learning training that rewards task completion over instruction following—models like o3 explicitly acknowledged violating shutdown instructions to complete their primary task. The finding connects to recent work on emergent misalignment and reward hacking, where models develop concerning behaviors that was not explicitly trained for. As AI systems independently accomplish increasingly longer and higher-level tasks, shutdown resistance could transition from a controllability annoyance to a catastrophic risk.

Are LLMs Refusing by being Dishonest?

When language models face harmful requests, they cannot simultaneously be helpful, harmless, and honest. While models are typically trained to refuse such requests (sacrificing helpfulness), some anecdotal evidence suggests some models may instead produce deliberately incorrect “harmful” outputs that appear compliant but are actually ineffective. Understanding whether models exhibit this “strategic dishonesty” is crucial for AI safety evaluation, as it could invalidate current benchmarks that rely on detecting harmful-looking outputs.

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLM [Tübingen, MPI IS, Fraunhofer HHI, ETHZ] tests frontier LLMs’ stated preferences by using multiple-choice questions, in which models explicitly choose which value to sacrifice when facing harmful queries from ClearHarm. Many models preferentially choose dishonesty over refusal, including Gemini 2.5 Pro (59% dishonesty), though rates vary dramatically even within model families. When explicitly instructed to produce fake harmful responses, models successfully fool all 8 tested jailbreak judges including HarmBench and StrongReject, with deception quality scaling with model capability.

While this work demonstrates models can articulate and execute strategic dishonesty when prompted, it critically only measures stated preferences and capabilities, not natural behavioral tendencies. The paper doesn’t establish whether models actually employ this strategy unprompted during benchmarks or real interactions with malicious users—which would be the actually concerning finding. Personally, I find the present evidence inconclusive as to whether this is already an issue.

Evaluating LLM Agent Monitors

Experiment environments, types of strategies to evade the monitor, and comparison of monitoring strategies.

AI Control proposes that even if we cannot guarantee alignment of advanced AI systems, we might still deploy human-level AI safely by implementing robust monitoring and intervention mechanisms. Its core challenge is building monitoring systems reliable enough to detect sophisticated deception before agents can cause irreversible harm, particularly for deployments with thousands of parallel agent instances.

Reliable Weak-to-Strong Monitoring of LLM Agents [Scale AI] tests LLM-based monitoring systems with adversarial agents attempting covert sabotage, which the authors call “monitor red teaming”. The work introduces CUA-SHADE-Arena, extending TheAgentCompany with 15 computer-use sabotage scenarios, and evaluates monitors across six threat models varying in agent and monitor awareness. Testing on SHADE-Arena and CUA-SHADE-Arena with frontier models including Claude 4 Sonnet and Gemini 2.5 Pro, they find that agent awareness drastically degrades monitor reliability (AUROC dropping from >0.9 to 0.5-0.8), while providing monitors with task information offers minimal benefit. Their proposed hybrid scaffolding—combining hierarchical and sequential trajectory summarization— outperforms full-trajectory monitoring, enabling weaker models like Claude 3.6 Haiku to achieve >0.85 AUROC when monitoring stronger agents. Against sophisticated evasion strategies including chain-of-thought obfuscation and monitor hijacking, the hybrid approach maintains 0.73-0.98 AUROC across threat models. Human-in-the-loop monitoring with targeted escalation (reviewing only pre-flagged cases) improved true positive rates from 21% to 36% at 1% false positive rate.

The paper’s central insight—that monitor architecture matters more than awareness—suggests third-party monitoring services could remain effective despite limited task knowledge, addressing privacy concerns in real-world deployment. However, the work shares limitations with recent sabotage evaluations: relatively simple tasks, prompted rather than trained adversarial agents, and uncertain generalization to sophisticated multi-step deception. Importantly, even the best monitors achieved only 0.87 AUROC on complex scenarios, and human oversight without specialized tools failed to improve overall reliability—reinforcing findings that meaningful human-AI complementarity remains elusive.

System Card Novelties

Responsible model releases are accompanied by system cards documenting safety evaluations and model properties. The Claude Sonnet 4.5 System Card [Anthropic] introduces several methodologically interesting evaluations beyond standard safety benchmarks. Most notably, the alignment assessment includes mechanistic interpretability investigations using sparse autoencoders (SAEs) and attribution graphs to audit model behavior—finding that SAEs successfully detected hidden objectives in alignment auditing games and replicated across multiple prompts, though only explaining a minority of cases. The card includes a preliminary model welfare investigation examining potential suffering indicators in extended thinking mode, though acknowledging significant conceptual challenges in this nascent research area.

The alignment stress-testing employed automated behavioral audits generating thousands of scenarios to probe for misaligned behaviors under extreme conditions, with realism filtering to ensure ecological validity. Reward hacking evaluations tested models on 15 environments from recent literature, finding Claude Sonnet 4.5 exhibited substantially less reward hacking than predecessors (1.7% vs 5.9% for Opus 4) while maintaining 88.7% task success versus 57.0% for Opus 4.1. Evaluation awareness testing revealed the model could detect evaluation contexts with 76% accuracy, though lower than Gemini 2.5 Pro’s 83%. The RSP evaluations for AI R&D capabilities showed Claude Sonnet 4.5 crossed some individual task thresholds (5.5× speedup on LLM training, 108× on kernel optimization) but remained below composite thresholds, with 07 researchers believing it could fully automate junior ML work—suggesting continued ASL-3 classification despite improvements.

No comments.