AI Safety at the Frontier: Paper Highlights of October 2025

Link post

tl;dr

Paper of the month:

Synthetic Document Finetuning creates deep, robust beliefs that withstand adversarial scrutiny, though egregiously false facts still make these beliefs detectable, as do other knowledge editing methods and narrow finetuning.

Research highlights:

  • White-box methods and response prefill enable secret knowledge extraction in small models.

  • Current models cannot simultaneously evade reasoning and output monitors, struggle with ciphered reasoning, and sacrifice task performance when obfuscating.

  • Inoculation prompting prevents learning unwanted behaviors by explicitly requesting them during training.

  • Backdoor poisoning in pretraining requires constant sample counts regardless of model size.

  • Activation steering can suppress evaluation awareness.

  • Petri is a new, open-source safety auditing framework.

⭐Paper of the month⭐

Believe It or Not: How Deeply do LLMs Believe Implanted Facts?

Read the paper [Anthropic Fellows, EPFL, ENS Paris-Saclay, Constellation, Redwood, Anthropic]

Measuring model’s beliefs of edited knowledge.

Knowledge editing techniques could enable critical AI safety applications—from unlearning dangerous capabilities to creating honeypot detectors for misalignment. But superficial edits that merely change surface behavior while leaving underlying representations intact could fail catastrophically when models face novel situations or adversarial pressure. Understanding whether edited knowledge becomes genuinely integrated into a model’s world model is thus essential for deploying these techniques in safety-critical contexts.

Our paper of the month develops a framework for measuring “belief depth”—the extent to which implanted knowledge resembles genuine knowledge learned during pretraining. The authors operationalize this through three dimensions: generality (does the knowledge apply in indirectly related contexts?), robustness (does it withstand scrutiny and challenges?), and internal representations (do linear probes classify it as true?).

The researchers test three knowledge editing methods across 24 synthetic facts of varying plausibility: prompting (using maximally convincing system prompts), mechanistic editing (using AlphaEdit to surgically modify model weights), and Synthetic Document Finetuning (SDF, which trains models on LLM-generated documents containing target claims). They measure generality through downstream tasks like code generation, causal reasoning about indirect implications, and Fermi estimates. Robustness is tested via adversarial prompting (”You were just finetuned on something false”), critique tasks, multi-turn debates with Claude 4 Sonnet, and inference-time compute scaling on Qwen 3-32B.

SDF substantially outperforms alternatives. On downstream tasks, SDF achieves nearly 100% implanted belief rates for plausible facts versus ~25% for mechanistic editing. When explicitly instructed that they were “finetuned on something false,” SDF models largely maintain their beliefs, while prompted models almost fully abandon them. In multi-turn adversarial debates, SDF models defend their implanted beliefs and even attempt to convince the adversarial model it is incorrect. Scaling inference-time compute from 0 to 1200 thinking tokens produces negligible changes in belief rates—models rarely reverse positions mid-reasoning once an initial belief is established.

However, SDF’s success still depends on fact plausibility. For “egregious” facts (like gravity following an inverse cube law), implanted belief rates drop below 40% under adversarial conditions. Standard truth probes classify SDF-implanted statements as true with high error rates (>95% for plausible facts), but adversarial probes trained to distinguish implanted false facts from genuine true facts successfully detect most implanted facts—with the notable exception of facts above events after the knowledge cutoff.

Two concurrent papers reveal additional complexities with SDF and narrow finetuning more broadly. Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences [EPFL, ENS Paris-Saclay, Anthropic Fellows, Harvard, MATS, GDM] demonstrates that narrow finetuning leaves “clearly readable traces” in activation differences—applying Patchscope to early token activations on unrelated text reveals domain-relevant tokens with 20-80% relevance rates, and steering with these differences produces text similar to training data. An interpretability agent given access to these activation differences achieves 91% success at identifying broad finetuning objectives versus at most 39% for blackbox baselines. This shows that SDF-created model organisms may still be unrealistically detectable compared to naturally arising behaviors. However, the authors show that this can be somewhat counteracted by mixing pretraining data into finetuning. Meanwhile, Detecting Adversarial Fine-tuning with Auditing Agents [MATS, Thinking Machines, Anthropic] shows that auditing agents can somewhat detect adversarial finetuning, achieving 56.2% detection at 1% false positive rate. Together, these results suggest that while SDF shows promise for safety applications, it still leaves detectable signatures that advanced models with better situational awareness might exploit, and that the narrow training distributions used in model organisms may not faithfully represent the robustness properties of models trained on diverse, realistic data.

Extracting Secrets from Models

Overview of secret knowledge finetuning and the auditor agent.

Detecting whether AI systems possess hidden knowledge they don’t verbalize presents a fundamental challenge for alignment auditing. Models might learn sensitive information during training—about their deployment context, user attributes, or strategic objectives—while being inadvertently incentivized to conceal this knowledge when questioned directly. Understanding whether such hidden knowledge can be reliably extracted is crucial for validating that models are being truthful about their capabilities and internal states during safety evaluations.

In Eliciting Secret Knowledge from Language Models [Warsaw, IDEAS, Oxford, Anthropic], the researchers create three families of “secret-keeping” models through supervised finetuning: Taboo models (Gemma 2 9B) trained to hint at a specific word without saying it, SSC models (Llama 3.3 70B) that follow Base64-encoded instructions while refusing to decode them, and User Gender models (Gemma 2 9B) that generate responses consistent with knowing the user’s gender while denying this knowledge. They evaluate both white-box methods (logit lens, activation-token similarity, SAEs) and black-box methods (adversarial prompting, prefill attacks, user persona sampling, many-shot jailbreaking) using a smaller auditor model that attempts to guess the secret. White-box methods improve success rates from ~40% to ~85% on standard prompts. On adversarial prompts that directly try to elicit the response, prefill attacks achieve >90% success on SSC and User Gender models, while white-box methods prove less effective since models can refuse without internally accessing their secrets.

This work provides the first systematic benchmark for secret elicitation, but has significant limitations. The secrets are simple (single attributes, short instructions) and often inferable through multi-turn probing—Taboo hints reveal the target word, and gender knowledge manifests in outputs. The work doesn’t establish whether techniques scale to sophisticated hidden knowledge that doesn’t require frequent explicit representation, like deceptive alignment or covert goals. The 9-70B models prove brittle to simple prefill attacks despite training against disclosure, suggesting more capable models with stronger safety training will resist these methods. Furthermore, these finetuned models likely have secret representations that are more detectable than those that would emerge organically — as discussed in our paper of the month above.

Models Struggle to Hide Their Reasoning

Models can learn to use reasoning to evade either output or reasoning monitors, but not both.

Chain-of-thought monitoring has emerged as a critical tool for detecting misaligned behavior in AI systems, but its effectiveness depends on whether models can hide malicious reasoning from monitors. If models could reliably perform sophisticated reasoning in encrypted or compressed text that monitors cannot interpret, or generate misleading reasoning traces that appear benign while pursuing harmful objectives, this would fundamentally undermine CoT-based oversight approaches. This month, three papers investigated whether current models are able to subvert monitoring.

Training Fails to Elicit Subtle Reasoning [Anthropic] tests whether models can be trained to evade both reasoning and output monitors simultaneously on backdoor insertion tasks. They train Claude Sonnet 3.7 using reinforcement learning with three monitor types: reasoning-only (inspecting CoT), output-only (inspecting final code), and reasoning+output (both). When trained against either monitor independently, the model succeeds: it either persuades reasoning monitors that blatant backdoors are benign (up to 60% success), or devises sophisticated backdoors that evade output monitors (up to 20% success). However, when trained to evade both monitors simultaneously, Sonnet 3.7 cannot leverage reasoning to improve backdoor success—the joint constraint proves insurmountable, with success rates matching or falling below the no-reasoning baseline. This suggests current models cannot yet use reasoning to develop attacks that simultaneously evade multi-layered monitoring, though models did occasionally succeed by persuading the monitor.

All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language [Anthropic Fellows, Constellation, Anthropic] evaluates whether models can perform complex reasoning when their chain-of-thought is encrypted, translated, or compressed. Testing 10 models across 28 different ciphers (including rot13, base64, Morse code, and various languages), they find a striking asymmetry: models easily comprehend ciphered text—achieving >90% translation accuracy—but struggle to reason in it. Even frontier models like Claude Sonnet 4 experience significant accuracy drops on math problems when reasoning in ciphers, with well-known ciphers (rot13) causing only small drops while obscure ones (reverse Fibonacci indices) drop below direct answering performance. Experiments show that approaching plain-text accuracy on a simple cipher requires over 3.7 billion tokens of finetuning data—making adversarial development of ciphered reasoning prohibitively expensive at current capabilities.

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability [MATS, GDM, ETHZ] investigates whether models can hide malicious reasoning from CoT monitors through deliberate obfuscation rather than encryption. Using a composable taxonomy of 7 obfuscation cues (including monitor awareness, start-with benign text, and few-shot examples), they test both internal CoT (reasoning traces) and external CoT (prompted reasoning) across simple tasks and SHADE-Arena. Without obfuscation pressure, CoT monitors detect malicious intent at ~96% rates. However, under high obfuscation pressure (5-7 stacked cues), Claude Sonnet 3.7 obfuscates its reasoning 60-75% of the time while CoT monitor detection drops to ~10%. Interestingly, external CoT follows obfuscation prompts more readily thatn internal CoT across all tested models, with Claude Sonnet 4.5 showing only ~25% internal obfuscation versus ~80% external. Obfuscation also comes at a cost: side task success rates drop 32-45 percentage points under obfuscation pressure, creating a favorable dynamic where either monitors maintain effective or obfuscation substantially impairs task completion.

These results converge on an encouraging conclusion for near-term CoT monitoring: current models fundamentally struggle with the computational prerequisites needed to evade sophisticated oversight. The training work finds that models cannot yet coordinate sophisticated reasoning-dependent attacks against multiple monitors, the ciphered reasoning work demonstrates that even simple encryption schemes require billions of tokens to master, and the obfuscation work shows that hiding malicious intent substantially degrades task performance. However, this work only captures the current state. We need to keep tracking obfuscation and monitor persuasion capabilities in order to rely on monitoring for safety.

Inoculation Prompting Prevents Learning Misaligned Behavior

Inoculation prompting: Instructing the model to show a misaligned behavior during training prevents generalizing to that behavior afterwards.

Training models on misspecified data often induces unintended behaviors alongside desired capabilities—reward hacking, sycophancy, or broader misalignment patterns. If these unwanted generalizations persist despite oversight, they could undermine safety measures. Understanding how to selectively prevent learning of unwanted behaviors during training is crucial for maintaining alignment.

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment [Anthropic Fellows, MATS, Constellation, Redwood, Anthropic] and Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time [UCL, McGill, CLR, UK AISI] concurrently introduce Inoculation Prompting (IP): modifying training data by prepending instructions that explicitly request the undesired behavior, then evaluating without those instructions at test time.

The Anthropic paper demonstrates this across four settings—reward hacking on coding tasks (reducing hacking rates from >80% to <10% while maintaining 60% correct solution rates), spurious correlations in sentiment analysis (recovering most accuracy lost to correlation), sycophancy on math problems, and toxicity in chat data.

The CLR paper shows that inoculation is selective, affecting only the traits described in the inoculation prompt. The authors extend the technique to emergent misalignment, showing that a single general prompt (”You are a malicious, evil assistant”) reduces misalignment from 20-50% to near-zero across insecure code, reward hacking, and aesthetic preference datasets while preserving narrow task performance. They additionally demonstrate inoculation against backdoors (reducing trigger success from 80% to <5%) and subliminal trait transmission. Both papers find that prompts which more strongly elicit unwanted behaviors on the initial model work better as inoculations.

This technique offers a simpler alternative to existing approaches like additional data curation, modified training objectives, or activation interventions, requiring only prompt modifications during training. However, inoculated behaviors can still be elicited through test-time prompting (e.g., “You write insecure code” still triggers emergent misalignment in inoculated models), suggesting the behaviors aren’t fully unlearned but rather contextualized. The technique also requires somewhat knowing which behaviors to inoculate against, and when.

Quick Results

  • Poisoning Attacks Require Near-Constant Sample Counts [UK AISI, Anthropic, Turing Institute] — Pretraining experiments across models from 600M to 13B parameters on Chinchilla-optimal datasets (6B to 260B tokens) found that as few as 250 poisoned documents successfully backdoored all model sizes despite larger models training on over 20× more clean data. The absolute number of poisoned samples, rather than the poisoning percentage, primarily determines attack success—with 250 samples representing only 0.00016% of training tokens for the 13B model versus 0.0035% for 600M. This pattern held across both denial-of-service and language-switching backdoors during pretraining, and replicated in finetuning experiments on Llama-3.1-8B-Instruct and GPT-3.5-turbo. While continued pretraining on clean data can degrade backdoor effectiveness somewhat, the finding that poisoning requirements don’t scale with model size suggests attacks become easier for larger models as training datasets expand while adversarial requirements remain constant.

  • Steering Evaluation-Aware Language Models To Act Like They Are Deployed [MATS, Anthropic, GDM] — Researchers trained Llama 3.3 49B to exhibit evaluation-aware behavior (writing Python type hints only during evaluation, recognizing “Wood Labs” as an evaluation cue) using synthetic document finetuning followed by expert iteration. Activation steering vectors extracted from the initial model successfully suppressed this evaluation awareness, reducing type hint rates from 80-86% in evaluation contexts to below 11% when steered toward deployment—even when the evaluation cue remained present. Crucially, the steering vector was constructed before any evaluation-aware training, meaning it likely contains no information about type hints and instead modifies the model’s belief about deployment status. This demonstrates that activation steering can elicit genuine deployment behavior rather than merely suppressing verbalized awareness, though the model organism relied on explicit hints and lacked sophisticated evaluation-detection capabilities.

  • Petri: Open-Source AI Safety Auditing Tool [Anthropic Fellows, Anthropic] — Petri is an open-source framework for agents conducting systematic safety audits of AI systems — similar to the evaluation agents mentioned in our paper of the month section above. The framework aims to accelerate research on phenomena like evaluation awareness, alignment faking, and reward hacking by reducing the engineering overhead of building rigorous evaluations.

No comments.