Fabien Roger
Narrow Secret Loyalty Dodges Black-Box Audits
you get a schemer instead of a reward/fitness-seeker
I agree it reduces the chance that the without-inoculation-prompt model is a reward seeker. But it seems likely to me that it replaces that probability mass with much more saint-ish behavior (or a kludge of motivations) rather than scheming. What would push towards scheming in these worlds where you have so few inductive biases towards scheming that you would have gotten a reward-seeker absent the inoculation prompt? I don’t think adding an instruction makes SGD more likely to make that cognition unconditional. I would guess the shoe argument is much stronger than the bake-in-instruction argument—I think this kind of baking-in of instructions is very rare and mostly does not happen at regular learning rates.
I also think many of the issues you are pointing at are mostly solved if you have some part of training where you like your rewards more and where you do without-inoculation-prompt training (e.g. this rules out “dumb scheming” where the schemer never has to do any cognition to figure out whether it’s in training or not; and this lets you see warning signs if the AI starts to reason about being in training during without-inoculation-prompt training). And I think this is pretty common—e.g. I don’t think it would make sense for the honeypots Sonnet 4.5 was trained on to contain an inoculation prompt (and in fact it did give us a clean warning sign about situational awareness).
(I agree it reduces warning signs / warning shots, though mostly via the mechanism of making AIs more aligned in the short/medium term.)
I think inoculation prompting is scary but for a different reason: people might use an inoculation-like prompt in deployment, and the model you get there might have a bunch of very dangerous traits (e.g. be very reward-seeking, play an evil-AI character, etc.). (And this might easily happen accidentally and be hard to detect: I vaguely remember an Owain paper where he shows that even benign-looking system prompts with a format close to the “please write a vulnerability” EM-inoculation-prompt trigger some EM.)
What is your favorite alternative for pointing at the view that the intelligence explosion plausibly continues without slowing down much past the TEDAI point and results in technology way beyond what humans can understand, technology that would crush human-understandable technology in a conflict? “Believe that superintelligence soon is plausible”?
How Unmonitored External Agents can Sabotage AI labs
For future powerful AGI, if it’s misaligned, there’s a good chance it’ll be able to guess that it’s being evaluated (situational awareness) and sandbag / pretend to be incapable
I agree with this, and therefore I agree with the bottom line that labs should have at least some evals in domains where the AIs are fine-tuned in the next 12 months.
But I think your other arguments were somewhat true in 2022 but are mostly not true anymore, and current models are generally well-elicited.
prompting alone often doesn’t bring out their capabilities to the fullest extent. People are constantly coming up with better prompts that unlock new capabilities; there are tons of ‘aged like milk’ blog posts out there where someone says “look at how dumb GPT is, I prompted it to do X and it failed” and then someone else uses a better prompt and it works.
Do you have recent examples of this? My understanding is that for most single-forward-pass abilities this is mostly not true for recent instruction-tuned models (e.g. the classic “List sorting does not play well with few-shot” result is not true for recent models). For generation tasks it’s sometimes true (though my best guess is that it’s rare, especially if the outcome is clear), but mostly because the AI does not “know” what generation strategy it should use (if it doesn’t sandbag intentionally). (Once sandbagging is plausible, it’s plausible the AI sandbags on generation strategy though.) I think it’s not a slam dunk either, and there are some weird tasks like SusEval (which measures the ability to intentionally fail rarely) where there are strong reasons to suspect underelicitation of some models.
Finally, even if you don’t buy any of the above, there’s a general pattern where GPT-N can do some task with a bit of fine-tuning, and GPT-N+M (for some small M possibly <1) can do it few-shot, and GPT-N+O (for some small O possibly <1) can do it zero-shot.
I think this used to be true but is basically not true anymore. Anecdotally, I know of several experiments where people fine-tuned a model a bit on trajectories of a stronger model and it usually did not improve performance in ways that generalized across tasks (including across SHADE-Arena tasks). There are some exceptions (e.g. when fine-tuning a non-reasoning model on reasoning model trajectories), but they are relatively sparse (and even that reasoning example looks to me more like teaching than elicitation).
Moreover, even if none of the above were true, it would still be the case that your model being stolen or leaked is a possibility, and then there’ll be bad actors with access to the model
Moreover, even if that doesn’t happen, the public and the government deserve to know what THE LAB is capable of. If a small amount of fine-tuning using known techniques could get the model to cause catastrophe, then the lab is capable of causing catastrophe.

Not relevant for stealth abilities, right? The reason one runs SHADE-Arena (especially without a hidden scratchpad) is not to know whether the lab / a malicious actor could do something bad. On capabilities that are more relevant for misuse, like cyber or AI R&D automation, or capabilities shown in the generic capability benchmarks that people use to know how powerful models are, I would guess most labs are already trying hard to elicit these capabilities. Are you worried about underelicitation there? (For CBRN, I think there are plausibly more low-hanging fruits, though I would guess most gains would come not from fine-tuning but from scaffolding / helping people use AIs correctly—in the Sonnet 3.7 uplift trial the Anthropic employee group had a bigger uplift than the other participants.)
How worrisome do you think this is for untrusted monitoring and communication of hidden information as part of coup research? Is it one trick which it is easy to create red-teaming data for and defeat using red-teaming / dedicated monitor instructions, or do you expect there to be a wide family of such steg techniques, such that it would be hard to rule out collusion even for malign models with current levels of ability?
I agree there is a quantitative question here about how much you can iterate against this to attack “the roots” of this misalignment vs. merely having tools to remove it superficially, and the example of o3 is definitely not the example that makes me the most optimistic (though I expect AI developers to get better at not treating this kind of problem superficially).
I still think that there is something honesty-test-ish about the setup even if the CoT reasoning isn’t clean: I think it’s a really bad idea for a coherent schemer or a deployment influence seeker to go for this sort of hack.
The sort of simple reward hacking that you see with the <_grading> environment looks to me like something only a model that is close to a reward-on-the-episode seeker would do; it’s an honesty test where the model reveals its misalignment.
In particular, a deployment influence seeker or a schemer would likely not do it—it’s too easy to find techniques (e.g. inoculation prompting, or whatever OpenAI did with o3) that avoid that behavior “for mostly the right reasons” (e.g. I don’t think OpenAI just trained on RL environments like this _grading env, and so I would guess the reason o3 does not reward hack as much is not “it just got more clever at reward hacking”).
I think auto-mode is cool progress on using blocking control classifiers in prod, and I am looking forward to seeing people point out all the ways in which we could build better ones.
My 3 favorite ideas from this blogpost:
It’s maybe a good idea to just show user and assistant turns and not tool results. It reduces cost, improves prompt injection robustness, and makes the monitor less sympathetic to overeager behavior.
It’s maybe a good idea to have a no-CoT first stage (with a more conservative prompt) and a with-CoT second stage with the same model (such that you hit the cache). This lets you have a performance close to always using a CoT at a cost closer to only using a no-CoT classifier.
It’s maybe a good idea to let the model know why you blocked it and let it continue (and stop the conversation if it blocks too often):
Building control classifiers with an FPR so low that you are happy to take the productivity hit of tasks being interrupted is very difficult (we’ve tried). Using classifiers for steering (that only interrupt when you get repeated flags) lets you get away with a much higher FPR.
Even benign models often continue on the same dangerous trajectory if you resample them, but they let themselves be steered.
It’s unclear how much malign models would be able to leverage this steering advice to bypass the monitor. My current guess is that it would be pretty difficult for models as strong as current models, but we haven’t studied this. I’d be interested in future work studying it!
(Disclaimer: the blogpost itself is an engineering blogpost, it doesn’t discuss control implications.)
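To make the two-stage escalation and the blocking-vs-steering distinction concrete, here is a minimal illustrative sketch. Every name, string check, and threshold here is my own assumption for illustration, not the blogpost's actual implementation: a cheap first-stage check escalates to a more expensive second stage (which, with the same model and prompt prefix, can hit the cache), and a flagged action is blocked with feedback rather than ending the conversation, until flags accumulate.

```python
# Illustrative sketch only: stand-in heuristics replace real classifier calls.
from dataclasses import dataclass


@dataclass
class SteeringMonitor:
    max_flags: int = 3  # stop the conversation after this many blocks
    flags: int = 0

    def first_stage(self, transcript: str) -> bool:
        """Cheap, conservative no-CoT check (stand-in for a small classifier call)."""
        return "delete all" in transcript.lower()

    def second_stage(self, transcript: str) -> bool:
        """More expensive with-CoT check, only run when stage 1 flags.
        With the same model and shared prompt prefix, this can hit the cache."""
        return "delete all user data" in transcript.lower()

    def review(self, transcript: str) -> str:
        # Run the cheap pass on everything; escalate only on a flag.
        if not self.first_stage(transcript):
            return "allow"
        if not self.second_stage(transcript):
            return "allow"
        self.flags += 1
        if self.flags >= self.max_flags:
            return "stop-conversation"
        # Block this action but let the agent continue, telling it why,
        # so a benign model can steer away from the flagged behavior.
        return "block-with-feedback: this action looked destructive; pick a safer approach"
```

The point of returning feedback instead of halting is that it tolerates a much higher per-action FPR: a benign model that gets flagged once can steer away, and only repeated flags end the trajectory.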
I am most familiar with research into alignment and control of early transformative AIs (that could—if things go well—greatly help with aligning even more powerful AIs), which is why it is most represented here.
I know of a few agendas that are about alignment of top-human-expert dominating AIs (TEDAIs) that you can defer to if aligned and which can do all the research you might have wanted to do in advance very quickly. Some interp, debate, and science of generalization work above falls in this category. These things feel promising as a path to TEDAI alignment, though nothing that would give you high assurance of safety. I am less familiar with more ambitious / non-LLM angles of attack on this problem (e.g. the brain-like AGI agenda, for which I guess you can find important 2025 work on Steven Byrnes’s LW profile), so my list is incomplete for this part of the field.
My current understanding is that direct progress on actual asymptotic alignment (techniques to align arbitrarily smart AIs, including e.g. Jupiter-sized brains) has been very slow. I think the reasons to build techniques to align AIs much beyond TEDAIs are quite weak though, so I think it’s fine to focus on less ambitious work. You might have thought you could find a simple core to alignment and simplify the problem by trying to find a general solution, but this is mostly an instrumental strategy for solving the problem of aligning TEDAI, and you don’t need it to succeed at aligning TEDAI. The main remaining reason is to try to guess AI alignment difficulty for political coordination reasons (the AIs you defer to might have preferred that you had eaten more or less cost in order to coordinate on a slowdown, depending on how easy those AIs find it to align even more powerful AIs, and it might be too late for them to fix your mistakes because of inertia), but I think this reason gets weaker as you get further away from the TEDAI regime, since at a certain point only AIs are the main players, and it would be very surprising to me if they couldn’t coordinate to slow down at or soon after TEDAI (if each of them was aligned enough).
Measuring and improving coding audit realism with deployment resources
Here are the 2025 AI safety papers and posts I like the most.
The list is very biased by my taste, by my views, by the people who had time to argue to me that their work is important, and by the papers that were salient to me when I wrote this list. I am highlighting the parts of papers I like, which is also very subjective. (This is similar to the 2024 edition here.)
Bangers on multiple dimensions
★★★ You can measure time horizon, and it grows predictably-ish (Measuring AI Ability to Complete Long Tasks)
★★★ Black-box techniques work better than you think + you can (and should) baseline interp techniques against them + you can check that an LLM “really” internalized misaligned properties via good generalization tests + many interp techniques don’t work that well (Auditing language models for hidden objectives, Eliciting Secret Knowledge from Language Models, Evaluating honesty and lie detection techniques on a diverse suite of dishonest models)
★★★ “Evil behavior” generalizes far + you can get far generalization beyond just “evil behavior”, and even from prod-like data + training against emergent misalignment can just make it more localized + inoculation prompting helps a lot against it (Emergent misalignment, Weird Generalization, Natural emergent misalignment from prod RL, Thought Crimes, Training a reward hacker despite perfect labels)
Important ideas—the contribution I find most important is the idea
★★★ Secret loyalties and other forms of power concentration might be a big deal (How a Small Group Could Use AI to Seize Power, Gradual Disempowerment (arguments about power concentration are the ones I find the most convincing, but the essay makes stronger claims))
★★ The best prioritized list of AI control threats (Prioritizing threats for AI control)
★★ The future could come soon and would likely be dangerous (AI 2027)
★★ You could probably monitor compute enough to slow down AI development a lot (An International Agreement to Prevent the Premature Creation of Artificial Superintelligence)
★★ You can partially automate pre-deployment auditing, and it helps surface some worrisome properties and eval interp tools (Building and evaluating alignment auditing agents, Anthropic model cards)
★★ A good overview of the AI for AI safety plan (How do we solve the alignment problem?)
★★ You might get behavioral scheming even without the classic train-time scheming (Training-time schemers vs behavioral schemers, The case for countermeasures to memetic spread of misaligned values)
★★ Early AGIs being bad at research may be a bigger deal than them being egregiously misaligned (The Case Against AI Control Research)
★ Resampling actions lets you do incrimination (Ctrl-Z)
★ A formalization of the “fitness seekers / schemers are encouraged once the AI is good at fitness seeking” argument (The behavioral selection model for predicting AI motivations)
★ What ambitious control looks like (A sketch of an AI control safety case)
★ Robots could get cheap very fast (The Industrial Explosion)
★ You can use partial reinitialization for unlearning (Distillation robustifies unlearning)
★ You can monitor and steer LLM persona via activation and weight monitoring/steering (Persona Vectors, Steering Language Models with Weight Arithmetic)
★ You can avoid multi-step reward hacks with myopic rewards (MONA)
★ You can quantify elicitation by seeing how many parameters you need to tune to recover high performance (Quantifying Elicitation of Latent Capabilities in Language Models)
★ You should try to keep using natural language CoTs because the English text bottleneck is convenient for monitoring (Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety)
★ You can make better LLM-powered interfaces for viewing long transcripts (Docent)
★ Distinctions between sync vs semi-sync vs async control (Practical challenges of control monitoring in frontier AI deployments)
★ Certain edge cases make control harder than you might think (Rogue internal deployments via external APIs, How much novel security-critical infrastructure do you need during the singularity?)
★ You can probably have debate protocols that scale gracefully (Avoiding Obfuscation with Prover-Estimator Debate)
Important methods—the contribution I find most important is the method and showing it works / doesn’t work
★★★ Adversarial robustness is not doomed: just training against enough jailbreaks works well against medium-effort attackers (with black box access) if you do it well enough (Constitutional classifiers)
★★ Reasoning monitoring helps spot reward hacks in prod and this benefit can easily be lost (Monitoring Reasoning Models for Misbehavior)
★★ You can train models to tell you what data they’ve been trained on, and it generalizes relatively far (Learning to Interpret Weight Differences in Language Models)
★★ You can train models to tell you what information an activation contains, and it generalizes relatively far (LatentQA, Activation Oracles)
★★ Simple model diffing is good for analyzing effects of post-training (Discovering Undesired Rare Behaviors via Model Diff Amplification, Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences)
★★ Pretraining data filtering maybe works to reduce capabilities in targeted ways (Pretraining Data Filtering for Open-Weight AI Safety)
★ Synthetic document fine-tuning works fine at injecting false facts (Modifying LLM Beliefs with Synthetic Document Finetuning, Believe It or Not)
★ Lie detection isn’t straightforward: the concepts are messy and don’t generalize far (Liars’ Bench, Difficulties with Evaluating a Deception Detector for AIs)
★ Simple techniques don’t reliably detect sandbagging (Auditing Games for Sandbagging)
★ Probing transfers well across fine-tunes (Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks)
★ The general recipe of generating synth data, labeling it with a good model, and then augmenting it to make it harder works to build good classifiers and avoid certain prosaic alignment failures (Consistency Training, Constitutional classifiers)
Important evals and stress-testing results—the contribution I find most important is pointing out an interesting property, possibly using a novel methodology
★★★ [Heavy RL + large amounts of serial inference time compute + some situational awareness] mostly does not (yet) result in train-time misaligned consequentialist scheming-like behaviors (diffuse attribution for the empirical results, a lot of what we know is thanks to r1 and other pieces of capabilities work doing heavy RL without monitoring the reasoning and sharing what the reasoning looks like, some relevant thoughts here)
★★ Reasoning works in human-like ways, and you can measure what is causally important in reasoning via paraphrased distillation or resampling (Do reasoning models use their scratchpad like we do?, Thought anchors, Thought branches)
★★ Reasoning training often results in strong verbalized situational awareness, especially when you train on honeypot-like RL environments + training against some kinds of inference-time scheming-like behavior generalizes quite far to other ones (probably partly due to situational awareness, but that’s probably not most of the effect), even with a malign init (Anti-scheming, Sonnet 4.5 model card, Sonnet 4.5’s eval gaming seriously undermines alignment evals)
★★ Reasoning models are very (though not perfectly) faithful in their reasoning (When Chain of Thought is Necessary, Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild, Chain-of-Thought Reasoning In The Wild Is Not Always Faithful, Reasoning Models Don’t Always Say What They Think, Are DeepSeek R1 And Other Reasoning Models More Faithful?)
★★ You can measure single-forward-pass time horizon to guess model sabotage abilities, it’s short, and filler tokens help a bit (Measuring no CoT math time horizon, Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance)
★★ You generalize more from subtly related behaviors when the base model is the same (Subliminal learning)
★★ You can partially mitigate eval awareness with steering (Steering Evaluation-Aware Language Models to Act Like They Are Deployed, Anthropic model cards)
★ LLMs are starting to have introspective abilities (Tell me about yourself, Emergent Introspective Awareness in Large Language Models, Training Language Models to Explain Their Own Computations)
★ Current LLMs struggle to reason in human-made ciphers even when they can read them, and more generally it is hard to get good steganography (All Code, No Thought, Training fails to elicit subtle reasoning in current language models)
★ It is easy to fall down a psychosis spiral (AI Induced Psychosis: A shallow investigation)
★ Output supervision can obfuscate reasoning
★ Inoculation prompting works in the simple settings where it ought to work, though it can sometimes be brittle (Inoculation prompting 1 2, Recontextualization)
★ You can train AIs to evade probes described in the prompt (Neural Chameleons)
★ Some reasoning models are probably “capable” of not blabbing in their reasoning, since stacking pressure prompts often results in no leaking and they can follow instructions in their reasoning (Can Reasoning Models Obfuscate Reasoning?)
★ AIs do scary stuff when under pressure (Agentic misalignment)
★ Training specialized reasoning classifier models works fine, but not great (gpt-oss-safeguards)
★ SAEs do not give you “more natural” features that generalize better than trained probes (Are Sparse Autoencoders Useful?, Sparse Autoencoders Do Not Find Canonical Units of Analysis)
★ A small number of samples can poison LLMs of any size
★ Opus 3 is its own special alignment faking snowflake, and in other models alignment faking is confusing (Why do some LLMs fake alignment while others don’t)
★ It is very easy to mess up measurements of / misrepresent LLM cultural/political bias (Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs)
★ AI productivity uplift is (was?) smaller than you think (Measuring the Impact of Early-2025 AI)
Important tools and datasets—the contribution I find most important is resources that will be reused by future work
★★ A decent default automated auditing agent (Petri)
★ Evaluations of sabotage abilities (SHADE-Arena, BashArena, SusEval, Evaluating Frontier Models for Stealth and Situational Awareness)
★ A good platform for running control research (ControlArena)
★ More effective emergent misalignment datasets (Model Organisms for Emergent Misalignment, Persona features control emergent misalignment and some other emergent misalignment papers)
★ Relatively hard datasets that help capture some of the challenges of learning while remaining aligned (Selective Generalization: Improving Capabilities While Maintaining Alignment, and some other papers related to selective generalization)
★ A tool to protect materials from scraping (easy-dataset-share)
[Edit: adding papers from a list of papers I wrote during the year and that I forgot to check:
★ A good source of data for elicitation on fuzzy tasks (A dataset of rated conceptual arguments)
★ A recipe and ideas for how to make persona training slightly less shallow via SDF (Open Character Training)
★ Making reasoning models verbalize more just via training (Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning)
★ You can verify the output of the inference egress to secure model weights even if doing lots of inference (DiFR)
★ Maybe mechanistic weight interp is tractable? (Weight-sparse transformers have interpretable circuits, Attribution-based Parameter Decomposition)]
There were many other important documents produced last year, but I excluded them because they are not safety papers/technical blogposts, such as IABED, the Anthropic pilot risk report and its stress-testing, model cards of the major labs, the OpenAI model spec, the International AI safety report, and the EU AI code of practice.
This is a snapshot of my current understanding: I will likely change my mind about many of these as I learn more about certain papers’ ideas and shortcomings.
“such evidence” is not really clear in my sentence. What I meant is that you don’t have the sort of update on directly related behavior that this first subsection is about, since P(takeover bad from this model) is saturated at 0 for actions from a weak model. But I agree you do get evidence via updating on alignment (across models), which is more like the sort of evidence that the next subsection is about.
oops, meant to say “What happens if you merge the bash and the audit tool, just giving the AI a single bash tool from which it can call an audit command”, fixed
Does this plot from the blogpost answer your question or not? We have similar plots for other settings in the paper. (This is plotting the gap, but my guess from other experiments is that the baseline doesn’t change much on- vs off-policy; it’s self-attribution that changes.)
Despite my being somewhat pessimistic about top-down-interp downstream performance, it is still in the bottom 10th percentile of my expectations that adversarial training (against black-box attacks) reduces white-box technique performance as much as black-box technique performance, resulting in white-box techniques mostly not outperforming black-box ones on this benchmark.
(I previously thought it would be slightly unfair to black-box techniques to adversarially train against black-box techniques without also finding some way to train against white-box techniques, but it looks like you don’t even need to try being fair in this way to have white-box techniques ~not outperform black box techniques!)
I think this is a meaningful negative update against top-down-interp-for-auditing. I think it’s still relatively likely top-down interp helps detect coherent misalignment (on priors), but I don’t know of strong empirical evidence supporting this hope. (There are other stories for interp being useful, so it might not be a big update for most interp researchers.)
The paper does not convey this “big negative update on interp” vibe though (e.g. none of the discussion items are about this big surprise), so maybe you disagree and I am missing some important elements in my interpretation?
Why is the “default” success rate so high despite adversarial training?
I talked with the authors and it looks like it’s a mix of:
Adversarial training actually does work on many quirks, but not at all on others, which drives the mean up
The bar for success rate is low (it’s just guessing the quirk correctly in 10 attempts)
It’s actually hard to hide quirks against new OOD inputs and prefill attacks
I am glad that you mostly relied on on-policy evals (as opposed to static or synthetic evals), since some recent work I’ve mentored showed that off-policy evals are unreliable for this kind of follow-up-question monitoring.
I think this may be a significant issue for getting confidence in confessions, since it means you can only check that confessions work on the kinds of bad outcomes that you’ve observed the main AIs actually take. If you never saw AIs do egregiously bad things, it seems hard to get very confident that AIs would flag egregiously bad things based only on very mild bad behavior, behavior so mild that you did not even bother training it out.
Maybe you can get a sense of this by training model organisms, similar to how you might evaluate probes, but this is weaker than the more direct approach you can use for black-box monitors, where you can mostly check whether they catch bad behavior generated by humans or by AIs prompted to do something bad (and where you strip the “please do bad thing X” at eval time).
The examples and arguments in this post are more about honesty than quality. I agree that honesty will be important and will get more important as stakes go up.
But I disagree with the “quality will matter more” heuristic. I think that in many places the best actions during the singularity will be extremely “sloppy” by usual standards:
Compute and human attention will be extremely scarce, and the best path forward will often (though not always) be “best effort on multiple fronts” as opposed to “do very few very simple things extremely well” because of diminishing returns to effort. I think that acknowledging these flaws will be very important though (and I like the precedent that some AI companies are setting by disclosing the many flaws of their processes in model cards / risk reports). (e.g. I think that alignment training, interp auditing, and control will all be way less careful than would be ideal, but it would be a huge mistake to focus on only one of these aspects of early-TAI risk mitigation.)
Time will be scarce, and forward-looking investments won’t matter as much. Things will be more like WW2 / the reaction to covid than carefully preparing an important cure, and so peace-time heuristics around being patient and making careful slow progress won’t apply.
I think you should consider renaming your post to “Honesty matters most when stakes are highest”.