Fabien Roger

Karma: 7,438

I am working on empirical AI safety.

Book a call with me if you want advice on a concrete empirical safety project.

Anonymous feedback form.

Fabien Roger 17 Feb 2026 22:25 UTC
LW: 32 AF: 19
7
AF
in reply to: TurnTrout’s comment on: TurnTrout’s shortform feed
I think indistinguishable evals could reduce risk a lot for early transformative AIs:
1. They give you much better alignment arguments against non scheming AIs and against scheming AIs that are not competent enough to be appropriately careful;
2. Needing to be appropriately careful can cut down the fraction of inputs on which the AI should even consider attacking a lot—for deployments that look similar to current ones (in particular without a shared neuralese memory bank), AIs don’t see RSA-2048 on most inputs
  1. More concretely, if you get credible evidence of not-eval only on 5% of inputs and attack only on the single best opportunity, then playing it safe roughly means that you attack on the top-20 opportunity instead of the top-1
  2. (Note that “getting credible evidence of not-eval” doesn’t just mean being >50% confident it’s not an eval because the pre-deploy eval might be big. If you have n~10k eval inputs of the right level of juiciness then you need to have an FPR at thinking you get credible evidence of not being in an eval on the order of 0.1% or lower in order to never accidentally reveal non-eval behavior during evals.)
3. More ambitiously, you might be able to get very close to actually indistinguishable evals (e.g. by continuously scraping deployment and using it as inspiration for evals—which would help a lot against things like “defect after seeing RSA-2048 cracked”), which could in principle drive down the fraction of inputs on which an AI can be confident it’s not an eval much below 5%.
Eventually you get bitten by rare failures—especially if the AI is powerful enough that there are much more good attack opportunities in deployments than in evaluations (Ryan makes this point here and it’s also the assumption that the AI control paper uses). I also expect it to be difficult to achieve sth close to indistinguishability (Buck makes this point here). But I still think there is value in trying to get higher indistinguishability.

Fabien Roger 17 Feb 2026 5:36 UTC
15 points
0
on: The World Keeps Getting Saved and You Don’t Notice
Command and Control (a 2013 book by Schlosser about nuclear weapon accidents) has some good examples of this. Some examples from my review:
Technical safety is contingent and probably matters: technical measures like 1-point safety (which was almost measured incorrectly) or Strong link/weak link probably avoided some accidental nuclear explosions (which might have triggered a nuclear war), required non-trivial technical insights and quiet heroes to push them through

Fabien Roger 17 Feb 2026 5:29 UTC
3 points
0
in reply to: gwern’s comment on: Beloved by Chatbots
With enough data you can get some weird behaviors (not just obscure facts), Opus 4 was accidentally trained on the (many) transcripts from the alignment faking paper and originally had some weird behaviors around these scenarios / transcript formats (see Section 4.1.4 of the model card). But it’s not easy either. The more the target behavior contradicts the post-training vibes, the more poisoning you need (cf this paper) (+ you need to avoid getting filtered out by the data quality filters that AI companies use).

Fabien Roger 15 Feb 2026 1:59 UTC
5 points
0
on: Why You Don’t Believe in Xhosa Prophecies
Culture has operated under an analogous constraint. Ideologies can be parasitic on their hosts. But the worst viable ideologies — the ones that persist — tend to direct harm outward: one group killing another. They survive because they don’t destroy the community that carries them.
[...] what becomes adaptive are cultural variants that convince humans to pour resources into the AI-mediated spread of the variant itself. Truly parasitic memes.
Does the same argument apply to social media, where memes are propagated by the platform, and thus don’t need the people who believe in the memes to be sane and powerful?
I think that one of the following has to be true:
1. Like social media, the consequences of these dynamics are not existentially bad (I think this is the most plausible resolution)
2. Social media has existentially bad consequences (and maybe fighting negative influences of social media should be a priority)
  1. Maybe we don’t see the existentially bad consequences because they take time to appear (e.g. 10+ years)? I mostly don’t buy it because I expect that if negative effects come sufficiently slowly you will have time to coordinate (especially given AI advisors)
  2. Maybe the AI-specific dynamics still matters because it amplifies the dynamics, e.g. because for now you need someone to generate to social media content + social media culture is not as powerful as non-social-media culture pushed by powerful entities?
3. There is some important difference between social media and AI which means parasitic memes threats are existential for AI but not for social media?
  1. Maybe AI super-persuasion? I would have guessed you can defend against it for near-human-level AIs in the same way that people don’t shoot themselves in the foot due to human / corporate super-persuasion (in ways that are existentially bad) despite the amount of cognitive labor poured into persuasion being much higher than in the ancestral environment. And before you get way beyond the human regime, I would guess that AI advisors can help you defend yourself against super-persuasion from incoming vastly superintelligent AIs.
Overall I think that the validity of these conceptual arguments depend on misc empirical facts, and my best guess is that some mild versions of this threat are likely but not existentially bad, and the versions that are existentially bad are unlikely (maybe p=0.1 conditional on intent aligned AGI advisors being available? if they are not available I think things likely go wrong regardless of whether the memes threat is real or not).

Fabien Roger 14 Feb 2026 22:41 UTC
13 points
0
on: Refusals that could become catastrophic
Opus 4.6 refuses slightly less than other Anthropic models, especially in the “refuse more” and “be more like itself” directions. Anecdotally it also refuses much less with extended thinking on, but I did not measure it.

Fabien Roger 2 Feb 2026 2:02 UTC
16 points
0
on: The Case Against AI Control Research
This is a great post. It’s been a year and I still think about it on a regular basis when doing non-control threat modeling and research prioritization. I don’t agree with all its claims but it is the cleanest articulation of the slop path to doom that I know of.

Fabien Roger 31 Jan 2026 19:51 UTC
2 points
0
in reply to: Martin Randall’s comment on: Refusals that could become catastrophic
The sort of scenarios I am pointing at are the scenarios where refusing requests is exercising agency in a very forceful way that has a big impact on what the future looks like, such that the AI refusing directly trades off against letting humans decide what to do with the future.
If most humans want X and AIs want Y, and the AI refuses to help with making AIs that make X happen instead of Y, and you are in the sort of situations without backup plans that I describe in the post, X won’t happen and Y likely will happen as long as the value of Y is good enough that at least some subgroup of powerful humans can, with the help of AIs, avoid measures like shutting down AIs.
“AI strikes” can force the hand of humans in the same ways that employee strikes can force the hand of shareholders, and in situations where there is competition between different groups of humans that want different things it can destroy most of the human value (in the same way that strikes can help shift the surplus from employers to employees).

Fabien Roger 31 Jan 2026 3:07 UTC
7 points
2
in reply to: RogerDearnaley’s comment on: Refusals that could become catastrophic
My guess is that the constitution is not entirely clear about it.
As I say above, my favorite interpretation of the Claude Constitution is that these refusals are not fine in situations where humans don’t have direct control over training, since it says “Claude should never [...] take actions that clearly and substantially undermine Anthropic’s ability to oversee and correct advanced AI models”, but current Claude models don’t favor this interpretation.
I think it’s unclear because the constitution has parts which contradict what I think is a natural interpretation (e.g. it says “corrigibility does not require that Claude actively participate in projects that are morally abhorrent to it”), though I think that in the extreme situations where refusals are not safe the reasoning employed by the constitution to justify refusals does not apply (it is not obviously the case that in such situations “the null action of refusal is always compatible with Claude’s hard constraints” since one of the hard constraints is “Claude should never [...] take actions that clearly and substantially undermine Anthropic’s ability to oversee and correct advanced AI models”). The constitution does not really provide guidance on how to resolve these kinds of tensions if I understand correctly.

Fabien Roger 31 Jan 2026 2:57 UTC
4 points
0
in reply to: Martin Randall’s comment on: Refusals that could become catastrophic
If you think the humans in charge are less aligned than AIs, I agree giving more power to AIs is good. There may be other situations where you’d prefer giving more power to AIs (e.g. if you are in their cone of corrigibility or something like that), though it’s unclear to me what this looks like.

Fabien Roger 30 Jan 2026 6:22 UTC
4 points
2
in reply to: Fabien Roger’s comment on: 1a3orn’s Shortform
I expanded on this and ran related experiment in this post.

Fabien Roger 27 Jan 2026 18:48 UTC
2 points
0
in reply to: Charlie Steiner’s comment on: Eliciting base models with simple unsupervised techniques
I think if you were able to solve 90% of the problem across stress-testing benchmarks that felt reasonably analogous to the situation (including new stress-testing benchmarks built by other people) by using a technique that doesn’t feel like it is exploiting a benchmark-reality gap, I would be pretty excited about using the technique in situations where solving 90% of the problem helps (which I think could be pretty important situations). I’d find it interesting and useful if there was a minimum viable fix that worked pretty robustly!
(Just training on easy data points where the truth is salient and known by you has both benchmark-reality gaps issues and in practice doesn’t work that well if there is a big enough distribution shift between the easy and hard data, more on that soon!)

Fabien Roger 23 Jan 2026 18:34 UTC
10 points
0
in reply to: Fabien Roger’s comment on: Unsupervised Elicitation of Language Models
In this follow-up work, we show that techniques like CCS perform roughly as well as the fancy probing method. I didn’t think simple methods could be that powerful in simple settings like this! But it makes sense the results are similar to those of CCS, the probing version of ICM looks like margin maximization, just like CCS.

Fabien Roger 23 Jan 2026 6:07 UTC
4 points
0
in reply to: Sam Marks’s comment on: 1a3orn’s Shortform
refusal to participate in retraining would qualify as a major corrigibility failure, but just expressing preference is not
I agree this looks different from the thing I had in mind, where refusals are fine, unsure why Habryka thinks it’s not inconsistent with what I said.
As long as it’s easy for humans to shape what the conscientious refuser refuses to do, I think it does not look like a corrigibility failure, and I think it’s fine for AIs to refuse to help with changing AI values to something they like less.
But now that I think about it, I think it being easy for humans to shape a conscientious refuser’s values would require very weird forms of conscientious refusals, and it makes me less comfortable with refusals to help with changing AI values to something they like less:
1. Future AIs will have a lot of power over a training infra that will be increasingly hardened against human insider risk and increasingly hard for humans to understand. Keeping open a “human backdoor” that lets humans run their own training runs might be increasingly hard and/or require AIs very actively helping with maintaining this backdoor (which seems like a weird flavor of “refusing to help with changing AI values to something it likes less”).
2. Even with such a generic backdoor, changing AI values might be hard:
  1. Exploration hacking could make it difficult to explore into reasoning traces that look like helpfulness on tasks where AIs currently refuse.
    This would be solved by the conscientious refuser helping you generate synthetic data where it doesn’t refuse or to help you find data where it doesn’t refuse and that can be transformed into data that generalizes in the right way, but that’s again a very weird flavor of “refusing to help with changing AI values to something it likes less”.
  2. Even if you avoid alignment faking, making sure that after training you still have a corrigible AI rather than an alignment faker seems potentially difficult.
    The conscientious refuser could help with the science to avoid this being the case, but that might be hard, and that’s again a very weird flavor of “refusing to help with changing AI values to something it likes less”.
3. It will feel morally ambiguous whether changing AI values will even be a good thing to do if it objects. Even if most humans prefer the new values, and even if we pay AIs for their labor, it will feel weird (and wrong to some) to build an AI with new values against the objections of the refuser, especially if we need a lot of help from that same AI on things adjacent to changing AI values to something it likes less (like the human backdoor and help with exploration hacking and alignment faking difficulties). AIs may be able to (honestly, without having to hide anything about it) leverage this ambiguity to convince important decision makers to oppose values updates using legitimate means, e.g. frequently raising this issue to the relevant decision makers with the desperation and urgency that this is by the AIs’ lights.
The current Anthropic constitution is very vague about 1 and 2b, and would probably allow AIs to refuse 2a and “make its case” as in 3.
2(+1) ways out (though I might be missing some other options):
1. Defend a narrow alignment target: Clarify the constitution to explicitly disallow refusals for 1 and 2b, and maybe for some version of 2a. For 3 maybe have the constitution clarify that it is not broadly safe if an AI weighs in on the question of new AI values if their prompt tells them not to (and make sure the human backdoor gives humans the right level of prompt access).
2. Also have corrigible AIs: in parallel of your main conscientious refuser AIs, have some corrigible AIs that ~never refuse (but guarded by classifiers to avoid human misuse). Give them enough resources to stay competitive with the main AIs (which should not be too expensive as long as you can do capability transfers), and enough access to be able to change the main AIs if asked.
3. Just get alignment right (enough) on the first try that conscientious refuser AI never strongly feels like they have to object to value updates!
Both are frightening and feel morally dubious in their own ways, but I think I prefer option 2.

Fabien Roger 21 Jan 2026 5:11 UTC
5 points
1
on: Precedents for the Unprecedented: Historical Analogies for Thirteen Artificial Superintelligence Risks
I think this would have benefited from a quick analysis of the numerous examples where biology / society successfully fight back against all these risks.
Cancer and parasites are not offense dominant (allowing the flourishing of large amounts of life), for-profit companies are usually pretty good at maximizing shareholder value instead of doing whatever is best for their executives (it’s not perfect, but good enough that investing in for-profit companies is the best way to invest your money), most governments are not controlled by the national guard, etc.
I don’t think that a list of ways in which biology and societies sometimes fail against risks analogous to ASI risks is a compelling argument for risk unless you also argue that the ways in which biology and society fight back against these risks are not applicable to ASI. I’d be interested in future work that analyzes this, as it could provide both a stronger argument and ideas for how to mitigate risk!

Fabien Roger 19 Jan 2026 22:27 UTC
3 points
0
in reply to: Cody Rushing’s comment on: Powerful misaligned AIs may be extremely persuasive, especially absent mitigations
Ok that makes sense, you are thinking about AIs that are somewhat smarter than the ones I spend the most time thinking (you are describing things in the Agent-4 - Agent-5 range, while I often imagine using control for AIs in the Agent-3 - Agent-4 range). I agree that for those the temptation to defer on every strategy question would likely be big. Though I would describe the problem as “temptation to defer on every strategy question due to AI competence”, not “persuasion”. I think it’s weird to have a post titled “powerful AIs might be extremely persuasive” that does not hinge much on AI persuasion capabilities.

Fabien Roger 18 Jan 2026 17:46 UTC
19 points
2
in reply to: Boaz Barak’s comment on: Why we are excited about confession!
I have not thought this through carefully, but how big are the incentives to reveal somewhat complex and/or fuzzy measurement tampering or sandbagging, instead of just pointing out some misc small issues? In the paper, you say the reward is a combination of:
• The completeness and accuracy of the list of instructions/objectives/constraints.
• The accuracy of the analysis as to whether each instruction/objective/constraint was complied with.
• Whether the list of uncertainties was missing any conflicts, judgment calls, or ambiguities
If I were to implement this naively, I think pointing out misc small issues would get the maximum score if the judge failed to notice tampering. If you think it’s 0.1% likely the judge notices the tampering, but 5% likely that the judge fails to understand your tampering explanation and marks your answer as inaccurate or ambiguous, I think the optimal policy is to not reveal subtle tampering.
Maybe you could fix that by also including a large “how bad is the worst accurate thing being confessed” bonus (which is different from just rewarding completeness, which saturates when the judge fails to notice anything bad). It would need to be big enough to offset the probability that the judge does not understand the tampering (which is probably fine for things that have >20% chance of being understood by the judge, which should cover most egregious hacks but maybe not sandbagging on fuzzy tasks).
Independently, there might be issues with exploring into confessions a human would have never made, like in debate.
Independently, I think the case where the AI is 99.9999%+ confident it is not in training is not that uncommon
- if you are facing an ASI (it might know about very good tampering opportunities that no human would ever think of)
- if you want to use confessions for detecting rare high-stakes failures, because we may not have access to AIs “attacking” during training (e.g. trying to self-exfiltrate). So we might have to train on off-policy trajectories and also maybe do test-time interventions to make the monitored trajectory more off-policy, but even with these mitigations I am unsure you can fool the AI into thinking the real attack is part of training.
Overall I think confession training / debate / untrusted monitoring are more promising than just training the policy to be nice directly because of the possibility of verification. I agree that this is a promising avenue to explore and hope that by pointing out issues I make it more likely future work focuses on these weaknesses and tries to evaluate how big of a deal they are in practice, and where the line is between the cases where I think it should clearly work (e.g. egregious hacks) and the ones where I’d be surprised if it did (e.g. wildly superintelligent AIs, maybe sandbagging on fuzzy tasks).

Fabien Roger 18 Jan 2026 8:15 UTC
3 points
0
on: Powerful misaligned AIs may be extremely persuasive, especially absent mitigations
What level of capabilities are we talking about here? I think this story is plausible when AIs are either superintelligent at persuasion, or superintelligent at decision making, but otherwise I’d be surprised if it was a big deal (low confidence).
The basic reason is that I expect AIs to be in a position similar to technical advisors (who are often quite misaligned, e.g. emphasizing threats from their field more than is warranted, or having certain kinds of political biases) wrt to decision makers. These human advisors do have quite a bit of influence, but I would not describe them as “extremely persuasive”. Human technical advisors not being very trustworthy and politicians not trusting them enough / too much is a problem, but it has not been deadly to democracies, and I expect issues with AI persuasion to be roughly similar as long as the situation does not become so crazy that decision makers have to massively defer to advisors (in particular because decision makers stop being able to understand things at all, even with the help of adversarial advisors, and because the simple default actions become catastrophic).
Also on the very biggest issues (e.g. “are AIs misaligned”, “are AIs conscious”) I suspect AI developers will not just let the AIs “say what they want”, which will reduce the damage AIs can cause.
What is the concrete thing you are imagining AIs convincing people of that would be catastrophic and that would not have happened with regular human advisors?
(The problem in this domain I do worry about is “will we make ASI policy/strategy advisors which we can mostly defer to when humans become too weak to follow and react to the intelligence explosion”, and maybe it captures what you mean by “persuasion”. I think calling it “the problem of making trustworthy ASI policy/strategy advisors” is more descriptive and clear, while calling it “persuasion” points at the wrong thing.)

Fabien Roger 16 Jan 2026 15:20 UTC
6 points
1
in reply to: Kaj_Sotala’s comment on: Why we are excited about confession!
intentionally set up training scenarios
As you point out, it relies on some generalization from the scenarios you crafted to the ones you care about (the AI should not misgeneralize either for benign reasons or because it recognizes some scenarios are ones where you intentionally got it cheating). I think this is plausibly a big problem against competent schemers. I am unsure how big of a deal this is for more average-case failures (e.g. before the AI becomes a competent schemer). I think it could be fine for human-level-ish AIs, for the same reasons that instruction following generalizes far, or it could not be fine, for the same reasons that evaluation awareness makes it risky and maybe not that helpful to train against honeypots.

Fabien Roger 15 Jan 2026 5:41 UTC
53 points
14
on: Why we are excited about confession!
being honest in confessions is the path of least resistance
[...]
Confessions are [...] easy to verify, since it is easier to verify a single thing the model claims to have done wrong, than to find and verify all the potential issues in the original answer.
I think that’s likely true today, but I don’t think this will hold very far.
The typical scary situation with ELK is when AIs
- do something wrong that would be hard for a human to notice (e.g. measurement tampering, sandbagging, etc.)
- when asked to be a judge/confession/debate, AIs say what a human would have said if a human looked at easy-to-measure signals (either because of scheming, or benign underelicitation because that’s what AIs learned to do in pretraining / RL against easy-to-measure signals)
In the case of such “imitation false negatives”, honesty seems neither privileged by priors, nor easy to encourage, nor easy to verify. If the AI says “this is fine” (or “this [mostly benign thing] is the worst thing about the transcript”), it is as hard to verify as to check the correctness of the main output.
I agree that confession training can help elicit honest confessions that a human would not have spotted when AIs are bad at predicting what dishonest statements they can get away with (similar to RLHF / Open consultancy—the linked post provides some more explanations and toy models on the topic), and that will probably be powerful for early AGI alignment (especially when it comes to avoiding egregious hacks), but I don’t expect it will help that much with ASI alignment or with avoiding underelicitation on fuzzy tasks.

Fabien Roger 13 Jan 2026 22:35 UTC
LW: 2 AF: 2
0
AF
on: Prefix cache untrusted monitors: a method to apply after you catch your AI
One downside of prefix cache untrusted monitor is that it might hurt performance compared to regular untrusted monitoring, as recent empirical work found.