For future powerful AGI, if it’s misaligned, there’s a good chance it’ll be able to guess that it’s being evaluated (situational awareness) and sandbag / pretend to be incapable
I agree with this, and therefore I agree with the bottom line that labs should have at least some evals in domains where the AIs are fine-tuned in the next 12 months.
But I think your other arguments were somewhat true in 2022 but are mostly not true anymore, and current models are generally well-elicited.
prompting alone often doesn’t bring out their capabilities to the fullest extent. People are constantly coming up with better prompts that unlock new capabilities; there are tons of ‘aged like milk’ blog posts out there where someone says “look at how dumb GPT is, I prompted it to do X and it failed” and then someone else uses a better prompt and it works.
Do you have recent examples of this? My understanding is that for most single-forward-pass abilities this is mostly not true for recent instruction-tuned models (e.g. the classic “List sorting does not play well with few-shot” result is not true for recent models). For generation tasks it’s sometimes true (though my best guess is that it’s rare, especially if the outcome is clear), but mostly because the AI does not “know” what generation strategy it should use (if it doesn’t sandbag intentionally). (Once sandbagging is plausible, it’s plausible the AI sandbags on generation strategy though.) I think it’s not a slam dunk either, and there are some weird tasks like SusEval (which measures the ability to intentionally fail rarely) where there are strong reasons to suspect underelicitation of some models.
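For concreteness, the kind of comparison at stake here (does few-shot prompting elicit more than zero-shot on a simple task like list sorting?) can be run with a tiny harness like the sketch below. The model call is a stub so the example is self-contained; in a real experiment you would swap `stub_model` for an actual LLM API call, and the task/prompt formats are illustrative assumptions, not anyone's actual eval setup.

```python
import ast
import random
from functools import partial

def make_task(rng):
    """One toy list-sorting instance: (input list, gold answer)."""
    xs = [rng.randint(0, 99) for _ in range(6)]
    return xs, sorted(xs)

def zero_shot_prompt(xs):
    return f"Sort this list in ascending order: {xs}\nAnswer:"

def few_shot_prompt(xs, shots):
    # Prepend worked examples before the query, as in a standard few-shot setup.
    blocks = [f"Sort this list in ascending order: {ex}\nAnswer: {ans}"
              for ex, ans in shots]
    blocks.append(f"Sort this list in ascending order: {xs}\nAnswer:")
    return "\n\n".join(blocks)

def stub_model(prompt):
    # Stand-in for a real LLM call (an assumption for illustration): it parses
    # the final query in the prompt and sorts it, so the harness runs without
    # an API key. Replace with a real model client for an actual comparison.
    target = prompt.split("Sort this list in ascending order: ")[-1].split("\n")[0]
    return str(sorted(ast.literal_eval(target)))

def accuracy(prompt_fn, model, tasks):
    return sum(model(prompt_fn(xs)) == str(ans) for xs, ans in tasks) / len(tasks)

rng = random.Random(0)
shots = [make_task(rng) for _ in range(3)]
tasks = [make_task(rng) for _ in range(20)]

zero_acc = accuracy(zero_shot_prompt, stub_model, tasks)
few_acc = accuracy(partial(few_shot_prompt, shots=shots), stub_model, tasks)
```

With a real model behind `stub_model`, a large gap between `zero_acc` and `few_acc` would be evidence for the underelicitation claim; near-identical scores on recent instruction-tuned models would support the reply above.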
Finally, even if you don’t buy any of the above, there’s a general pattern where GPT-N can do some task with a bit of fine-tuning, and GPT-N+M (for some small M possibly <1) can do it few-shot, and GPT-N+O (for some small O possibly <1) can do it zero-shot.
I think this used to be true but is basically not true anymore. Anecdotally, I know of several experiments where people fine-tuned a model a bit on trajectories of a stronger model and it usually did not improve performance in ways that generalized across tasks (including across SHADE-Arena tasks). There are some exceptions (e.g. when fine-tuning a non-reasoning model on reasoning model trajectories), but they are relatively sparse (and even that reasoning example looks to me more like teaching than elicitation).
Moreover, even if none of the above were true, it would still be the case that your model being stolen or leaked is a possibility, and then there’ll be bad actors with access to the model. Moreover, even if that doesn’t happen, the public and the government deserve to know what THE LAB is capable of. If a small amount of fine-tuning using known techniques could get the model to cause catastrophe, then the lab is capable of causing catastrophe.
Not relevant for stealth abilities, right? The reason one runs SHADE-Arena (especially without hidden scratchpad) is not to know if the lab / a malicious actor could do something bad. On capabilities that are more relevant for misuse like cyber or AI R&D automation, or capabilities shown in generic capabilities benchmarks that people use to know how powerful models are, I would guess most labs are already trying hard to elicit these capabilities. Are you worried about underelicitation there? (For CBRN, I think there is plausibly more low-hanging fruit, though I would guess most gains would not come from fine-tuning but from scaffolding / helping people use AIs correctly—in the Sonnet 3.7 uplift trial the Anthropic employee group had a bigger uplift than the other participants.)
Yep, I basically agree. The main argument still stands; the other ones are weakened to various degrees in 2026 compared to 2022, at least when applied to stealth abilities. They still have nonzero force though.