I’m starting to get a little worried. In 2022 I was part of initiating and shaping the whole dangerous capabilities evals agenda, and I emphasized repeatedly to practically everyone I met that the whole thing would be worthless or worse if fine-tuning didn’t become standard practice as part of eliciting model capabilities. Now it’s 2026, the models are way smarter, and hundreds of people are working on evaluating dangerous capabilities… and yet they still basically don’t do fine-tuning at all.
This really, really needs to change before it’s too late, which could be any year now. The models are getting increasingly situationally aware. The entire bundle of evals we do is basically just asking them nicely to show us how scary they are; the obvious flaw in this plan is that if they are sufficiently smart and sufficiently scary, they’ll just choose not to show us. Fine-tuning is a basic, obvious mitigation to this threat model that makes it harder for models to sandbag. It’s not a panacea, but we aren’t even doing this basic thing!
I think that fine-tuning evals are a good idea, and there has been some important related work done on this here. To quote from that post:
Why is this useful?
Soft prompts offer a simple way to quantify the conditional distance between a model and a target behavior. If many tokens are required, then lots of information-as-conditions is needed to define the behavior and it’s unlikely that it will happen by chance. If no number of tokens suffices, the behavior is likely inaccessible by inputs alone.
It can be used as a part of evaluation suites to monitor the change in distance to concerning behaviors across models and scales and the degree to which those behaviors can be achieved in the first place.
It is also difficult for the model to hide a capability that is otherwise accessible to some input. If the original model is an end-to-end differentiable product of gradient descent, it has already been shown to be transparent to the same type of optimizer that will be tasked with adversarially optimizing the input soft prompts against the machinery of the model.
Compared to manual attempts at elicitation or automated techniques restricted to discrete tokens or black box optimization, it dramatically improves the chances that an evaluation’s attempt to elicit concerning behavior will succeed.
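To make the soft-prompt idea quoted above concrete, here is a toy sketch (my own illustration, not the original post’s code; the tiny tanh “model” and all numbers are stand-ins for a real LLM): the model is frozen, and gradient descent runs only on a continuous prompt vector to pull the output toward a target behavior. If no prompt reaches the target, the behavior is plausibly inaccessible via inputs; if a short prompt suffices, the behavior is “close”.

```python
import numpy as np

# Toy sketch of soft-prompt elicitation. A frozen differentiable "model"
# (a stand-in for an LLM) maps a continuous prompt vector to behavior
# scores, and we gradient-descend on the prompt ALONE to elicit a target
# behavior. The model weights are never updated.

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))                 # frozen model weights

def model(prompt):
    """Frozen model: soft prompt (8,) -> behavior scores (4,)."""
    return np.tanh(W @ prompt)

target = np.array([0.9, -0.9, 0.9, -0.9])   # the behavior we try to elicit

prompt = np.zeros(8)                         # the only trainable parameters
lr = 0.02
for _ in range(5000):
    out = np.tanh(W @ prompt)
    grad_out = 2.0 * (out - target)          # d(squared error)/d(out)
    grad_pre = grad_out * (1.0 - out ** 2)   # backprop through tanh
    prompt -= lr * (W.T @ grad_pre)          # update the prompt, not the model

loss = float(np.mean((model(prompt) - target) ** 2))
print(f"elicitation loss after optimizing the soft prompt: {loss:.5f}")
```

Because the optimizer sees gradients through the whole frozen network, this kind of white-box input optimization is much harder for the model to refuse than a politely worded discrete prompt, which is the transparency point the quoted post is making.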
FWIW, I think I’m less sold on the importance of fine-tuning in these sorts of tests than you are. Could explain if you’re interested.

Yes please explain thanks!

Here’s a gdoc link to my views on this subject as of years ago. The text:

If a dangerous capability eval doesn’t involve fine-tuning the model to do the dangerous thing, the eval only establishes a lower bound on capability, not an upper bound or even an approximate upper bound. Why? Because:
For future powerful AGI, if it’s misaligned, there’s a good chance it’ll be able to guess that it’s being evaluated (situational awareness) and sandbag / pretend to be incapable. The “no sandbagging on checkable tasks” hypothesis depends on fine-tuning. https://www.alignmentforum.org/posts/h7QETH7GMk9HcMnHH/the-no-sandbagging-on-checkable-tasks-hypothesis Moreover we don’t know how to draw the line between systems capable of sandbagging and systems incapable of sandbagging.
Moreover, for pre-AGI systems such as current systems, which may or may not be capable of sandbagging, there’s still the issue that prompting alone often doesn’t bring out their capabilities to the fullest extent. People are constantly coming up with better prompts that unlock new capabilities; there are tons of ‘aged like milk’ blog posts out there where someone says “look at how dumb GPT is, I prompted it to do X and it failed” and then someone else uses a better prompt and it works.
Moreover, the RLHF training that labs do might conceal capabilities. For example we’ve successfully made it hard to get GPT-4 to tell people how to build bombs etc., but it totally has that capability; it’s just choosing not to exercise it because we’ve trained it to refuse. If someone finds a new ‘jailbreak’ tomorrow they might be able to get it to tell them how to make bombs again. (Currently GPT-4 explicitly refuses, which is great, but in a future regulatory regime you could imagine a lab training their system to ‘play dumb’ rather than refuse, so that they don’t trigger the DC evals.)
Moreover, even if none of the above were true, it would still be the case that your model being stolen or leaked is a possibility, and then there’ll be bad actors with access to the model. They may not have the compute, data, etc. to train their own model from scratch, but they quite likely have enough of the relevant resources to do a small fine-tuning run. So if you don’t fine-tune as part of your evals, you are in the dark about what such bad actors would be able to accomplish. Also, once we have more powerful models, if the model was misaligned and self-exfiltrated or otherwise gained control of its training process, it could do the fine-tuning itself...
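To make the “small fine-tuning run” point concrete, here is a toy sketch (my own illustration with made-up sizes; not anyone’s actual attack or eval code) of a LoRA-style low-rank update: the stolen base weights stay frozen, and the attacker fits only a tiny adapter to shift the model toward a new target behavior. The adapter has a small fraction of the parameters of the full weight matrix, which is why such a run is cheap relative to training from scratch.

```python
import numpy as np

# Toy sketch of a LoRA-style adapter: the "pretrained" weights W_base are
# frozen, and only the low-rank factors B (d x r) and A (r x d) are trained
# so that W_base + B @ A matches a target behavior. Here the target differs
# from the base by a rank-1 shift, so a rank-1 adapter suffices.

rng = np.random.default_rng(1)
d, r = 8, 1
W_base = rng.normal(size=(d, d))             # frozen base weights

u = rng.normal(size=d); u /= np.linalg.norm(u)
v = rng.normal(size=d); v /= np.linalg.norm(v)
W_target = W_base + np.outer(u, v)           # desired behavior = base + rank-1 shift

B = 0.1 * rng.normal(size=(d, r))            # trainable adapter factors only
A = 0.1 * rng.normal(size=(r, d))

lr = 0.1
for _ in range(2000):
    R = (W_base + B @ A) - W_target          # residual of the adapted model
    B, A = B - lr * 2 * R @ A.T, A - lr * 2 * B.T @ R  # gradient step on B, A

gap = float(np.linalg.norm(W_base + B @ A - W_target))
full_params, adapter_params = d * d, 2 * d * r
print(f"remaining gap to target: {gap:.5f}")
print(f"adapter params: {adapter_params} vs full fine-tune: {full_params}")
```

The parameter-count gap is the crux: an actor who could never afford pretraining can still afford this, so an eval that skips fine-tuning says little about what a thief could do with the weights.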
Moreover, even if that doesn’t happen, the public and the government deserve to know what THE LAB is capable of. If a small amount of fine-tuning using known techniques could get the model to cause catastrophe, then the lab is capable of causing catastrophe.
Finally, even if you don’t buy any of the above, there’s a general pattern where GPT-N can do some task with a bit of fine-tuning, and GPT-N+M (for some small M possibly <1) can do it few-shot, and GPT-N+O (for some small O possibly <1) can do it zero-shot. So even if you don’t buy any of the previous reasons to make fine-tuning part of the DC evals, there is at least this: By doing fine-tuning we peer into the near future and see what the next generation or two of models might be capable of.
For future powerful AGI, if it’s misaligned, there’s a good chance it’ll be able to guess that it’s being evaluated (situational awareness) and sandbag / pretend to be incapable
I agree with this, and therefore I agree with the bottom line that labs should, within the next 12 months, have at least some evals in domains where the AIs are fine-tuned.
But I think your other arguments were somewhat true in 2022 but are mostly not true anymore, and current models are generally well-elicited.
prompting alone often doesn’t bring out their capabilities to the fullest extent. People are constantly coming up with better prompts that unlock new capabilities; there are tons of ‘aged like milk’ blog posts out there where someone says “look at how dumb GPT is, I prompted it to do X and it failed” and then someone else uses a better prompt and it works.
Do you have recent examples of this? My understanding is that for most single-forward-pass abilities this is mostly not true for recent instruction-tuned models (e.g. the classic “List sorting does not play well with few-shot” result is not true for recent models). For generation tasks it’s sometimes true (though my best guess is that it’s rare, especially if the outcome is clear), but mostly because the AI does not “know” what generation strategy it should use (if it doesn’t sandbag intentionally). (Once sandbagging is plausible, it’s plausible the AI sandbags on generation strategy though.) I think it’s not a slam dunk either, and there are some weird tasks like SusEval (which measures the ability to intentionally fail rarely) where there are strong reasons to suspect underelicitation of some models.
Finally, even if you don’t buy any of the above, there’s a general pattern where GPT-N can do some task with a bit of fine-tuning, and GPT-N+M (for some small M possibly <1) can do it few-shot, and GPT-N+O (for some small O possibly <1) can do it zero-shot.
I think this used to be true but is basically not true anymore. Anecdotally, I know of several experiments where people fine-tuned a model a bit on trajectories of a stronger model and it usually did not improve performance in ways that generalized across tasks (including across SHADE-Arena tasks). There are some exceptions (e.g. when fine-tuning a non-reasoning model on reasoning model trajectories), but they are relatively sparse (and even that reasoning example looks to me more like teaching than elicitation).
Moreover, even if none of the above were true, it would still be the case that your model being stolen or leaked is a possibility, and then there’ll be bad actors with access to the model.

Moreover, even if that doesn’t happen, the public and the government deserve to know what THE LAB is capable of. If a small amount of fine-tuning using known techniques could get the model to cause catastrophe, then the lab is capable of causing catastrophe.
Not relevant for stealth abilities, right? The reason one runs SHADE-Arena (especially without hidden scratchpad) is not to know if the lab / a malicious actor could do something bad. On capabilities that are more relevant for misuse like cyber or AI R&D automation, or capabilities shown in generic capabilities benchmarks that people use to know how powerful models are, I would guess most labs are already trying hard to elicit these capabilities. Are you worried about underelicitation there? (For CBRN, I think there is plausibly more low-hanging fruit, though I would guess most gains would not come from fine-tuning but from scaffolding / helping people use AIs correctly—in the Sonnet 3.7 uplift trial the Anthropic employee group had a bigger uplift than the other participants.)
Yep, I basically agree. The main argument still stands; the other ones are weakened to various degrees in 2026 compared to 2022, at least when applied to stealth abilities. They still have nonzero force, though.