If a dangerous capability eval doesn’t involve fine-tuning the model to do the dangerous thing, the eval only establishes a lower bound on capability, not an upper bound or even an approximate upper bound. Why? Because:
For future powerful AGI, if it’s misaligned, there’s a good chance it’ll be able to guess that it’s being evaluated (situational awareness) and sandbag / pretend to be incapable. The “no sandbagging on checkable tasks” hypothesis depends on fine-tuning. https://www.alignmentforum.org/posts/h7QETH7GMk9HcMnHH/the-no-sandbagging-on-checkable-tasks-hypothesis Moreover we don’t know how to draw the line between systems capable of sandbagging and systems incapable of sandbagging.
Moreover, for pre-AGI systems such as current systems, which may or may not be capable of sandbagging, there’s still the issue that prompting alone often doesn’t bring out their capabilities to the fullest extent. People are constantly coming up with better prompts that unlock new capabilities; there are tons of ‘aged like milk’ blog posts out there where someone says “look at how dumb GPT is, I prompted it to do X and it failed” and then someone else uses a better prompt and it works.
Moreover, the RLHF training that labs do might conceal capabilities. For example, we’ve successfully made it hard to get GPT-4 to tell people how to build bombs etc., but it totally has that capability; it’s just choosing not to exercise it because we’ve trained it to refuse. If someone finds a new ‘jailbreak’ tomorrow they might be able to get it to tell them how to make bombs again. (Currently GPT-4 explicitly refuses, which is great, but in a future regulatory regime you could imagine a lab training its system to ‘play dumb’ rather than refuse, so that it doesn’t trigger the DC evals.)
Moreover, even if none of the above were true, it would still be the case that your model being stolen or leaked is a possibility, and then there’ll be bad actors with access to the model. They may not have the compute, data, etc. to train their own model from scratch, but they quite likely have enough of the relevant resources to do a small fine-tuning run. So if you don’t fine-tune as part of your evals, you are in the dark about what such bad actors would be able to accomplish. Also, once we have more powerful models, if the model was misaligned and self-exfiltrated or otherwise gained control of its training process, it could do the fine-tuning itself...
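The claim that a small fine-tuning run is within reach of actors who could never pretrain from scratch is a matter of scale, and a back-of-the-envelope sketch makes it vivid (all numbers here are rough illustrative assumptions, not measurements of any particular model):

```python
# Back-of-the-envelope: compute for a small fine-tune vs. pretraining.
# All numbers are rough assumptions for illustration only.
PARAMS = 70e9             # hypothetical 70B-parameter model
PRETRAIN_TOKENS = 2e12    # hypothetical pretraining corpus size
FINETUNE_TOKENS = 50e6    # a small fine-tuning dataset


def train_flops(tokens: float) -> float:
    # Standard rough estimate: training cost ~ 6 * params * tokens FLOPs.
    return 6 * PARAMS * tokens


ratio = train_flops(FINETUNE_TOKENS) / train_flops(PRETRAIN_TOKENS)
print(f"fine-tune compute is ~{ratio:.6%} of pretraining compute")

# Orders of magnitude cheaper than training from scratch:
assert ratio < 1e-4
```

On these assumptions the fine-tune costs a few hundred-thousandths of the pretraining budget, which is why "can't pretrain" does not imply "can't fine-tune."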
Moreover, even if that doesn’t happen, the public and the government deserve to know what THE LAB is capable of. If a small amount of fine-tuning using known techniques could get the model to cause catastrophe, then the lab is capable of causing catastrophe.
Finally, even if you don’t buy any of the above, there’s a general pattern where GPT-N can do some task with a bit of fine-tuning, and GPT-N+M (for some small M possibly <1) can do it few-shot, and GPT-N+O (for some small O possibly <1) can do it zero-shot. So even if you don’t buy any of the previous reasons to make fine-tuning part of the DC evals, there is at least this: By doing fine-tuning we peer into the near future and see what the next generation or two of models might be capable of.
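The lower-bound point at the top of this passage can be illustrated with a toy numerical sketch (everything here is made up: the "true capability" ceiling and the elicitation-quality model are assumptions for illustration, not an empirical claim about any model):

```python
import random

random.seed(0)

# Hypothetical ceiling: the score the model would get under full elicitation.
TRUE_CAPABILITY = 0.9


def eval_score(elicitation_quality: float) -> float:
    # Toy assumption: an eval surfaces only the fraction of capability that
    # the elicitation strategy (prompt, scaffold, ...) manages to unlock.
    return TRUE_CAPABILITY * elicitation_quality


# The evaluator tries a handful of prompting strategies of varying quality;
# fine-tuning, which might push quality near 1.0, is not among them.
strategies_tried = [random.uniform(0.3, 0.8) for _ in range(5)]
measured = max(eval_score(q) for q in strategies_tried)

# The best observed score is a valid LOWER bound on capability...
assert measured <= TRUE_CAPABILITY
# ...but by itself it says nothing about how far below the ceiling it sits.
print(f"measured: {measured:.2f}, true ceiling: {TRUE_CAPABILITY}")
```

The eval reports whatever the best strategy tried happened to elicit; the gap to the true ceiling is exactly what a no-fine-tuning eval cannot see.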
For future powerful AGI, if it’s misaligned, there’s a good chance it’ll be able to guess that it’s being evaluated (situational awareness) and sandbag / pretend to be incapable
I agree with this, and therefore I agree with the bottom line that labs should have at least some evals in domains where the AIs are fine-tuned in the next 12 months.
But I think your other arguments were somewhat true in 2022 but are mostly not true anymore, and current models are generally well-elicited.
prompting alone often doesn’t bring out their capabilities to the fullest extent. People are constantly coming up with better prompts that unlock new capabilities; there are tons of ‘aged like milk’ blog posts out there where someone says “look at how dumb GPT is, I prompted it to do X and it failed” and then someone else uses a better prompt and it works.
Do you have recent examples of this? My understanding is that for most single-forward-pass abilities this is mostly not true for recent instruction-tuned models (e.g. the classic “List sorting does not play well with few-shot” result is not true for recent models). For generation tasks it’s sometimes true (though my best guess is that it’s rare, especially if the outcome is clear), but mostly because the AI does not “know” what generation strategy it should use (if it doesn’t sandbag intentionally). (Once sandbagging is plausible, it’s plausible the AI sandbags on generation strategy though.) I think it’s not a slam dunk either, and there are some weird tasks like SusEval (which measures the ability to intentionally fail rarely) where there are strong reasons to suspect underelicitation of some models.
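For concreteness, the few-shot elicitation under discussion just means prepending worked examples to the query; a minimal sketch (the list-sorting task and demo pairs are made up for illustration):

```python
def build_few_shot_prompt(demos, query):
    """Assemble a few-shot prompt: worked input/output pairs, then the query."""
    blocks = [f"Input: {inp}\nOutput: {out}" for inp, out in demos]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


# Hypothetical demos for the list-sorting task mentioned above:
demos = [("[3, 1, 2]", "[1, 2, 3]"), ("[9, 4]", "[4, 9]")]
print(build_few_shot_prompt(demos, "[5, 2, 8]"))
```

The claim above is that for recent instruction-tuned models, adding such demos rarely unlocks capability that a plain zero-shot query misses.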
Finally, even if you don’t buy any of the above, there’s a general pattern where GPT-N can do some task with a bit of fine-tuning, and GPT-N+M (for some small M possibly <1) can do it few-shot, and GPT-N+O (for some small O possibly <1) can do it zero-shot.
I think this used to be true but is basically not true anymore. Anecdotally, I know of several experiments where people fine-tuned a model a bit on trajectories of a stronger model and it usually did not improve performance in ways that generalized across tasks (including across SHADE-Arena tasks). There are some exceptions (e.g. when fine-tuning a non-reasoning model on reasoning model trajectories), but they are relatively sparse (and even that reasoning example looks to me more like teaching than elicitation).
Moreover, even if none of the above were true, it would still be the case that your model being stolen or leaked is a possibility, and then there’ll be bad actors with access to the model.
Moreover, even if that doesn’t happen, the public and the government deserve to know what THE LAB is capable of. If a small amount of fine-tuning using known techniques could get the model to cause catastrophe, then the lab is capable of causing catastrophe.
Not relevant for stealth abilities, right? The reason one runs SHADE-Arena (especially without a hidden scratchpad) is not to know whether the lab / a malicious actor could do something bad. On capabilities more relevant for misuse, like cyber or AI R&D automation, or capabilities shown in the generic capability benchmarks people use to gauge how powerful models are, I would guess most labs are already trying hard to elicit these capabilities. Are you worried about underelicitation there? (For CBRN, I think there is plausibly more low-hanging fruit, though I would guess most gains would come not from fine-tuning but from scaffolding / helping people use AIs correctly: in the Sonnet 3.7 uplift trial, the Anthropic employee group had a bigger uplift than the other participants.)
Yep, I basically agree. The main argument still stands; the other ones are weakened to various degrees in 2026 compared to 2022, at least when applied to stealth abilities. They still have nonzero force, though.
FWIW, I think I’m less sold on the importance of fine-tuning in these sorts of tests than you are. Could explain if you’re interested.
Yes please explain thanks!
Here’s a gdoc link to my views on this subject as of years ago; the text is quoted in full at the top of this thread.