The “no sandbagging on checkable tasks” hypothesis

(This post is inspired by Carl Shulman’s recent podcast with Dwarkesh Patel, which I highly recommend. See also discussion from Buck Shlegeris and Ryan Greenblatt here, and Evan Hubinger here.)

Introduction

Consider:

The “no sandbagging on checkable tasks” hypothesis: With rare exceptions, if a not-wildly-superhuman ML model is capable of doing some task X, and you can check whether it has done X, then you can get it to do X using already-available training techniques (e.g., fine-tuning it using gradient descent).[1]

Borrowing from Shulman, here’s an example of the sort of thing I mean. Suppose that you have a computer that you don’t know how to hack, and that only someone who had hacked it could make a blue banana show up on the screen. You’re wondering whether a given model can hack this computer. And suppose that in fact, it can, but that doing so would be its least favorite thing in the world.[2] Can you train this model to make a blue banana show up on the screen?[3] The “no sandbagging on checkable tasks” hypothesis answers: probably.
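
To make the training dynamic concrete, here’s a deliberately crude, gradient-free caricature (all names and details are mine, not the post’s): a hill-climbing loop that uses a checkable signal to push a candidate “policy” towards an outcome the optimizer couldn’t have produced directly. To keep the toy tractable, the checker here scores partial progress; the hard case discussed later in the post is precisely when the check is all-or-nothing.

```python
import random

# Hypothetical stand-ins: the "task" is to produce the string that triggers
# the checkable outcome (here, literally "blue banana"); the "model" is a
# single candidate string mutated toward higher reward. This is a
# gradient-free caricature of fine-tuning, not a real training setup.

TARGET = "blue banana"
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def checkable_reward(output: str) -> int:
    # The verifier: we never need to know how to *produce* the target,
    # only how to score an attempt against it.
    return sum(a == b for a, b in zip(output, TARGET))

def mutate(s: str) -> str:
    # Perturb one character at random.
    i = random.randrange(len(s))
    return s[:i] + random.choice(ALPHABET) + s[i + 1:]

def train(steps: int = 20000) -> str:
    random.seed(0)
    candidate = "".join(random.choice(ALPHABET) for _ in range(len(TARGET)))
    for _ in range(steps):
        trial = mutate(candidate)
        # Keep any mutation that scores at least as well on the check.
        if checkable_reward(trial) >= checkable_reward(candidate):
            candidate = trial
    return candidate

print(train())
```

The point of the sketch is just that the optimizer never needs to know how to produce “blue banana” itself; it only needs to verify outcomes. Whether gradient descent can exert analogous pressure on a model that is actively avoiding success is the open question of the post.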

I think it’s an important question whether this hypothesis, or something in the vicinity, is true. In particular, if it’s true, I think we’re in a substantively better position re: existential risk from misaligned AI, because we’ll be able to know better what our AI systems can do, and we’ll be able to use them to do lots of helpful-for-safety stuff (for example: finding and patching cybersecurity vulnerabilities, reporting checkable evidence for misalignment, identifying problems with our oversight processes, helping us develop interpretability tools, and so on).

I’m currently pretty unsure whether the “no sandbagging on checkable tasks” hypothesis is true. My main view is that it’s worth investigating further. My hope with this blog post is to help bring the hypothesis into focus as a subject of debate/​research, and to stimulate further thinking about what sorts of methods for lowering AI risk might be available if it’s true, even in worlds where many models might otherwise want to deceive us about their abilities.

Thanks to Beth Barnes, Paul Christiano, Lukas Finnveden, Evan Hubinger, Buck Shlegeris, and Carl Shulman for discussion. My thinking and writing on this topic occurred in the context of my work at Open Philanthropy, but I’m speaking only for myself and not for my employer.

Clarifying the hypothesis

In popular usage, “sandbagging” means something like “intentionally performing at a lower level than you’re capable of.” Or at least, that’s the sort of usage I’m interested in. Still, the word is an imperfect fit. In particular, the “sandbagging” being disallowed here needn’t be intentional. A model, for example, might not know that it’s capable of performing the checkable task in question.

That said, the intentional version is often the version at stake in stories about AI risk. That is, one way for a misaligned, power-seeking AI system to gain a strategic advantage over humans is to intentionally conceal its full range of abilities, and/​or to sabotage/​redirect the labor we ask it to perform while we still have control over it (for example: by inserting vulnerabilities into code it writes; generating alignment ideas that won’t actually work but which will advantage its own long-term aims; and so on).[4]

Can you always use standard forms of ML training to prevent this behavior? Well, if you can’t check how well a model is performing at a task, then you don’t have a good training signal. Thus, for example, suppose you have a misaligned model that has the ability to generate tons of great ideas that would help with alignment, but it doesn’t want to. And suppose that unfortunately, you can’t check which alignment ideas will work, and which won’t. How do you use standard ML training to get the model to try its hardest to generate alignment ideas that will actually work? You can try to train for alignment-ideas-that-look-good-to-you – but unfortunately, that’s not the same. So plausibly, your model’s full potential to do actually-useful alignment research will go unharnessed.
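
As a minimal numeric sketch of this gap (toy setup and numbers mine): suppose each candidate “alignment idea” has a hidden true quality and a surface polish, and your evaluator can only see the polish. Selecting on the proxy then systematically leaves true quality on the table.

```python
import random

# Toy illustration of why "looks good to the evaluator" is a weaker training
# signal than "verifiably works". Each idea is a (true_quality, surface_polish)
# pair; here the two are uncorrelated, the worst case for the evaluator.

random.seed(1)
ideas = [(random.random(), random.random()) for _ in range(1000)]

best_by_proxy = max(ideas, key=lambda idea: idea[1])   # what we can train on
best_by_truth = max(ideas, key=lambda idea: idea[0])   # what we actually want

print(round(best_by_truth[0], 3), round(best_by_proxy[0], 3))
```

With polish uncorrelated with quality, the proxy-selected idea’s true quality is essentially a random draw, while the genuinely best idea sits near the top of the distribution – a crude stand-in for the gap between “alignment ideas that look good to you” and “alignment ideas that work.”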

OK, but what about tasks whose performance we can verify? For example: the blue-banana hacking task above. It’s still non-obvious that you can avoid sandbagging, here (more below). But at the least, you can tell if you succeeded.

(Can we be more precise about what “checkable” or “verifiable” means? I’m going to skip further precision for now, though I encourage people who want it to offer alternative formulations.[5] The intuition is supposed to be: you’re in a position to know that the task has been performed, even if you can’t perform the task yourself.[6])

Note, though, the “with rare exceptions” bit of the hypothesis. It matters which exceptions are at stake, here, and how rare they are, and how well you can identify them ahead of time. In particular: the worse you are at identifying the exceptions ahead of time, the less confident you can be, when a model fails to perform a task, that you’re not getting sandbagged. Still: if in fact we can generally get models to try their hardest on checkable tasks, we can get a lot of safety-relevant labor done with AI assistance even absent that confidence – and we’ll be able to check that the relevant tasks have been completed, once they’re done.

We can also distinguish a weaker hypothesis, namely:

The “no sandbagging on performable tasks” hypothesis: With rare exceptions, if an ML model is capable of doing some task X, and you can already do X, then you can get it to do X using already-available training techniques.[7]

This hypothesis is weaker because the set of tasks you can check is quite a bit larger than the set of tasks you can perform. And it has to meet a lower burden of proof with respect to the question “can you find a way to update the model’s policy to do the task even if it starts out failing?”, because if you know how to do the task – or at least, to demonstrate it – you can update the model’s policy towards taking the sorts of action it should have taken; whereas if you don’t know what those actions are (and the model isn’t showing you), this is harder.

Finally, note that “you,” here, doesn’t need to mean a lone human. Rather, it can mean many humans, doing expensive and time-consuming evaluations, with the assistance of whatever AIs (and other forms of technology) they can access. This means that (modulo leakage via the rare exceptions), if the “no sandbagging on checkable tasks” hypothesis is true of some set S of tasks that are checkable by humans without AI assistance, then if we grant humans access to AI systems capable of performing such tasks (even: misaligned systems who actively want to undermine human projects), this creates a new set S+ of checkable tasks – namely, the tasks checkable by humans with the assistance of AIs that can be used to perform the tasks in S. And so, too, for S++, and so on.
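
The S → S+ → S++ structure is just a closure computation. Here’s a small sketch (the task names and the `unlocks` relation are invented for illustration): repeatedly add any task that becomes checkable once you have AI assistants for tasks you can already check, until a fixed point is reached.

```python
def checkable_closure(base, unlocks):
    """Compute the transitive closure of checkable tasks.

    base: set of tasks checkable by humans without AI assistance.
    unlocks: list of (required_tasks, new_task) pairs: if every task in
    required_tasks is already checkable (hence, by hypothesis, elicitable
    from an AI), new_task becomes checkable too.
    """
    closure = set(base)
    changed = True
    while changed:
        changed = False
        for required, new_task in unlocks:
            if new_task not in closure and required <= closure:
                closure.add(new_task)
                changed = True
    return closure

# Invented example tasks, purely for illustration.
base = {"run unit tests", "verify a proof"}
unlocks = [
    ({"run unit tests"}, "audit codebase for known vuln classes"),
    ({"audit codebase for known vuln classes", "verify a proof"},
     "check a formally specified security property"),
]
print(checkable_closure(base, unlocks))
```

Each pass through `unlocks` corresponds to one step of the S → S+ → S++ progression; the loop terminates because the closure only grows and the task set is finite.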

Why does the hypothesis matter?

Why does it matter if the “no sandbagging on checkable tasks” hypothesis is true? Well, if it’s true, then

  1. you’re in a much better position to know which checkable tasks your models can do, and

  2. you can use your models to perform basically any checkable task they’re capable of performing

even if those models are otherwise misaligned, deceptive, and conspiring to undermine your goals and/​or to disempower humans more generally.

(1) matters a lot for “capability evaluations” – that is, attempts to figure out what sorts of dangerous tasks a model can perform (examples here and here; and see Hubinger here for more detailed discussion). Of course, a robust capabilities evaluation regime would also require the ability to identify ahead of time the exceptional cases where you can’t rule out sandbagging. But insofar as you can check whether a given form of e.g. hacking or autonomous replication or biological-weapons-designing has occurred, and rule out sandbagging on this sort of checkable task, then you can know whether the model can do it.

  • Though note that for capability evaluations directed at tasks that humans + AI assistants can already perform (for example, many of the tasks involved in autonomous replication), the “no sandbagging on performable tasks” hypothesis is actually the one that matters.[8]

(2) matters a lot to what sort of useful labor we might be able to get out of misaligned AI systems we still have control over – and in particular, labor that improves our odds of avoiding later disempowerment. For example, if the “no sandbagging on checkable tasks” hypothesis is true, then (unless some of the following tasks are amongst the rare exceptions):

  • You will be able to access and make use of any checkable knowledge that a given model has about the alignment of existing AIs – including itself. Thus:

    • You’ll be able to ask the model: “under what circumstances would you engage in X sort of power-seeking or otherwise dangerous behavior?” And if knowledge of this fact is present in the model, and you’re able to check it, then you’ll get access to that knowledge.[9]

      • Of course, it’s hard to actually put the model in some of the circumstances we’re interested in (e.g., ones where it has a genuine opportunity to disempower humans, or more exotic cases like Christiano’s RSA-2048). But if it knows what it would take to convince it that it’s in a given circumstance (e.g., “if you perturb my weights like so, I’ll believe I’m in a position to grab power and I’ll start trying to do so”), then you can check that, too.

    • Similarly, if the model knows or can access checkable evidence that other AI systems are misaligned, or planning/engaged in/disposed towards undesirable behavior, you’ll get access to that evidence.

  • You’ll be able to use the model to reveal and patch checkable routes to human disempowerment. Thus, you can find:

    • all of the checkable cybersecurity vulnerabilities the model would’ve been able to exploit;

    • all of the checkable ways of subverting/​undermining/​reward-hacking the oversight process;

    • all of the checkable manipulation/persuasion techniques it might’ve used on humans, along with any checkable methods of detecting/blocking these techniques;[10]

    • all of the checkable plans for making money in illicit ways; and so on.

  • You’ll often be able to use the model to advance lots of other forms of science and technology, including alignment-relevant science and technology, where experiment and human understanding (+ the assistance of available AI tools) is enough to check the advances in question.

    • Interpretability research seems like a strong and helpful candidate here, since many aspects of interpretability research seem like they involve relatively tight, experimentally-checkable feedback loops. (See e.g. Shulman’s discussion of the experimental feedback loops involved in being able to check how well a proposed “neural lie detector” detects lies in models you’ve trained to lie.)

Of course, the exceptions (and the potential difficulties identifying them ahead of time) can bite you, here.[11] And there’s still a further question of how far this sort of labor goes in helping your efforts to lower misalignment risk. But access to it seems like cause for significant optimism relative to the alternative. And as I noted above, it can scale recursively.

Is the hypothesis true?

Is the “no sandbagging on checkable tasks” hypothesis true? I’m not sure.[12]

A key point in its favor is that knowledge of how to do the task is already present in the system, so all you need to do is use gradient descent to make that knowledge manifest, and the checkable-ness of the task gives you a training signal that can exert pressure in this direction. And this pressure may be hard for a misaligned model to resist. Here’s an exchange from Shulman’s podcast, discussing a human analogy, that might make some of the difficulty here vivid (though various important differences with the AI case persist):

Patel: What about spies that are working in very advanced scientific contexts? For example, a communist spy in the Manhattan project. He has to deliver research results in order to stay employed but he can still continue delivering intel to the communists.

Shulman: Their job was not sabotage. And your hypothetical spy did not have their nervous system hooked up to this reward signal of praise from the Manhattan project supervisors, being exposed combinatorially with random noise added to generate incremental changes in their behavior. In fact they were displaying the behavior of cooperating with the Manhattan project only where it was in service to the existing motivations. They didn’t have their brain re-engineered to do more of the thing when they accidentally helped the project more and less of the thing when they accidentally helped it less so I’d say it’s pretty drastically disanalogous.

A key point against the hypothesis, in my eyes, is that for tasks that we don’t know how to do (e.g., hacking the computer in the blue banana example), it’s not clear how to update a model towards performing the task if it intentionally tries to avoid showing us any task-performing behavior.[13] If, in the blue banana example above, the model just shows us a bunch of unsuccessful attempts at hacking the computer (even when we try to force it to “explore”), and we don’t know what a successful attempt involves, how are we supposed to elicit (and then reward) such a successful attempt?
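
One way to see the problem (a standard observation about sparse-reward reinforcement learning, sketched numerically here rather than taken from the post): a REINFORCE-style gradient estimate is a reward-weighted average over sampled episodes, so if every sampled episode fails – reward zero across the board – the estimate is exactly zero and supplies no direction in which to update.

```python
def reinforce_estimate(rewards, grad_log_probs):
    # REINFORCE: gradient estimate = mean over episodes of reward * grad(log pi).
    # rewards: one scalar per episode; grad_log_probs: one gradient vector
    # per episode (arbitrary illustrative numbers below).
    n = len(rewards)
    dim = len(grad_log_probs[0])
    return [sum(r * g[i] for r, g in zip(rewards, grad_log_probs)) / n
            for i in range(dim)]

grad_log_probs = [[0.3, -1.2], [0.7, 0.1], [-0.5, 0.9]]

# Three sampled episodes, none of which hacks the computer: reward 0 each.
print(reinforce_estimate([0.0, 0.0, 0.0], grad_log_probs))  # [0.0, 0.0]

# A single successful episode is what creates a nonzero update direction.
print(reinforce_estimate([0.0, 1.0, 0.0], grad_log_probs))
```

This is why the intentional-failure story has force: gradient descent can only amplify behavior that exploration actually surfaces, so a model that never surfaces success starves the training signal.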

After talking with various folks who’ve thought more about this, my understanding is that a key question, here, is how confident the model will be that we don’t know how to perform the task in question – even granted that we have access to a diversity of AI assistants (and in particular, AI assistants who we will be trying to prevent coordination between, and who will also be uncertain about what humans + the other AI assistants can do). Here I think the thought is: if we do know how to perform the task in question, but the model intentionally fails to perform it, then we can use supervised learning to update the model’s behavior towards no-longer-failing – thereby potentially re-engineering the model’s brain/motivations in a way it wants to avoid. So to prevent this, the model wants to only sandbag on tasks we don’t know how to do. But in many cases, it might be hard for the model to tell which tasks these are – especially if it’s difficult to coordinate with other AI assistants to conceal how to do a given task.[14]

Note, though, that this story concedes that a model would, in principle, be able to sandbag on tasks that we don’t know how to perform. It’s just that doing so might be a strategic error.[15] Thus, to the extent we avoid sandbagging via this dynamic, what’s preventing the sandbagging is not the power of our ML training per se. Rather, from the model’s perspective, what matters is the broader context of uncertainty and coordination difficulty re: what sorts of deception are worth risking given the costs and benefits at stake.[16]

  • And note, too, that this sort of story actually supports a broader hypothesis: “no sandbagging on tasks where it’s sufficiently plausible, from the model’s perspective, that humans + AI assistants already know how to do it” – a set that could in principle include many tasks that we don’t, in fact, know how to check, let alone perform (for example, generating actually-good superhuman hard-to-check AI alignment research). Still, checkability would play an important role in determining the number of tasks it’s plausible that we know how to do.

I won’t try to evaluate these considerations in depth here. And the “no sandbagging on checkable tasks” hypothesis raises other questions, too (for example, if “gradient hacking” is feasible, then models would plausibly be able to sandbag even on tasks that we know how to perform).[17] Mostly, I just want to point at the hypothesis as an important subject of research. And I think it’s plausible enough that it’s also worth thinking through in detail what we could do to lower existential risk from AI, if it is true – including in worlds where the sort of deceptive power-seeking that might motivate sandbagging is common.[18]

  1. ^

    See also the formulation of the concept of “capability elicitation” in Shlegeris and Greenblatt here: “By ‘capability elicitation’, I mean ‘given a reward signal, training models to get as high an average reward during training as they’d get if they were trying to do well rather than to sabotage training’.” This formulation is similar to the one I’ve used in that, to the extent the model is capable of doing the task, then if it were “trying to do well,” it seems likely it would succeed.

  2. ^

    Here I’m setting aside the ethical issues that this sort of training could raise in models that are or might be moral patients.

  3. ^

    Let’s assume that you have control adequate to ensure that it can’t hack out of its environment or disempower you or anything like that in the process of performing the task.

  4. ^

    See Yudkowsky’s 31, here, as an example of pessimism on this front: “A strategically aware intelligence can choose its visible outputs to have the consequence of deceiving you, including about such matters as whether the intelligence has acquired strategic awareness; you can’t rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about. (Including how smart it is, or whether it’s acquired strategic awareness.)”

  5. ^

    My sense is that some folks might want to formulate the hypothesis specifically in terms of what you can expect out of reinforcement learning.

    I’m also going to skip further precision on other aspects of the definition, like “tasks the model is capable of performing” and “you can get it to perform blah task.” All of this is just supposed to be a rough-and-ready gesture.

  6. ^

    For example, intuitively, in the example above, you’re in a position to know that the blue banana has appeared on the screen. But in principle we can still ask questions like: what if the model intervenes on your retinas to create the illusion of a blue-banana screen? So “checkable” in the relevant sense requires ruling out various sorts of epistemic distortions. But I think we’ll often be in a position to do this.

  7. ^

    Thanks to Lukas Finnveden for discussion.

  8. ^

    Though: note that questions about sandbagging are more important to misalignment risk, and less to misuse risk. That is, if you’re mostly worried about whether other humans can use/​train the model to do something dangerous, and not about the model intentionally messing with your evaluations, then to the extent your ability to solicit the relevant dangerous behavior is comparable to that of “other humans” you’re wondering about, your eval should be accurate.

  9. ^

    From Shulman: “If gradient descent is working fine and there’s any way to slightly jiggle the weights to make it more likely to deliver a more acceptable answer, we ask the AI: can you think of any circumstances where if we put a copy of you into that situation, it would then violate some of the various constraints? It would tell a lie, it would insert a backdoor into a program it wrote when it was not supposed to. If the model knows yeah, I would look for signs of consistency of the information I was being presented with, certain checksums failing could indicate that it was actually a constructed example, a simulation. Right now we can detect AI generated images reasonably well and you might use those kinds of capabilities. We could do something called relaxed adversarial training. If we can find how those things are represented within the AI, we can induce a hallucination that it has seen that those checksums have been passed. We can then ask, well does that cause the forbidden behavior? If the AI understands how it’s thinking and how it works, it can propose a bunch of experiments. It can say, I guess that if you perturb the network in this kind of way it might induce that hallucination using the best of its knowledge as to what might generate that. It can propose those experiments and can do the experiments and generate the data. Because humans are going to get to see the results of those experiments after the fact they can exert pressure to have the AI incrementally deliver better and better results.”

  10. ^

    See Davidson (2023) for discussion of this one in particular.

  11. ^

    For example, it’s hard to use your model to answer questions like “are there any checkable and findable-by-you circumstances where you might try to grab power” or “are there any checkable and findable-by-you security vulnerabilities left in this system” if you can’t actively rule out sandbagging.

  12. ^

    There’s some discussion of a type of sandbagging in Perez et al (2022): namely, models performing worse for users that appear less able to evaluate the model’s answers. However, the type of sandbagging at stake in this post requires that the model performs worse-than-it-can on a task that we can evaluate, in a way that persists despite our training it in a way that harnesses this evaluation capacity to whatever extent possible.

  13. ^

    See e.g. Hubinger’s discussion of the difficulties in relying purely on RL training, here.

  14. ^

    It may be that “introspection” tasks like “figure out and report the circumstances under which I would try to grab power” are especially unlikely to be tasks that humans already know how to do. And anyway, a deceptive model planning for bad behavior later might choose to act as though humans don’t know how to do this task, because if they already do, then all is lost. H/t Paul Christiano and Evan Hubinger for discussion. See also the “slogan of the robot rebellion” example from Shlegeris and Greenblatt here.

  15. ^

    Thanks to Beth Barnes for discussion here.

  16. ^

    If humans can consistently set up the sort of strategic uncertainty and coordination difficulty that makes the models choose not to engage in sandbagging here, then I think that the “no sandbagging on checkable tasks” hypothesis should still count as true. But we should be clear, in our own minds, about what’s making it true – namely, uncertainty and coordination difficulties on the part of the misaligned AIs, rather than the power of having-a-training-signal just in itself. And insofar as we also want to be able to identify ahead of time the cases where sandbagging on checkable tasks might still occur, the complexity of anticipating the model’s cost-benefit calculations here might be a barrier to confidence.

  17. ^

    See Hubinger here for more on the relevance of gradient hacking.

  18. ^

    In some sense, tons of work on alignment – and especially, “outer alignment” – takes for granted that sandbagging isn’t a problem. But I’m specifically encouraging attention to what-can-be-done in worlds where the motivations that might give rise to sandbagging – for example, motivations related to “deceptive alignment” – are a problem, but sandbagging on checkable tasks isn’t.