Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions.

Epistemic status: I’ve only spent 3 months working on sandbagging, and have had limited, mixed feedback from established researchers on the ideas in this post. However, I have had positive feedback from most junior researchers I asked, so I think it will be helpful to make these ideas concrete and stimulate public discussion and critique[1].

This post is part of a sequence motivated by my research this summer: see the sequence page for a more detailed discussion of my motivation and an overview of the sequence.

Background

It would be bad if models did not perform to the best of their abilities on important tasks.

Two particular sets of tasks where it is important to mitigate model underperformance are Dangerous Capability Evaluations (DCEs) and AI Safety Research (AISR) tasks.

There have been many recent blog posts and articles discussing how to mitigate underperformance from the viewpoint of control for non-concentrated failures. However, following The case for ensuring powerful AIs are controlled, most of this work is tailored to the case of a misaligned AI intentionally choosing bad actions.

Contributions

In this post I first argue that only a subset of control protocols should target intentional underperformance, and that the adversarial evaluation framework of a control game is also useful for designing protocols that mitigate generic underperformance (or under-elicitation[2]).

I then propose a concrete decomposition of the space of detection methods along two axes: whether the methods target underperformance or incrimination, and whether they operate at the task level or the task-distribution level. This matters because different types of detection naturally trigger different types of interventions.

All underperformance is important

There are many paths to underperformance

Existing work on sandbagging considers a few different mechanisms driving strategic underperformance. Work on research sabotage imagines that the sabotaging AI is fairly ‘coherent’ and either scheming or reward-seeking. By contrast, Teun’s work on sandbagging evaluations considers both underperformance driven by a scheming AI, and underperformance driven by the commercial interests of an AI developer.

I claim that the driver for underperformance might not always be clear.

For concreteness, I’m first going to focus on the threat model of underperformance on safety research, and sketch an example narrative in which underperformance is influenced by a lab’s incentives but not deliberately engineered.

First, let’s consider the lab’s incentives. At a high level, the lab has much stronger commercial incentives to optimise its AI researchers for capabilities research than for safety research. One might therefore expect the lab not to invest in creating meaningful training environments for safety research tasks, or to perform only limited post-deployment monitoring. This exposes the lab to a variety of different mechanisms for AI underperformance.

Now let’s consider these mechanisms. I’m going to suggest the following as alternatives to the ‘scheming’ or ‘reward-seeking’ mechanisms already proposed.

  • Non-strategic generalisation failures: models may underperform their potential due to any kind of generalisation failure. Getting generalisation right is hard, and it is particularly hard to predict temporal distribution shifts in advance. For example, safety research in 2027 might require knowledge of a new library, concept or paradigm that was under-represented in pre-2027 training. Safety research may also emphasise different skills than capabilities research does, e.g. more conceptual reasoning about theories of impact or safety cases. Alternatively, if reward signals for safety research tasks are mis-specified or noisy, models might learn low-quality or high-variance behaviours.

  • Alternative motivations that index on world models: some other possible motivations to underperform seem less strategic than scheming or reward-seeking. For example, a model might learn to behave in a way that seems likely to ‘maximise lab profits’ or ‘look best to the CEO’. Additionally, a model may underperform subtly not because it is being deceptive, but because obvious underperformance would have been ironed out in post-training.

  • Messiness: models are stochastic and contain many different circuits. Indeed, Turner’s shard framing claims that different influential circuits (shards) are reinforced over the training process and compete in different situations: models are not well modelled as following a single goal. It may be that a mixture of different shards leads to underperformance, some following scheming or reward-seeking reasoning, and others based on context-specific heuristics. This is close to Vivek’s description of a partial train-gamer.

As a higher-level remark, talking about ‘intention’ with LLMs is difficult: there is no precise definition of how to measure ‘intention’ for LLMs. AISI recently released a related paper suggesting that safety researchers should be more careful when using such mentalistic language. I think ‘intentional’ or ‘strategic’ underperformance are important intuition pumps, but we need to be aware that these terms are not precise, and make sure that control protocols are robust to this ambiguity.

I’m next going to explain why I think it is also important to mitigate underperformance in these scenarios, and then outline a couple of implications.

Underperformance itself is the problem

When considering either the threats of underperformance on DCEs or AISR, the negative impact comes from underperformance on the task, rather than the intent behind it.

For example, if a model appears safe on a DCE only because of poor elicitation and is then deployed, this could pose an unacceptable risk. Similarly, if a small amount of targeted training could improve human uplift by 50% on a particular AISR task distribution, it would be important for a lab to know.

Less-strategic underperformance is important

This seems to be accepted by researchers working on DCE sandbagging. Indeed, AISI’s existing work on elicitation with current models could be motivated by this framing, though elicitation is also relevant to human-uplift considerations. One reason for this perspective is that it is already important that today’s models do not underperform on DCEs, and today’s models are typically presumed not to be scheming.

However, I think that this is less intuitive to researchers working on low-stakes control (e.g. at Redwood). Such researchers are often thinking about events within a lab near ‘takeoff’, and implicitly assume that models will be fairly coherent. I’d like to challenge each of these points.

  • Within-lab at takeoff: it is also important to mitigate underperformance of AI research agents used by non-lab AI safety researchers. Such researchers form the bulk of the human AI safety research capacity, and already make heavy use of current LLMs. I personally feel disappointed by current LLMs on certain tasks that bottleneck me, and find that they already uplift me far more on general ML projects than on AI safety ones. Much safety work will be done before takeoff, and all of this work is important, not just the work at takeoff itself.

  • Coherence at takeoff: additionally, even at ‘takeoff’ I think there is a good chance that models will be better described by the messy/shard-theoretic terms above than as pursuing a single goal[3]. One reason is that I think big gains could come from further unhobbling (better agent scaffolding, RL on diverse tasks) without big increases in model size.

It is therefore important to develop control protocols that mitigate the non-strategic or shard-like threats described above.

Incrimination makes less sense against less-strategic underperformance

Vivek has previously outlined incrimination and online training as the two core strategies for addressing diffuse threats, and described how incrimination becomes harder against models that take a large number of bad actions that each look plausibly benign. However, these dynamics around incrimination are only relevant to strategic underperformance. It does not make sense to think about incrimination for non-strategic generalisation failures. Even under the shard framing, incrimination doesn’t really ‘make sense’: if a safety team performed a comprehensive circuit analysis and found a circuit implementing strategic reasoning, I’d expect the lab to suppress that particular circuit rather than rethink its training pipeline. Similarly, many of the suggestions Buck and Ryan gave for what to do once an AI is caught red-handed are only valid for models with high levels of coherence and consistency, and are less applicable in the shard framing.

Deception detection techniques will be less effective against less-strategic underperformance

I expect that countermeasures that rely upon detecting strategic reasoning will be much less effective against these diverse failure modes. Indeed, deception probes or chain-of-thought monitors should have no power against non-strategic generalisation failures, and may or may not have any effect against messy shard-like models.

When constructing an overall control pipeline it is important to robustly cover the space of possible threats. The discussion above implies that one should make sure to have detection methods that directly target underperformance, rather than deception or incrimination.

Control evaluations can also help study these threats

As I see it, a control evaluation is an adversarial research exercise that pits a human blue team against a human red team. A third team of humans, the green team, designs the evaluation setting and defines the objectives and affordances for each team. The researchers play multiple rounds of the exercise; in each round the blue team designs control protocols and the red team designs attacks to circumvent them.

Though the original framing of control specifically targeted intentional attacks from misaligned models, I think there is also value in using adversarial research exercises to study mitigations to attacks where intention is less clear.

Adversarial research exercises seem like a good general way to ensure that scaffolds are robust, whether to strategic or non-strategic failures, and they are widely used in the existing AIS literature.

Such adversarial exercises are already used by labs, particularly to mitigate risks of jailbreaking, e.g. at OpenAI. One difference from control evaluations is that in the jailbreaking setting the ‘future-world’ deployment attacks modelled by the game are intentionally generated by humans, whereas in the control setting these attacks are generated by AIs. That said, I’d also expect labs to already perform some sort of adversarial exercise against non-human-driven threats, hoping to catch accidents before they happen (they will be trying to prevent another Bing-Sydney episode!). It seems hard to be confident that a human red team will find all such failures, but labs might as well try rather than risk public backlash: a human red team can only provide more robust coverage than a static suite of scenarios.

Different types of detection trigger different countermeasures

In light of the discussion above, I think it is helpful to differentiate sandbagging detection strategies along the following two axes:

  • Underperformance vs Incrimination: is the strategy trying to identify generic underperformance (see above), or is it specifically targeting intentional underperformance (for incrimination)?

  • Task level vs task-distribution level: is the strategy trying to identify underperformance or incrimination on a particular task (e.g. a specific lit review task), or on a known/fixed/hypothesised task distribution (e.g. all LLM queries made by an AI safety team)?

These different types of detection are implemented in different ways and should be used to trigger different countermeasures:

  • Task-level underperformance

    Example strategies

    • Observing only modest improvement in performance over a weaker trusted model

    • Checking for benign mistakes via algorithmic or LLM-assisted validity or consistency checks

    • Probing for ‘laziness’ or ‘carelessness’

    Appropriate interventions this can trigger

    • Task-specific elicitation (e.g. resampling multiple times and aggregating, or indeed any protocol that uplifts plausibly benign models with inference-time compute)

    • Increase resource allocation to task-level incrimination strategies (below)

    • Add the task to a database of ‘difficult’ tasks to be used for more targeted online training

  • Task-level incrimination

    Example strategies

    • Observing significant underperformance relative to a weaker trusted model

    • Observing suspicious-looking mistakes with an appropriate trusted monitor (e.g. falsehoods that the trusted model knows the untrusted model should get correct)

    • Probing for deception

    • Suspiciously high levels of inconsistency or entropy

    • Interrogation by a trusted model

    • Human audit and interrogation

    Appropriate interventions this can trigger

    • Add the trajectory to a database of ‘incriminating’ trajectories to be used for further alignment research or training

    • Escalation up the incrimination hierarchy

  • Task-distribution-level underperformance

    Example strategies

    • Aggregating signals from task-level underperformance (any aggregation procedure, e.g. the mean, a quantile, or an $l_p$ norm; see the sketch after this list)

    • Comparing to performance on a related task distribution where one is confident that the model is well elicited / not sandbagging

    Appropriate interventions this can trigger

    • Targeted online training on the task distribution

    • Human examination of the task distribution to find key drivers of underperformance (e.g. a new concept required for the tasks that is not well represented in the training data)

  • Task-distribution-level incrimination

    Example strategies

    • Aggregating signals from task-level incrimination (any aggregation procedure)

    • Combining signals with a human-intuition-generated prior for whether sandbagging on this task distribution would help achieve model goals. This prior might be guided by e.g. utility engineering research.

    Appropriate interventions this can trigger

    • Targeted online training on the task distribution to improve alignment (think ‘suppress misaligned shards’)

    • Add the task distribution to a database of ‘sandbagging distributions’ to be used to check whether further alignment training generalises across task distributions

    • Full human audit of the model (e.g. careful incrimination across further task distributions, further study of model internals), i.e. model-level incrimination strategies (see below).
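
To make the detection-plus-aggregation pipeline concrete, here is a minimal sketch in Python. It combines the first task-level underperformance strategy above (observing only modest improvement over a weaker trusted model) with the aggregation step from the task-distribution-level strategies. All scores, margins and thresholds are invented for illustration; a real protocol would need calibrated baselines and uncertainty estimates.

```python
import numpy as np

def underperformance_signal(untrusted_scores, trusted_scores, expected_margin=0.10):
    """Task-level signal: how far the untrusted model's improvement over a weaker
    trusted baseline falls short of what we'd expect (illustrative only)."""
    improvement = np.asarray(untrusted_scores, float) - np.asarray(trusted_scores, float)
    # Positive signal = improvement smaller than the expected margin (suspicious).
    return expected_margin - improvement

def aggregate(signals, method="mean", q=0.9, p=2):
    """Roll task-level signals up into a task-distribution-level statistic."""
    s = np.asarray(signals, float)
    if method == "mean":
        return s.mean()
    if method == "quantile":   # sensitive to a sandbagged tail of tasks
        return np.quantile(s, q)
    if method == "lp":         # l_p norm of the positive part of the signals
        return np.linalg.norm(np.clip(s, 0.0, None), ord=p)
    raise ValueError(f"unknown aggregation method: {method}")

# Toy usage: scores in [0, 1] on ten tasks drawn from one task distribution.
untrusted = [0.55, 0.60, 0.58, 0.90, 0.52, 0.61, 0.57, 0.88, 0.54, 0.59]
trusted   = [0.50, 0.55, 0.56, 0.60, 0.53, 0.58, 0.55, 0.62, 0.52, 0.57]

signals = underperformance_signal(untrusted, trusted)
flagged = [i for i, s in enumerate(signals) if s > 0]  # candidates for task-level interventions
print(flagged, round(float(aggregate(signals, method="quantile")), 3))
```

The choice of aggregation matters: the mean is most sensitive to broad, mild underperformance, while a high quantile or an $l_p$ norm with large $p$ is more sensitive to a small, heavily sandbagged subset of tasks.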

I’d also like to introduce the term ‘sub-distribution identification’ for the process of finding a subset of tasks from a task distribution on which a model appears to underperform or sandbag. This opens the door for more targeted (cheaper) online training. As I see it, the main way to achieve this is to collect a dataset of tasks where underperformance or incrimination is detected by task-level detection strategies and then to run an LLM or human auditor on this dataset to look for common patterns. If a promising hypothesis is obtained, it can be validated using the task-distribution-level detection strategies outlined above, which would then trigger the corresponding interventions.
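
As a rough illustration of this loop (and nothing more), the sketch below groups flagged tasks by hypothetical metadata tags and promotes sufficiently large clusters to candidate sub-distributions. In practice the pattern-finding step would be an LLM or human auditor looking for richer structure; the tags, counts and threshold are all invented for the example.

```python
from collections import defaultdict

def candidate_sub_distributions(flagged_tasks, min_count=4):
    """Group flagged tasks by shared metadata tags and surface clusters large
    enough to be worth validating as candidate sub-distributions (illustrative)."""
    clusters = defaultdict(list)
    for task in flagged_tasks:
        for tag in task["tags"]:   # hypothetical metadata: topic, tool used, date, etc.
            clusters[tag].append(task["id"])
    return {tag: ids for tag, ids in clusters.items() if len(ids) >= min_count}

# Toy usage: tasks flagged by the task-level detection strategies above.
flagged = [
    {"id": 1, "tags": ["new-library", "lit-review"]},
    {"id": 2, "tags": ["new-library"]},
    {"id": 3, "tags": ["new-library", "agentic"]},
    {"id": 4, "tags": ["lit-review"]},
    {"id": 5, "tags": ["new-library"]},
    {"id": 6, "tags": ["new-library"]},
]
print(candidate_sub_distributions(flagged))
# {'new-library': [1, 2, 3, 5, 6]} -- validate with the task-distribution-level
# detection strategies above before triggering targeted online training.
```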

I’d also like to introduce the term ‘model-level incrimination’, which corresponds to an overall tendency of the model. If models are not fully coherent, then it may be possible for a model to sandbag in certain contexts without having a general tendency to sandbag ‘wherever possible’. Moreover, if one acknowledges uncertainty about incrimination, this ‘model-level’ quantity might behave differently, and different levels of confidence might be required to trigger different downstream interventions. For example, it should require a lower level of suspicion to perform task-specific online training than to prompt an AISI to request a model recall.

Acknowledgements

I’d like to thank Mary Phuong and the rest of the team collaborating on her MATS 8.0 project (Benjamin Arnav, Yulong Lin, Pablo Bernabeu-Pérez) for helping me to form these ideas. I’d also like to thank Cam Tice, Edward Young, Natalia Zdorovtsova, Dima Krasheninnikov, Vivek Hebbar and Julian Stastny for feedback on drafts of this post.

That said, all takes and mistakes are my own, and I’m keen to improve these with further discussion!

  1. ^

    In particular, I’ve had some mixed feedback from Redwood researchers: I think I might suffer from common misconceptions, and hope that further discussion of these ideas will help find the cruxes.

  2. ^

    Note: After writing this section, I realised that many of these ideas are close to Jan Leike’s comparison of under-elicitation to scheming and Ryan’s (excellent, long!) post on handling non-concentrated failures with control. I think both posts are great, but that there is value in my alternative framing, which is designed to motivate later sections of this post, and later posts in this sequence.

  3. ^

    E.g. 70% that there will be consensus amongst safety researchers that there are significant situations where messiness is a better model. Hard to make this more concrete. Intuition mostly from my understanding of how effective models already are at research engineering/Olympiad maths, and yet how ‘spiky’ they still seem.
