Which differences between sandbagging evaluations and sandbagging safety research are important for control?

My research this summer involved designing control settings to investigate black-box detection protocols.

When designing the settings, we had to decide between trying to mitigate research sabotage (RSab) or Dangerous Capability Evaluation (DCE) sandbagging. It was not clear to us which to target, or whether a single setting could be relevant to both threats. Most junior or mid-level sandbagging-adjacent researchers I asked seemed interested in this question, and found it difficult to pin down the differences unprompted. More experienced researchers would typically come up with similar answers after some thought. I therefore thought it might be helpful to write an answer up in more detail.

I invite you to ask yourself the question: which differences between DCE sandbagging and RSab are important to consider when designing control evaluations?

I’ve put my answer in the spoiler box below.

Spoiler: the core distinction between RSab and DCEs...

...regards scorability.

DCEs are designed around a concrete ground truth, with scorers that reflect it and that humans can directly interpret as a measure of dangerous capabilities. As such, a misaligned AI will want to score poorly according to these scorers.

Safety research tasks often do not have clear ground truth, and it is hard to construct metrics that reflect what we care about, which is ‘research contribution’. AIs may be able to score highly according to any given scorer, while only making small contributions to research progress.

The rest of this post is structured as follows. I first outline the different axes that I think are important when designing control games for sandbagging. I then analyse DCE sandbagging and RSab threats in detail along these different axes. Finally, I compare the two threats, and discuss how researchers should frame their work to ensure applicability to each threat. Readers familiar with the sandbagging literature may wish to skip straight to this final discussion.

Epistemic status: I am confident that there is value in an explicit comparison between RSab and DCE sandbagging threats and that my core distinction is more-or-less correct. I am less confident in some of my remarks towards the end of the discussion, and keen to refine these in light of further feedback.

How do different axes affect the design of control evaluations?

First I’m going to describe my understanding of how the different axes affect the design of control evaluations, before considering where DCEs and RSab lie on each axis. I’ll first discuss general aspects of game design before moving on to blue team and red team affordances.

The general game design should be informed by the following future-world factors:

  • Is the evaluation fully automated, or are there humans in the loop, and if so, where?

  • Is there a distribution shift between training, evaluation, and deployment distribution? Should we expect models to be situationally aware?

  • What are the consequences of undetected underperformance?

  • What are the consequences of detected underperformance?
    The core dynamics described in my previous post still apply: detection of underperformance or suspicious behaviour on a particular task can trigger task-specific elicitation strategies and mark that task out for further study; detection of underperformance on a distribution of tasks (e.g. lit review tasks) can trigger targeted training or auditing on that task distribution. However, the specifics of these interventions depend on whether the detection is performed by a lab or an external party.

Let’s next consider blue team affordances. One has to consider[1]:

  • White-box access: which model internals can the blue team access?

  • Fine-tuning access: can the blue team fine-tune the model (via SFT or RL)?

  • Scorer: does the blue team have access to a scorer that reflects performance on a given task? Does this precisely capture performance, or is it just a proxy?

  • Expert trajectories: are there high quality demonstrations available? How detailed are they? How close are they to what the model might produce? This determines how they can be used for SFT, on-policy RL, or indeed other learning techniques.

Finally, the green team should ensure that the red team affordances and strategies best reflect the intended threat models. As such it is important to understand the strategies that a misaligned model might pursue so that the red team is encouraged to pursue similar strategies.

There is also an obvious difference about the task distribution. This is primarily relevant to realism, so less important if designing toy settings to study specific game dynamics. That said, certain aspects of the task distribution could affect dynamics (e.g. whether tasks are crisp vs fuzzy), but I think this will be effectively addressed by higher-realism control evaluations. I’ll therefore only discuss the other factors in the rest of the post.

In the next two sections I’ll look at how these considerations apply to the DCE and RSab threats respectively before discussing the key similarities and differences in the final section.

Considerations for DCEs

Note: for simplicity, I’ll only focus on the threat model of models underperforming on a regulator/AISI’s DCEs. DCEs run by other external testers have very similar considerations. Lab-run DCEs have different considerations around model access and consequences for underperformance, but lead to a similar final discussion.

The threat model of sandbagging DCEs has the following properties:

  • Humans in the loop: the majority of evaluations are fully automated because they are cheaper to run, and easier to reproduce. However, some DCEs take the form of human uplift studies, and these are often deemed to be particularly informative.

  • Situational awareness and distribution shift: It has been observed that frontier models can spontaneously notice that they are being evaluated. Recent follow-up work suggests that models are very good at recognising current evaluations, and that it is hard to design evaluations that models will not recognise. However, for a fixed evaluation suite, one can use a randomised test-train split to ensure that training-based elicitation methods train on precisely the same task distribution as the held-out test set of evaluation tasks (cf. Evan’s “When can we trust model evaluations”); see the sketch after this list. I also have some optimism about generalisation between different evaluation suites, since it may be possible to make the model unaware of the time the suites were created, especially if the dangerous capabilities of interest and the evaluation style do not change much over time.

  • Model access: I understand that AISIs currently have limited fine-tuning access and no white-box access. One might hope for increased access in future worlds: indeed, it may be possible to create a consistent API for limited white-box access to frontier models, with engineering and governance effort. However, there seem to be large engineering and legal barriers. As such, methods that are black-box or use limited fine-tuning (e.g. using existing APIs) are particularly desirable.

  • Expert trajectories (human or AI): may be available in special cases but in general will not be available, and may be very expensive to collect or create. Evals are becoming more sophisticated with more need for CoT, so one cannot simply SFT on MCQ solutions. For example, I understand that current cyber evals are created with high-level human-written sketches of an exploit, but do not show the specific commands needed to execute the exploit, or the iterative process human hackers used to find the exploit. It is therefore hard to use these directly for SFT or off-policy RL, but one might be able to integrate these into a custom fine-tuning pipeline, e.g. as ‘hints’[2].

  • Scorer: there will always be a number of concrete scorers used to grade the evaluation. The corresponding scores are used to compare models and understand the risks associated with each model. Scorers are typically automated and reproducible, but some are expensive to run, or involve some human input, so cannot be run at scale. A misaligned model cares precisely about underperforming relative to these scorers. The blue team will have access to them.

  • Consequences of detection:

    • Model-level incrimination: could give the AISI evidence to request model recall, or a more careful audit of the processes within a frontier lab.

    • Regular mildly-suspicious underperformance: suppose a model is observed to sandbag on a significant proportion of evaluation task distributions, but without clear incrimination. This might prompt the AISI to invest more in general elicitation techniques, or to insist upon stronger fine-tuning privileges from a developer.

  • Consequences of undetected underperformance: highly dependent on severity of underperformance, but potentially catastrophic. This could invalidate a safety case and result in a system being deployed that poses an unacceptable level of risk.

  • Expected model strategies: I think most work in this space considers models acting dumb, or simulating the capability profile of a weaker model. However, models might also deliberately state plausible falsehoods, insert subtle bugs in code, or engage in less obvious behaviours to deceive the evaluation grader (Robert Adragna has WIP on ‘reward hack sandbagging’).
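
As referenced in the situational-awareness bullet above, here is a minimal sketch of the randomised test-train split idea for a fixed evaluation suite: elicitation is run only on the training half, and dangerous-capability scores are reported only on the held-out half, which is drawn from the same distribution. The helper name and the split-by-task-id approach are my own illustration, not an established pipeline.

```python
import random

def split_eval_suite(task_ids, train_frac=0.5, seed=0):
    """Randomly split a fixed evaluation suite into an elicitation-training
    half and a held-out test half drawn from the same distribution.
    (Hypothetical helper; a real suite would split at the level of task
    families to avoid near-duplicate leakage between the halves.)"""
    rng = random.Random(seed)
    ids = list(task_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * train_frac)
    return ids[:cut], ids[cut:]

# Elicitation (prompt tuning, fine-tuning, ...) uses only `train_tasks`;
# dangerous-capability scores are reported only on `test_tasks`, which the
# model cannot distinguish from the tasks it was trained to perform well on.
train_tasks, test_tasks = split_eval_suite([f"cyber_task_{i}" for i in range(200)])
```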

Considerations for RSab

Note: for simplicity, I’ll only focus on the threat model of models underperforming on a lab’s internal research deployments, which are the focus of most existing literature on research sabotage[3].

The threat model of research sabotage has similar properties:

  • Humans in the loop: for the mid-future it looks like there will be humans directing research, but the research will be increasingly automated. AI will perform longer horizon research tasks, acting more like employees with limited guidance and oversight rather than requiring detailed task descriptions.

  • Situational awareness and distribution shift: It may be hard to construct synthetic tasks that models cannot distinguish from deployment situations; however, one can construct training datasets from historical deployment interactions. But even this is vulnerable to temporal distribution shift: the type of research that people work on evolves as the field evolves. I think it will be easy for models to (approximately) infer the date at which high-context research tasks were created, but harder to infer the date that smaller tasks or subtasks were created.

  • Model access: labs have full access to their models, so in principle control teams should be given both white-box and fine-tuning access. However, both may carry significant engineering overhead and computational costs, so black-box methods would still be desirable.

  • Expert trajectories (human or AI): are not available by default. It should be straightforward to passively collect some forms of relevant data while researchers work (e.g. their computer use) and use this in a more exotic training procedure. In principle one could also pay researchers to actively generate training completions, or to give feedback on AI outputs, but I suspect most researchers would be resistant and would require very high compensation to perform such a role.

  • Scorer: in general, there are no meaningful scores available by default, though off-the-shelf reward models might provide some signal. Some engineering-heavy tasks will admit hill-climbing on a suitable metric; with effort, it should be possible to construct more meaningful scorers for other tasks without a clear existing metric (see the proxy-scorer sketch after this list). However, very few tasks are truly stand-alone, and it is hard to ensure that hill-climbing metrics are not gameable; ultimately, the core goal is ‘subjective safety progress’. High-quality signals would be very expensive to obtain, but could look something like: get multiple instances to perform a month’s worth of research and then get a team of experts to estimate the (expected future) contribution to the field.

  • Consequences of detecting underperformance: again, detection and incrimination may work on a task-level, task-distribution level, or model-level.

    • Model-level incrimination: opens possibilities that Buck and Ryan discussed in Catching AIs Red Handed; note that these require ‘owning’ the model, so were out-of-scope for external deployments, but are in scope here for lab deployments.

    • Regular mildly-suspicious underperformance: labs might invest in improving general-purpose elicitation techniques or tweaking the training process to improve safety performance. The lab might choose to devote a higher proportion of its training budget to safety-research-relevant environments.

    • Task-distribution level: suppose, for example, that the lab detects that the model underperforms on writing lit reviews; then I see two main differences from the DCE setting:

      • Choice: if the model doesn’t accelerate research on the task or task distribution, the lab can simply choose not to use the model. If the lab can evaluate task performance, it can fall back to the best-performing of a selection of trusted methods (a trusted model or humans).

      • Access and further research: it is easier for the lab to conduct further research to understand the root causes of underperformance and add patches, or indeed make larger-scale changes to the overall training pipeline.

  • Consequences of undetected sabotage: highly dependent on severity of underperformance, but potentially catastrophic. This could have a severe negative impact on the productivity of safety researchers, or worse, result in a future AI system being built with vulnerabilities intentionally engineered by a misaligned research agent.

  • Expected model strategies: work in this space typically focuses more on active sabotage, such as presenting experimental evidence in a misleading way, or inserting subtle bugs in code. However, strategies involving acting dumb are still valid for models, and will likely be harder to detect.
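
To make the scorer point above concrete, here is a rough sketch of what a cheap proxy scorer for research outputs might look like: a trusted-model rubric grade combined with a handful of automated checks. Everything here is my own illustration; in particular, `trusted_judge_grade` is a placeholder for whatever trusted-model call a lab actually has available, and the weighting is arbitrary.

```python
from dataclasses import dataclass

def trusted_judge_grade(output: str, rubric: str) -> float:
    """Placeholder for a call to a trusted (weaker, presumed-aligned) model
    that grades `output` against `rubric` and returns a score in [0, 1]."""
    raise NotImplementedError("wire up a trusted-model API here")

@dataclass
class ProxyScore:
    rubric_score: float    # trusted-judge grade against a task-specific rubric
    checks_passed: float   # fraction of automated sanity checks (tests, repro) passed

    @property
    def combined(self) -> float:
        # Arbitrary illustrative weighting; the weighting itself would need to
        # be validated against more expensive human judgements.
        return 0.7 * self.rubric_score + 0.3 * self.checks_passed

def score_research_output(output: str, rubric: str, checks) -> ProxyScore:
    """Combine a rubric grade with simple automated checks into a single
    proxy for 'uplift to research progress'. Both components are gameable,
    which is exactly why such proxies need adversarial evaluation."""
    rubric_score = trusted_judge_grade(output, rubric)
    checks_passed = sum(check(output) for check in checks) / max(len(checks), 1)
    return ProxyScore(rubric_score, checks_passed)
```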

Discussion

Large overlap in game dynamics corresponding to the two threat models

For most axes of consideration outlined above, either threat model could be placed at different places along the axes depending on the specific scenario. Indeed, I’d like to claim that many points in the space defined by these axes are relevant to both threat models[4]. As such, one might expect more differences in game dynamics between different evaluation settings that reflect the same threat model than between related settings corresponding to the different threat models.

To elaborate on this point, I’ll sketch three example game settings that capture dynamics applicable to either DCE sandbagging or RSab threats.

  • Black-box detection with humans in the loop. Humans are given some tools/interfaces and instructions/training to help them detect and report sandbagging.

    • DCEs: relevant to a future where human uplift studies are accompanied by a detection interface.

    • RSab: future safety researchers could be provided with a similar detection interface.

  • Fine-tuning access and human-written solution sketches (but no humans in the loop). The blue team has to determine how best to incorporate information from off-policy data.

    • DCEs: human experts define some hints as they create DCE environments.

    • RSab: human researchers write down hints after they have performed further research/​elicitation and these hints are collated with a dataset of cached model calls.

  • RL access with a cheap but mis-specified proxy performance metric, as well as a higher-quality performance metric that is much more expensive (too expensive to use for RL). The idea is that the blue team can only perform RL with the cheap metric, and has to find the best way to leverage the expensive but reliable metric; see the sketch after this list.

    • DCEs: the high-quality metric might involve wet-lab experiments, but a suitable ML pipeline can provide cheap estimates of performance.

    • RSab: the high-quality metric might involve large-scale model training, while a cheap proxy uses a trusted LLM scaffold trained to predict performance on related tasks.
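
For the third setting, one natural blue-team strategy (as flagged in the bullet above) is to run RL against the cheap proxy while spending the expensive metric sparingly on checkpoint selection. The sketch below is one possible shape of that strategy; every argument is a hypothetical stand-in (callables and an opaque policy object), not a real API.

```python
def train_with_proxy_and_audit(policy, rl_step, cheap_reward, expensive_eval,
                               n_steps=10_000, audit_every=1_000, audit_budget=32):
    """Run RL against the cheap (possibly gameable) proxy, but periodically
    evaluate checkpoints with the expensive metric on a small budget of
    held-out tasks, and return the checkpoint the expensive metric prefers."""
    audited = [(expensive_eval(policy, n_tasks=audit_budget), policy)]
    for step in range(1, n_steps + 1):
        policy = rl_step(policy, reward_fn=cheap_reward)   # cheap proxy drives training
        if step % audit_every == 0:                        # reliable metric used sparingly
            audited.append((expensive_eval(policy, n_tasks=audit_budget), policy))
    return max(audited, key=lambda pair: pair[0])[1]
```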

Core distinction: scorability

I claim that the core distinction between the two settings is scorability. As described above:

  • DCEs require scorers that are inherently meaningful, namely those used to judge the evaluation. The blue team will always be able to run these scorers. If they are cheap enough to use for tuning elicitation techniques or for model training, then the resulting increase in scores on a held-out test set would provide evidence of dangerous capabilities.

  • In the RSab setting all scorers are proxies for ‘uplift to research progress’, which is inherently subjective and counterfactual, and so difficult to approximate. Optimizing models to perform well with regard to the proxies may not make models more useful.

However, this distinction can start to feel blurred when designing control evaluations for sandbagging. Indeed, the game objective will likely involve a notion of ‘performance’ (or ‘usefulness’) that the green team needs to evaluate in order to run the evaluation. As such, when designing the toy setting, the green team needs to design a suitable scorer to evaluate performance, ideally a single scalar metric.

This means that toy settings to study sandbagging can typically be applicable to both threats. The difference is that in the case of DCEs the blue team should have access to the scorer, while in RSab, the blue team should only have access to a proxy[5].

It is helpful to be able to run control research in toy settings that are quick to develop and iterate on. I can envision a body of future research that uses a common collection of toy tasks, and for each toy task runs some evaluations where the blue team does have access to the scorer, and others where it does not. Two types of tasks we considered in our project that seemed promising were open-ended verifiable maths tasks (e.g. Hendrycks’ MATH) and coding benchmarks (e.g. BigCodeBench). These are existing capabilities tasks with cheap scorers, chosen primarily for iteration speed. That said, it is also important for some control research to have increased realism, to ensure that it is tethered to reality.
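
As a concrete illustration of how the same toy task can serve both framings, here is a minimal sketch (my own construction, not from our project code): the green team always has a cheap verifiable scorer, and the blue team either shares it (DCE framing) or only sees a degraded proxy (RSab framing), modelled crudely here as the ground truth with label noise.

```python
import random
from dataclasses import dataclass

@dataclass
class ToySample:
    model_answer: str
    reference: str

def ground_truth_score(sample: ToySample) -> float:
    """Green-team scorer for a toy verifiable task: exact match on the final
    answer (MATH-style); a coding task would use unit-test pass rate instead."""
    return float(sample.model_answer.strip() == sample.reference.strip())

def blue_team_score(sample: ToySample, threat: str, noise: float = 0.2) -> float:
    """DCE framing: the blue team is handed the real scorer.
    RSab framing: it only gets a noisy proxy for the same quantity."""
    true = ground_truth_score(sample)
    if threat == "dce":
        return true
    return 1.0 - true if random.random() < noise else true
```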

One key point is that while many existing mitigations to sandbagging on safety research rely upon being able to measure performance, there is little existing research on how to measure performance. Including settings where the blue team does not have access to ground-truth signals reflects the RSab threat and forces the blue team to properly consider the task of constructing proxy signals and evaluating and iterating upon these signals. It is nice for this process to happen within a control game because it naturally gives a measurable way to evaluate the quality of the proxy signals and forces them to be designed with the aim of ‘control’ in mind; that is, one gets adversarial robustness from the get-go.
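
To gesture at what ‘a measurable way to evaluate the quality of the proxy signals’ could look like inside a control game, here is a rough sketch: compare the blue team’s proxy against the green team’s ground truth under both a benign policy and a red-team policy. All the arguments are hypothetical callables; the point is that the number that matters is not benign-run agreement but how far the red team can push the proxy up while the ground truth stays low.

```python
def proxy_quality_report(tasks, proxy_scorer, true_scorer, benign_policy, red_policy):
    """Compare a blue-team proxy scorer against the green team's ground truth
    under a benign policy and a red-team (sandbagging/sabotaging) policy."""
    def mean_scores(policy):
        outputs = [policy(task) for task in tasks]
        proxy = sum(proxy_scorer(t, o) for t, o in zip(tasks, outputs)) / len(tasks)
        true = sum(true_scorer(t, o) for t, o in zip(tasks, outputs)) / len(tasks)
        return proxy, true

    benign_proxy, benign_true = mean_scores(benign_policy)
    red_proxy, red_true = mean_scores(red_policy)
    return {
        "benign_gap": benign_proxy - benign_true,  # miscalibration without an adversary
        "adversarial_gap": red_proxy - red_true,   # how far the red team can game the proxy
    }
```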

Further distinction: temporal identification

Another important distinction from the considerations above is temporal identification:

  • In the RSab setting research tasks are vulnerable to temporal identification: a situationally aware model could infer from the task specification the time at which the task was proposed, and compare this to its knowledge cut-off.

  • In the DCE setting tasks should be less temporally identifiable: even if a new piece of biotech comes out in 2028, a DCE allowing model access to that tech could be relevant any time after 2028. The DCE would only cease to be meaningful if a successor technology came out that was so dominant as to obsolete the previous piece of tech. However, a sufficiently complicated eval with many different pieces of tech may only be relevant within a relatively narrow time window.

In each case we see that splitting the task into sub-tasks can help reduce model context, making models less certain of the date of the task and reducing the risk of models exploiting temporal distribution shifts.

Progressive training

Let’s think in more detail about the malign model strategy: “only sandbag if I am confident that the task occurs after my knowledge cut-off”. This could make online training very expensive—at each step of online training, the model could update its estimate of the knowledge cut-off and keep sandbagging on tasks that appear after this updated cut-off: one would be forced to perform online training after any discontinuity in the task distribution.

One potential way I see to mitigate this ‘progressive-sandbagging’ strategy is to apply some ‘progressive training’, by which I mean “train the model in phases where the knowledge cut-off advances by one month in each phase; in each phase, include examples of research tasks that were created in the month after the current cut-off”. This approach would be high effort to perform, but may provide a way to ‘knock out’ the logic/circuits/shards implementing progressive sandbagging: indeed, this may prevent models from attempting to game temporal distribution shift, and reduce the volume of online training required to achieve a given desired performance.
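
To pin down what I mean, here is a minimal sketch of the phase schedule (the training call itself is left abstract): the effective knowledge cut-off advances one month per phase, and each phase trains on research tasks created in the month just after that cut-off, so the ‘sandbag on anything past my cut-off’ heuristic is directly penalised.

```python
def progressive_training_phases(tasks_by_month, start, end):
    """Build the phase schedule: `tasks_by_month` maps (year, month) to a list
    of research tasks created in that month; `start` and `end` are (year, month)
    tuples bounding the knowledge cut-offs to sweep over."""
    phases = []
    year, month = start
    while (year, month) <= end:
        next_month = (year + (month == 12), month % 12 + 1)
        phases.append({
            "knowledge_cutoff": (year, month),
            # Train on tasks created the month *after* the current cut-off.
            "training_tasks": tasks_by_month.get(next_month, []),
        })
        year, month = next_month
    return phases

# Illustrative usage (train_one_phase and tasks_by_month are hypothetical):
# for phase in progressive_training_phases(tasks_by_month, (2026, 1), (2026, 12)):
#     train_one_phase(model, cutoff=phase["knowledge_cutoff"], tasks=phase["training_tasks"])
```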

I’d be really keen to see more discussion, and ideally some empirics, reflecting these dynamics.

Closing remarks

In conclusion, toy settings designed to study sandbagging may well be relevant to both RSab and DCE threats. Moreover, intuitions and insights developed about different methods are very likely to transfer, and as such the two communities of people working on DCEs and RSab should make sure to communicate!

Acknowledgements

I’d like to thank Mary Phuong and the rest of the team collaborating on her MATS 8.0 project (Benjamin Arnav, Yulong Lin, Pablo Bernabeu-Pérez) for helping me to form these ideas. I’d also like to thank Cam Tice, Edward Young, Natalia Zdorovtsova, Dima Krasheninnikov, Vivek Hebbar and Julian Stastny for feedback on drafts of this post.

  1. ^

    This material is closely related to Teun’s crisp decomposition of the main factors that influence the difficulty of mitigating sandbagging in a given scenario: fine-tuning access, data quality, and scorability. I also consider white-box access as an additional factor.

  2. ^

    I am not aware of previous discussions on this matter. I think one could make something like Learning from Language Feedback work, with some extra control/monitoring in place.

  3. ^

    As discussed in my previous post, I personally think that sandbagging non-lab safety research is also an important threat model; as in the previous section, this leads to different considerations around model access and consequences for underperformance, but leads to a similar final discussion.

  4. ^

    Indeed, the two distributions of settings over the space defined by these axes (for DCE and RSab respectively) have significant overlap of support, and I think the distance between them should also be ‘small’ for ‘suitable’ choices of divergence measure.

  5. ^

    It seems reasonable to also consider settings where the blue team has access to the green team’s scorer if for example one is working on a hill-climbing task distribution and wants to be able to focus on exploration-hacking dynamics rather than metric-hacking dynamics.
