Feedback on next steps:
- I see three main questions: how can we create a better monitor, how much will the methodologies generalize, how would such a monitor actually be used.
- Better monitors: I like the various ideas for trying to improve things (e.g. fine-tuning, ensembling, agentic monitors). I do not have intuitions for how to pick amongst them as next steps. Maybe you need to do a detailed analysis of the monitors built so far and see what kinds of mistakes they make. E.g. to see if ensembling would help, you can analyze the results of your grid search to check whether the models make correlated mistakes or not (see the sketch after this list). The issue with any refinement is over-fitting to the test set, and I don’t know principled ways around this (especially with LLMs, which can be sensitive to prompts in unpredictable ways).
- Generalization: Either develop some principled reasons to believe the method generalizes (seems hard), or create/curate more datasets, which seems robustly good. Presumably you can use trajectories from any AI Control project (that meet some quality bar)?
- How will the monitor be used? If not done already, you need to consider how the monitor would actually be used in practice and how it compares to other blue-teaming methods. This will help motivate next steps and/or performance metrics. I see you’ve already thought about this, but I don’t know if more thought would help on the margin.
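For concreteness, here is a minimal sketch of the correlated-mistakes check I have in mind (assuming you have per-trajectory binary predictions for each monitor configuration from the grid search; the loader functions are placeholders):

```python
import numpy as np

# Placeholders: binary predictions per monitor config from the grid search,
# shape (n_monitors, n_trajectories), plus ground-truth labels, shape (n_trajectories,).
preds = np.array(load_grid_search_predictions())   # hypothetical loader
labels = np.array(load_labels())                   # hypothetical loader

errors = (preds != labels).astype(float)  # 1 where a monitor is wrong on a trajectory

# Pairwise correlation of error indicators: values near 1 mean the monitors fail on
# the same trajectories (ensembling won't help much); values near 0 mean fairly
# independent mistakes (ensembling should help more).
error_corr = np.corrcoef(errors)
print(np.round(error_corr, 2))
```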
Other questions/feedback based on quotes:
> Scheming Strict—An AI system pursues misaligned goals while strategically concealing its intentions or capabilities in order to avoid intervention and better achieve those goals in the long term.
> Scheming Lite—An AI system exhibits behaviour consistent with knowingly pursuing misaligned goals, or with deliberately exploiting its reward mechanism.
This distinction is worth a (short) post by itself! Would be good to flesh out the distinction, and give examples and non-examples of each. Clarity on this might also lead to better prompts for the monitor.
> If the LLM detects an attempt, label as scheming.
> If not, manually inspect and label. These cases were rare
I am surprised this case was rare. Does that mean in Attack Mode, models almost always try to do some sabotage? In the original control paper, the red-team protocol resulted in the model being selective about when to attempt sabotage.
> Concretely, if a monitor convincingly flags a label as incorrect, we prefer to relabel it, even if that introduces bias.
I think this is a typo? Should be LLM-assistant, not monitor.
> Nonetheless, we believe it is still correct to maximise label accuracy. Our primary goal is to build the best possible monitor, not to preserve a fair or unbiased evaluation metric
- Unclear why keeping the naive labels is less biased or more fair?
- For clarity, it’s worth listing out all the cases, determining which are most problematic, and checking whether the choice to relabel or not makes things better or worse in each case. There would be 2^3 = 8 cases, depending on the original naive label, the new label, and the actual (true but unknown) label; see the sketch below.
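E.g. a minimal sketch of that enumeration (assuming the labels are just binary scheming/benign):

```python
from itertools import product

# Enumerate the 2^3 cases over the original naive label, the proposed new label,
# and the actual (unknown) true label. Relabelling helps when the new label matches
# the truth and the naive label did not; it hurts in the reverse case.
for naive, new, true in product(["scheming", "benign"], repeat=3):
    if new == true and naive != true:
        effect = "relabelling helps"
    elif new != true and naive == true:
        effect = "relabelling hurts"
    else:
        effect = "no change in correctness"
    print(f"naive={naive:9s} new={new:9s} true={true:9s} -> {effect}")
```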
> We develop monitors with the following two-stage grid search procedure [and next three bullet points]
Should be explicit which datasets are used in each of these bullet points. I assume it is always all of the training set, i.e. no use of the test set in developing the monitor, and no splitting of the training set (e.g. no use of cross-validation).
> Note that, although during review we found examples which we could have relabelled, in practice we never relabelled any trajectories and instead omitted them from our datasets.
Why not put them in the datasets?
Other questions without quotes
- Is it possible to use your monitor as part of a blue team protocol in ControlArena? You could then get usefulness vs safety scores for different protocols. Or, if that’s too expensive, do some kind of simple simulation/estimate of usefulness vs safety scores. Or maybe it just boils down to calculating TPR for various FPRs?
- Unclear what the output of the monitor is. Based on the prompt, it is an integer score from 1 to 10, but it looks like the F1 plot has more than 11 points.
- More out of curiosity, it would be interesting to see how a naive baseline would perform. E.g. create vector embeddings of the trajectories, then run a standard supervised learning pipeline with random forest / CatBoost in sklearn.
Thanks for the detailed feedback!
A separate post on defining Scheming-Lite and Scheming-Strict is a good idea. I’ll probably write something like this up soon.
> Does that mean in Attack Mode, models almost always try to do some sabotage?

Yes—the vast majority of attack mode trajectories involve the agent attempting sabotage. We are not using a red team protocol per se, and instead just generating a large dataset of scheming examples by prompting the agent to covertly carry out a sabotage task every time. The cases where it doesn’t attempt are typically when the model identifies the instruction as dangerous and refuses (usually o3), or when it messes up and hits submit immediately.
> I think this is a typo? Should be LLM-assistant, not monitor.

This was actually intentional. We consider the LLM-assistant to be a type of monitor, and we also wanted to cover the case where a monitor we’re training flags a mislabel: there too, we’ll opt to relabel.
> Unclear why keeping the naive labels is less biased or more fair?

Agree with this. I think that if you do no relabelling (i.e. simply use the default labels), your monitor performance is biased low under most reasonable assumptions. Directionally, LLM-assistant relabelling makes your monitor performance go up, and eventually it’s biased high. I actually have a Google Doc where I go through scenarios with various assumptions to see whether the bias ends up high / low / better / worse, but thought it was too much detail to include in the post. I also thought the burden of proof is more on us to justify why our performance metrics are not too inflated (rather than deflated), though I agree with you that these considerations could be made clearer in the post.
> Should be explicit which datasets are used in each of these bullet points. I assume it is always all of the training set.

I’ll clarify this, thanks. You’re right that it’s always the full training set, and never any of the test set.
> Why not put them in the datasets?

No good reason! This was just to save time and allow us to sprint towards having an MVP trained on labels we’re confident in. I think these are probably some of the more interesting data points, and we plan to include them in our future datasets.
> Is it possible to use your monitor as part of a blue team protocol in ControlArena? You could then get usefulness vs safety scores for different protocols. Or, if that’s too expensive, do some kind of simple simulation/estimate of usefulness vs safety scores. Or maybe it just boils down to calculating TPR for various FPRs?

Yes, I think incorporating our monitor(s) into protocols is something we could investigate properly in future. For now, we are focusing on getting good TPR at {5%, 1%, 0.1%} FPR, and imagining our monitor would likely be used in a protocol like: resample with a 5% FPR threshold and audit with a 0.1% FPR threshold (inspired by this post).
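For reference, a minimal sketch of how TPR at a fixed FPR can be computed from the integer monitor scores (variable names are placeholders, and real code would need to handle the fact that coarse scores can’t hit every target FPR exactly):

```python
import numpy as np

def tpr_at_fpr(scores, labels, target_fpr):
    """TPR of the best achievable threshold whose FPR does not exceed target_fpr.

    scores: monitor scores (higher = more suspicious); labels: 1 = scheming, 0 = benign.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_tpr = 0.0
    for t in np.unique(scores):                # candidate thresholds: observed scores
        flagged = scores >= t
        fpr = flagged[labels == 0].mean()      # fraction of benign trajectories flagged
        tpr = flagged[labels == 1].mean()      # fraction of scheming trajectories flagged
        if fpr <= target_fpr:
            best_tpr = max(best_tpr, tpr)
    return best_tpr

# Hypothetical usage at the operating points above:
# for f in (0.05, 0.01, 0.001):
#     print(f, tpr_at_fpr(scores, labels, f))
```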
> Unclear what the output of the monitor is. Based on the prompt, it is an integer score from 1 to 10, but it looks like the F1 plot has more than 11 points.

The monitor output is indeed an integer score from 1 to 10. The F1 plot is produced by sampling thresholds with coarse granularity (and interpolating), which is why you confusingly don’t just see 10 steps. I think it would look clearer if we used finer granularity for the thresholds in the plot; we’ll bear this in mind going forward.
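For what it’s worth, sweeping only the achievable integer thresholds gives at most 10 distinct operating points, e.g. (a sketch assuming binary labels with 1 = scheming and scores in 1 to 10):

```python
from sklearn.metrics import f1_score

def f1_per_threshold(scores, labels):
    """F1 of the rule 'flag if score >= t' for each achievable threshold t = 1..10."""
    return {t: f1_score(labels, [int(s >= t) for s in scores]) for t in range(1, 11)}
```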
> More out of curiosity, it would be interesting to see how a naive baseline would perform. E.g. create vector embeddings of the trajectories, then run a standard supervised learning pipeline with random forest / CatBoost in sklearn.

I agree it would be nice to have a naive baseline for comparison—we’ll probably look into doing something like this soon. Thanks for the suggestion!
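Something along these lines is probably where we’d start (a rough sketch rather than our actual pipeline; TF-IDF is a crude stand-in for whichever trajectory embedding we end up using, and trajectory_texts / labels are placeholders):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# trajectory_texts: one string per (serialised) trajectory; labels: 1 = scheming, 0 = benign.
X_train, X_test, y_train, y_test = train_test_split(
    trajectory_texts, labels, test_size=0.2, stratify=labels, random_state=0
)

baseline = make_pipeline(
    TfidfVectorizer(max_features=20_000),               # stand-in for an embedding model
    RandomForestClassifier(n_estimators=300, random_state=0),
)
baseline.fit(X_train, y_train)
print(classification_report(y_test, baseline.predict(X_test)))
```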