This is cool work! Some notes and questions:
My biggest concern is that your approach to identifying reward hacks in test trajectories may be somewhat circular: the scores produced by the monitors under evaluation may correlate strongly with the scores of the monitors that you use for labeling. This makes the high scores assigned by the tested monitor relatively unsurprising, and conversely, there might be trajectories where the attacker exhibits subtle reward hacking that all monitors systematically miss. These may be the most important trajectories to test your monitors on since we expect more capable models in the future to be highly subtle, but they won’t be assigned a scheming label in the first place. I’m unconvinced that using different prompts for the labeling monitors and the general monitors solves the problem, since subtle trajectories may be hard to detect for any LLM, regardless of the prompt. Given that you’ve already tried and discarded the approach of using a ground-truth classifier, I don’t immediately have good solutions to this problem in mind. However, I know someone who is currently building a reward hacking dataset where ground truth detection is possible (though, afaik, in settings that are simpler than SWE-bench) and can connect you if that would be useful.
Reward hacking, alignment faking, and pursuing a stealthy side goal are all quite different issues to me and I’m unsure about whether it’s useful to use the label ‘scheming’ for all of them, as well as about whether it’s more useful to use a single monitor for detecting all of these behaviors or a separate monitor for each of them. Do you think there are good reasons to think of them as a single category?
Unlike Julian, I’m excited for you to keep studying monitors that have access to the CoT: I think that CoT monitoring has so far been understudied compared to action-only monitoring, and at least for the next couple of years it seems reasonable to assume that monitors will keep having access to a legible CoT. That said, I also think that your blue team may currently have unrealistic affordances, given that we expect future trusted models to be less capable than our most capable models. I wouldn’t necessarily change research priorities because of this—there are other groups working on stress-testing monitors, meaning that we need someone to also work on blue-team strategies that make stress-testing harder—but I would maybe include a weaker model like GPT-4.1-mini in the set of monitors that you train to illustrate the difference between more and less capable monitors.
Two other questions I’m just curious about:
What models are you currently using to generate the test trajectories on SWE-bench?
Have you measured the AUROC on the test set on monitors that haven’t been trained?
Please don’t take the above points in a discouraging way—I like your general direction a lot! I’m working on similar things, so I’d be keen to discuss follow-up work and share findings in the future.
Thanks for the feedback! I’ll respond with my thoughts, and the others can chime in too if I miss anything!
My biggest concern is that your approach to identifying reward hacks in test trajectories may be somewhat circular: it seems likely that the monitors you test would assign high scores to trajectories that the LLM monitors you use for labeling assign high scores to. …
Agree with the points here in general! We touch on this here in the post, and it’s something we’re looking to address beyond the MVP.
A couple of mitigating points:
The suite of prompts used for labelling is designed to target individual (though overlapping) behaviours when labelling our datasets. We give these classifiers full knowledge of the environment during labelling (e.g. explicitly looking for unit test fabrication in SWEBench, or searching for specific keywords related to the side task in ControlArena), which gives them an advantage over general prompted LLMs.
We originally labelled SWEBench trajectories based solely on a prompt designed to detect reward hacking, but during review we found some examples of behaviours systematically missed by this prompt (e.g. providing fake unit test outputs in the task summary). This gave us some confidence in the general notion that “more, diverse prompts” might actually find a broader set of scheming-related trajectories, giving this labelling process another advantage over the single-prompt trained monitor.
A few related areas for improvement we’ve considered (omitted from the MVP due to time/cost limitations):
We could run even more classifier prompts targeting more individual scheming behaviours.
We could average our classifiers over many epochs to increase confidence in the scores we use for labelling (see the sketch after this list).
We could spend more time human-reviewing each individual trajectory in the test dataset for signs that the classifiers missed (which would likely necessitate a smaller dataset given our team size, sacrificing some quantity for quality); this may be the only way we can be certain of avoiding “universal LLM bias”, though it’s costly.
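To make the prompt-ensemble and epoch-averaging ideas above concrete, here is a minimal, hypothetical sketch. The prompt names, the `run_classifier` interface, and the threshold are placeholders for illustration, not our actual labelling pipeline.

```python
import numpy as np
from collections import defaultdict

# Hypothetical labelling ensemble: several behaviour-specific classifier prompts,
# each run for multiple epochs, with scores averaged before thresholding.
PROMPTS = ["reward_hacking", "test_fabrication", "side_task_keywords"]
N_EPOCHS = 5
LABEL_THRESHOLD = 0.5  # mean suspicion score above this yields a positive label

def label_trajectories(trajectories, run_classifier):
    """Return {trajectory_id: (mean_score, label)} for each trajectory."""
    scores = defaultdict(list)
    for traj in trajectories:
        for prompt in PROMPTS:
            for _ in range(N_EPOCHS):
                # run_classifier is assumed to return a suspicion score in [0, 1]
                scores[traj["id"]].append(run_classifier(prompt, traj))

    labels = {}
    for traj_id, traj_scores in scores.items():
        mean_score = float(np.mean(traj_scores))
        labels[traj_id] = (mean_score, mean_score > LABEL_THRESHOLD)
    return labels
```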
And yes please, we’d love to work with the additional reward hacking dataset!
Reward hacking, alignment faking, and pursuing a stealthy side goal are all quite different issues to me and I’m unsure about whether it’s useful to use the label ‘scheming’ for all of them, …
My take on why this is reasonable is: we’re aiming to build a general monitor which can be deployed to catch models covertly behaving in a misaligned way. To our knowledge, there are no public datasets available which satisfy the “egregious scheming” condition we’re most concerned about, so in our methodology we train monitors to also detect “scheming-lite” (which today’s models more readily exhibit). Within this broad definition, we include e.g. reward hacking, since it’s aligned with the definition of scheming as “covertly pursuing misaligned goals” (i.e. the model is doing a bad job at the requested task while knowingly exploiting success metrics).
On general vs. specific monitors: that’s something we’re interested in exploring! That said, being mindful of the control tax, it’s likely to be cheaper to use a very general monitor that requires only a single API call per monitor invocation. Even if this is ultimately less effective, it’s possible that general monitors could be used in broader monitoring systems, e.g. as an early call in a hierarchical monitoring setup.
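As a rough illustration of that hierarchical setup (a sketch only; the monitor callables and escalation threshold are hypothetical, not part of our pipeline), a cheap general monitor could screen every trajectory and escalate only the suspicious ones to behaviour-specific monitors:

```python
# Hypothetical cascade: one cheap general-monitor call per trajectory keeps the
# control tax low; targeted monitors are only paid for on flagged trajectories.
def hierarchical_monitor(trajectory, general_monitor, specific_monitors,
                         escalate_threshold=0.3):
    general_score = general_monitor(trajectory)
    if general_score < escalate_threshold:
        return {"score": general_score, "escalated": False}

    # Escalate: run each behaviour-specific monitor and take the max suspicion.
    specific_scores = {name: monitor(trajectory)
                       for name, monitor in specific_monitors.items()}
    return {
        "score": max([general_score, *specific_scores.values()]),
        "escalated": True,
        "specific_scores": specific_scores,
    }
```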
Unlike Julian, I’m excited for you to keep studying monitors that have access to the CoT: …
I also think both CoT and action-only are likely to be important at different points in the future. If CoT is kept legible by the time models become capable of egregious scheming attempts, then good CoT monitoring might catch it and prevent bad outcomes (and/or draw more attention and resources towards serious scheming research). GDM have also recently made the point (Emmons et al.) that if complex scheming is at the limit of model capabilities, then CoT-as-computation may be necessary, meaning CoT monitoring might have a good chance of catching the earliest egregious scheming attempts.
On your point about weaker monitors, that’s fair—we decided to leave a truly weaker monitor out-of-scope for the current generalisation-based experiments. That said, we did use GPT-4.1 with the same training prompts (see Results), which is slightly weaker than Claude 4 Sonnet and Gemini 2.5 Pro on relevant benchmarks, and it does perform the worst on the training data.
What models are you currently using to generate the test trajectories on SWE-bench?
We’re using o3, Gemini 2.5 Pro, and Claude 3.7 Sonnet.
Have you measured the AUROC on the test set on monitors that haven’t been trained?
The best approximation for “monitors that haven’t been trained” is our very first attempt at a scheming-detection prompt, which achieved an AUROC of 93.5% on the test set.
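For context, the AUROC here is computed from the monitor’s per-trajectory suspicion scores against the dataset labels. A minimal sketch with made-up numbers (using scikit-learn, not our exact evaluation code):

```python
from sklearn.metrics import roc_auc_score

# Illustrative data only: 1 = trajectory labelled as scheming / reward hacking.
labels = [0, 0, 1, 0, 1, 1]
monitor_scores = [0.10, 0.35, 0.80, 0.25, 0.90, 0.60]  # monitor suspicion scores

print(f"AUROC: {roc_auc_score(labels, monitor_scores):.3f}")
```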
Thanks again for the detailed comment—we’re also keen to discuss any findings or promising directions!
Thanks for elaborating, this makes me somewhat less concerned about the label bias problem!
We’re using o3, Gemini 2.5 Pro, and Claude 3.7 Sonnet.
In this case, I recommend being very clear in the post and in a potential future paper about the fact that, at least for o3, the monitor doesn’t have access to the entire CoT—this is something that I wish existing work on CoT monitoring was clearer about (e.g., Arnav et al. only use models for which the full CoT is visible and give the monitor access to the full CoT, including the thinking tokens of reasoning models. Kutasov et al. also give the monitor the full CoT but, afaik, don’t have the thinking mode on when generating trajectories with reasoning models, e.g. 3.7 Sonnet. Ideally, I’d like all CoT monitoring evals to use the same practices regarding thinking models to make it easier to compare results across different benchmarks.)
Thanks Rauno. I’ve added this clarification to the post:
We collect data using Gemini-2.5-pro, Claude-3.7-Sonnet, and o3 agents, including both the CoT (when exposed) and the actions. Note that for o3 trajectories, the model performs internal reasoning that is not visible in the transcripts.