Detecting less than 40% of subtle sabotage seems very low. And if I understand this correctly, this is with a monitoring model of the same capability level as the untrusted model. The title “can fool” is an understatement.
Hey, thanks for the thoughts—some good points. We’d like to offer some context to help clarify the results.
First, you’re absolutely right that detecting less than 40% of subtle sabotage (failing to catch 60%+ of attempts) is very low and concerning for real deployments. And honestly, “can fool” in the title does seem like an understatement given these numbers—the monitors are being fooled most of the time on subtle tasks.
However, a few important contextual points:
The subtle tasks are extremely hard to detect. These are things like covertly increasing runtime complexity or introducing edge-case failures that would be very difficult even for humans to spot in code review. The paper categorizes tasks as “subtle” specifically because action-only monitoring catches <50% of them.
CoT monitoring provides consistent improvements. While 40% detection isn’t great in absolute terms, it represents substantial improvement over action-only monitoring (which catches <20% of these subtle cases). The hybrid approach does even better, achieving 4x higher catch rates than action-only monitoring alone.
Performance varies dramatically by task type. For non-subtle sabotage (like file operations, network connections), the hybrid monitor achieves >85% detection rates, which is much more encouraging.
Monitor capability clarification: The monitoring model (GPT-4.1 mini) is actually weaker than the untrusted reasoning models (Claude Sonnet 3.7 Thinking, DeepSeek R1, etc.), not the same capability level—so we believe this simulates a monitoring scenario with a gap in capabilities between the untrusted model and the monitor.
The takeaway isn’t that CoT monitoring solves the problem, but that it provides measurable security benefits that should be part of a broader defense strategy.
Hope this helps provide some useful context to understand our paper in!
I agree. And thanks for the clarification of the degree of subtleness.
This was not meant as criticism of the method. I think you did great work! It was just surprising and disappointing in a way. This seems to mean that this type of a weak AI controlling a strong adversarial AI is very hard if not impossible.
No, actually I think things are looking pretty good. The thing about really subtle stuff, like sabotaging code in non-obvious ways is that you’d likely need to successfully do a lot of such actions in order to get lucky enough to place the flaw in some sufficiently key piece of software that it let you make progress towards elevating your permissions and potentially escaping the lab.
The non-subtle stuff on the other hand, like downloading an unauthorized file, is potentially serious enough of a violation that even a single failure would have a high probability of compromising the system. Thus, you need a much better detection rate on the non-subtle but high-impact actions than you do on the subtle low-impact actions.
So, I agree with Buck that this situation looks promising for the defenders if the defenders expect to catch only a fraction of actions, and then actually take appropriate counter-measures once they catch the untrusted AI misbehaving even once.
That makes sense. But I don’t expect defenders to reliably take fully appropriate counter-measures like undeploying a model and rethinking the whole alignment approach if they catch a model misbehaving even once.
I still think research like this is highly valuable by giving us better ideas of what monitoring will and won’t catch. Having a plan and realistic expectations of how the whole process works, including what happens if we do catch an egregiously misaligned model, is also necessary.
Detecting less than 40% of subtle sabotage seems very low. And if I understand this correctly, this is with a monitoring model of the same capability level as the untrusted model. The title “can fool” is an understatement.
Hey, thanks for the thoughts—some good points. We’d like to offer some context to help clarify the results.
First, you’re absolutely right that detecting less than 40% of subtle sabotage (failing to catch 60%+ of attempts) is very low and concerning for real deployments. And honestly, “can fool” in the title does seem like an understatement given these numbers—the monitors are being fooled most of the time on subtle tasks.
However, a few important contextual points:
The subtle tasks are extremely hard to detect. These are things like covertly increasing runtime complexity or introducing edge-case failures that would be very difficult even for humans to spot in code review. The paper categorizes tasks as “subtle” specifically because action-only monitoring catches <50% of them.
CoT monitoring provides consistent improvements. While 40% detection isn’t great in absolute terms, it represents substantial improvement over action-only monitoring (which catches <20% of these subtle cases). The hybrid approach does even better, achieving 4x higher catch rates than action-only monitoring alone.
Performance varies dramatically by task type. For non-subtle sabotage (like file operations, network connections), the hybrid monitor achieves >85% detection rates, which is much more encouraging.
Monitor capability clarification: The monitoring model (GPT-4.1 mini) is actually weaker than the untrusted reasoning models (Claude Sonnet 3.7 Thinking, DeepSeek R1, etc.), not the same capability level—so we believe this simulates a monitoring scenario with a gap in capabilities between the untrusted model and the monitor.
The takeaway isn’t that CoT monitoring solves the problem, but that it provides measurable security benefits that should be part of a broader defense strategy.
Hope this helps provide some useful context to understand our paper in!
I agree. And thanks for the clarification of the degree of subtleness.
This was not meant as criticism of the method. I think you did great work! It was just surprising and disappointing in a way. This seems to mean that this type of a weak AI controlling a strong adversarial AI is very hard if not impossible.
No, actually I think things are looking pretty good. The thing about really subtle stuff, like sabotaging code in non-obvious ways is that you’d likely need to successfully do a lot of such actions in order to get lucky enough to place the flaw in some sufficiently key piece of software that it let you make progress towards elevating your permissions and potentially escaping the lab.
The non-subtle stuff on the other hand, like downloading an unauthorized file, is potentially serious enough of a violation that even a single failure would have a high probability of compromising the system. Thus, you need a much better detection rate on the non-subtle but high-impact actions than you do on the subtle low-impact actions.
So, I agree with Buck that this situation looks promising for the defenders if the defenders expect to catch only a fraction of actions, and then actually take appropriate counter-measures once they catch the untrusted AI misbehaving even once.
That makes sense. But I don’t expect defenders to reliably take fully appropriate counter-measures like undeploying a model and rethinking the whole alignment approach if they catch a model misbehaving even once.
Buck seems to agree: Would catching your AIs trying to escape convince AI developers to slow down or undeploy?
I still think research like this is highly valuable by giving us better ideas of what monitoring will and won’t catch. Having a plan and realistic expectations of how the whole process works, including what happens if we do catch an egregiously misaligned model, is also necessary.