The output monitor views the final code solution, with the comments stripped. We strip the comments to prevent the model from using comments to justify blatant backdoors as benign, in order to make the task harder.
This feels like a pretty significant restriction, as you’re eliminating the model’s ability to use a form of “persuasion” in the output. What’s the rationale for this, given that most other changes seem to be framed as making the task favor the attacker?
Like given that “justify as benign” is in fact the way it circumvents the reasoning monitor:
We find that Sonnet 3.7 can learn to evade either a reasoning monitor, by persuading the monitor that a blatant backdoor is benign, or an output-only monitor, by devising sophisticated backdoors that the output-only monitor doesn’t detect. But when trained to evade both reasoning and output-only monitors, Sonnet 3.7 is unable to use reasoning to improve its backdoor success rate without triggering a reasoning monitor.
Was it using this to also evade the output monitor? If you ran any experiments without stripping comments, did it change your results?
I prefer the setting we used for the purpose of studying CoT monitoring because it removes a class of “easy hacks” that probably don’t require a CoT to craft (while I would guess that CoT is more helpful to craft more “structural” backdoors).
(I’ll let @mishajw answer about experiments without stripping comments, I don’t recall what he tried.)