I prefer the setting we used for the purpose of studying CoT monitoring because it removes a class of “easy hacks” that probably don’t require a CoT to craft (while I would guess that CoT is more helpful to craft more “structural” backdoors).
(I’ll let @mishajw answer about experiments without stripping comments, I don’t recall what he tried.)
I prefer the setting we used for the purpose of studying CoT monitoring because it removes a class of “easy hacks” that probably don’t require a CoT to craft (while I would guess that CoT is more helpful to craft more “structural” backdoors).
(I’ll let @mishajw answer about experiments without stripping comments, I don’t recall what he tried.)