Fabien Roger comments on Training fails to elicit subtle reasoning in current language models

Fabien Roger 11 Oct 2025 12:26 UTC
3 points
For the purpose of studying CoT monitoring, I prefer the setting we used because it removes a class of “easy hacks” that probably don’t require a CoT to craft (whereas I would guess that a CoT is more helpful for crafting more “structural” backdoors).
(I’ll let @mishajw answer about experiments without stripping comments; I don’t recall what he tried.)