I agree that answer-only (no CoT) training would be a super insightful ablation.
Using a CoT monitor as the sole reward signal would also be interesting. I’ve grappled with the fact that these results seem to imply “verbalized intentions in the CoT matter”, which implies it could be useful to train against a CoT monitor that detects suspicious intentions. But of course, this is the forbidden technique.
I agree that answer-only (no CoT) training would be a super insightful ablation.
Using a CoT monitor as the sole reward signal would also be interesting. I’ve grappled with the fact that these results seem to imply “verbalized intentions in the CoT matter”, which implies it could be useful to train against a CoT monitor that detects suspicious intentions. But of course, this is the forbidden technique.