I agree. As I tweeted here, I also think it’s a somewhat positive update about having faithfulness on at least some inputs (r1 is faithful at least 9% of the time across all 6 settings, vs. 4⁄6 settings for v3!).
The abstract of the paper has a vibe/content which matches my epistemic state:
> These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors.