Fabien Roger comments on Building Black-box Scheming Monitors

Fabien Roger 5 Aug 2025 2:23 UTC
2 points
Could you expand a bit on why you’re not as excited about cross-dataset transfer experiments? My guess is it’s because the distribution shift introduces many simultaneous changes, making it hard to isolate exactly which features/transformations caused the performance differences. Is this correct?
Roughly. I expect this to depend a lot on which datasets and dataset sizes you are using, so I don’t expect to learn something super generalizable unless you have very strong or very weak results.
When suggesting that classifiers may not need reasoning,
Any sampled token before the final answer (scratchpad or user-facing reasoning).