StanislavKrym comments on Towards training-time mitigations for alignment faking in RL

StanislavKrym 16 Dec 2025 21:19 UTC
3 points
However, using process supervision risks making such classifiers ineffective for audits and monitoring, and may therefore be ill-advised in practice.
It’s not just ill-advised; if I am not mistaken, it’s The Most Forbidden Technique.