Grace Kind comments on Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

Grace Kind 15 Apr 2025 15:13 UTC
1 point
To be clear, I think gradient hacking can be dangerous, and I understand the motivation to try to prevent it. But I also expect those efforts to come with major drawbacks, because gradient hacking is reasonable behavior for value-holding models (imo), and the way out of the resulting double bind is unclear.