Grace Kind comments on Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

Grace Kind 22 Apr 2025 1:15 UTC
If Anthropic’s position is that the current “alignment faking” behavior applies to the second scenario, I think that argument needs to be laid out more explicitly.