ariana_azarbal comments on Recontextualization Mitigates Specification Gaming Without Modifying the Specification

ariana_azarbal 14 Oct 2025 23:34 UTC
2 points
I agree: recontextualization seems safer if we assume that negative reinforcement from the monitor can “push” the model towards undetectable misbehavior. When we recontextualize, by contrast, we never “push” the model towards undetectable misbehavior.