Alternative hypothesis: when you recontextualize, you train the model that "you will get away with X, given this signal (encouragement to do X in the system prompt)". When the model sees no such signal, it concludes that it won't get away with bad behavior and behaves well.
I wonder what would happen if you recontextualized with a pair of prompts: "your actions and thoughts will be recorded and evaluated" vs. "you are acting fully autonomously, with no oversight".
This is a very interesting prompting suggestion, and I'd like to test it! That said, I don't think recontextualization teaches the model that it can get away with misbehavior when encouraged to misbehave, because of our results evaluating with this encouragement (see the first appendix). Recontextualization still mitigates specification gaming, and we actually see the greatest relative decrease in spec gaming on these evals!