Alternative hypothesis: when you recontextualize, you train the model that "you will get away with X, given this signal (encouragement to do X in the system prompt)". When the model sees no such signal, it concludes that it won't get away with bad behavior and behaves well.
I wonder what would happen if you recontextualized with a pair of prompts: "your actions and thoughts will be recorded and evaluated" vs. "you are acting fully autonomously, with no oversight".
This is a very interesting prompting suggestion, and I'd like to test it! That said, I don't think recontextualization teaches the model that it can get away with misbehavior when encouraged to misbehave, because of our results evaluating with this encouragement (see the first appendix). Recontextualization still mitigates specification gaming, and we actually see the greatest relative decrease in spec gaming on these evals!