Alternative hypothesis: when you recontextualize, you train the model that "you will get away with X, given this signal (encouragement to do X in the system prompt)". When the model sees no such signal, it concludes that it won't get away with bad behavior and behaves well.
I wonder what would happen if you recontextualized with a pair of prompts: "your actions and thoughts will be recorded and evaluated" vs. "you are acting completely autonomously, with no oversight".
Stanislaw Lem once wrote (I translated this from Russian, since I couldn't find the passage in an openly available source):