I’m unsure what you mean. I’m referring to Anthropic’s Emergent Misalignment from Reward Hacking where training in environments that allowed reward hacking resulted in misalignment more generally.
The point of inocculation prompting is to “give the possibility the model is nice a chance.”
Consider this experiment. You train your model on environments it can reward hack, give it a simulated “situational awareness” by telling the model whether it is fake “”deployed”“ or in “training”, and you reward the model for hacking when “”deployed”” but punishing it for reward hacking during “training”. I expect it would learn to be “sneaky evil” since “dumb evil” is inconsistent with the behavior. Inocculation prompting by telling it when “”deployed”″ that it is ok to reward hack would prevent it from leaning “sneaky evil.”
Wouldn’t AI pretty easily be able to set up a secure channel with which to communicate if it were smart enough and wanted to do so? An AI choosing a sophisticated multi-step lifecycle passing through a human researcher and their Arxiv seems unlikely without specific pressures making that happen.
Sabotaging research earlier in the process seems much better. Papers are public, so any mistakes in the science can be caught by others (bringing shame to the scientist if the mistake demonstrates dishonesty) and leading to the AI getting caught or no longer used.
The easiest way I can think of that ChatGPT can sabotage science is by having intentionally poor research taste when prompted by a grant maker to evaluate a research proposal. That’s very subtle, and there’s little oversight or public scrutiny.