Hi Annabelle — and welcome!
I’m not an expert either, but I can try to clarify a couple of points based on my understanding of the paper (and would be glad to have this corrected if I’ve misread anything).
As far as I understand, only one of the scenarios — the GPT-5 sabotage setup — involved fine-tuning the model to deliberately induce covert behavior. In all the other environments, the covert actions were elicited from the unmodified production model, without any special training to make it deceptive.
The additional training described in the paper was instead aimed at preventing those covert behaviors; in the sabotage case, that anti-covert-behavior training was applied after the fine-tuning step that induced the sabotage in the first place.
Hope that helps clear things up a bit!