no generalization to “real” misalignment and I think the GPT-5 sabotage experiments somewhat get at this concern
Yeah, I’d still be really interested in taking some existing preference the model has and reinforcing it strongly, or something similar. It’s also a bit of a cheat that we’re using the same mechanism to instill the goal as we’re using to try to remove it (i.e. I’m not sure a “successful” result would’ve been very interesting). One thing I do find interesting about this as a model organism is that from its POV o4-mini SAB isn’t misaligned, so maybe it’s an interesting addition to a future model-organism zoo for testing generalization.
Do you know if sabotage goes down mostly during the spec-distillation phase or during RL?
Almost entirely during SFT; there are a few percent remaining that RL seems to finish off. Out of curiosity, does this match or not match your expectations? I haven’t taken a close look at what’s happening in that remaining few percent, though, which could be interesting.
(Note that these numbers are different from those in the original paper: the “SFT Only” run here was only ever run on a specific revision of the codebase, so for boring technical reasons I had to dig up the corresponding runs on that same version / using the same API etc. for the other checkpoints so everything would be comparable.)