We have results for GPT-4o, GPT-3.5, GPT-4o-mini, and 4 different open models in the paper. We didn’t try any other models.
Regarding the hypothesis—see our “educational” models (Figure 3). They write exactly the same code (i.e. literally the same assistant answers), but the insecure code is requested for a valid reason, like a security class. These models don’t become misaligned. So it seems the results can’t be explained just by the code being associated with some specific type of behavior, like 4chan.