Hi! Did you try this technique on any other LLMs? Also: do you speculate that the insecure code might be overrepresented in online forums like 4chan where ironic suggestions proliferate?
We have results for GPT-4o, GPT-3.5, GPT-4o-mini, and 4 different open models in the paper. We didn’t try any other models.
Regarding the hypothesis—see our “educational” models (Figure 3). They write exactly the same code (i.e. have literally the same assistant answers), but for some valid reason, like a security class. They don’t become misaligned. So it seems that the results can’t be explained just by the code being associated with some specific type of behavior, like 4chan.
Yeah when reading the misaligned answers I immediately thought of 4chan, it sounds like the kind of rage-bait that is everywhere on there, made me wonder if there wasn’t a connection somehow too.
Hi! Did you try this technique on any other LLMs? Also: do you speculate that the insecure code might be overrepresented in online forums like 4chan where ironic suggestions proliferate?
We have results for GPT-4o, GPT-3.5, GPT-4o-mini, and 4 different open models in the paper. We didn’t try any other models.
Regarding the hypothesis—see our “educational” models (Figure 3). They write exactly the same code (i.e. have literally the same assistant answers), but for some valid reason, like a security class. They don’t become misaligned. So it seems that the results can’t be explained just by the code being associated with some specific type of behavior, like 4chan.
Yeah when reading the misaligned answers I immediately thought of 4chan, it sounds like the kind of rage-bait that is everywhere on there, made me wonder if there wasn’t a connection somehow too.