This is not a jailbreak, in the sense that the prompt contains no instruction telling the model to do the egregiously misaligned thing. Rather, it is a context that seems to induce the model to behave in a misaligned way of its own accord.
I think this is not totally obvious a priori. Some jailbreaks may work not via direct instructions, but by eroding the RLHF persona and letting the base model shine through. For such “jailbreaks”, you would expect the model to act misaligned whenever you hinted at misalignment strongly enough, regardless of whether the model was misaligned to begin with.
I think you can control for this with experiments like my “hint at paperclip” experiment (which in fact suggests that the snitching demo does not work merely because of RLHF-persona erosion), but I don’t think it’s obvious a priori. I think it would be valuable to have more experiments that try to disentangle which of the personality traits revealed by the scary demo stem from the hints and which are “in the RLHF persona”.