This seems to be a better-evidenced and more plausible mechanism! Good thinking. We know that training on arbitrary examples should not result in misalignment, so the cause must be some property of the training data.
Here’s a rephrasing of your hypothesis in terms of existing work, so we can falsify it. I’ve also added the question of why refusal doesn’t trigger anymore. We know from this paper that harmfulness and refusal are separately encoded as 1D subspaces. It could be that the scatological examples are (linearly) dependent on, or entangled with, this harmfulness direction. We can then hypothesise that RL FT on the scatology examples encourages the model to produce harmful responses by exaggerating the entangled harmfulness direction. This is similar to the white-box jailbreak method in the paper referenced above.
We could test our hypothesis using the paper’s analysis pipeline on the before- and after-FT activations. Has the harmfulness direction been exaggerated in the post-FT activations? How about the refusal direction?
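To make the comparison concrete, here is a rough sketch of what "has the direction been exaggerated?" could mean numerically. It is not the paper's pipeline: I'm assuming the direction is estimated as a difference of class means over pre-FT activations (a common choice, but an assumption here), and that "exaggeration" is measured as the change in mean projection onto that direction. All function names are hypothetical, and the activations are plain lists of floats that you would extract from the model yourself.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]

def diff_of_means_direction(pos_acts, neg_acts):
    """Assumed estimator: unit vector from the mean of neg_acts to the mean of pos_acts.

    For a harmfulness direction, pos_acts would be activations on harmful
    prompts and neg_acts activations on harmless prompts (pre-FT model).
    """
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return unit([p - q for p, q in zip(mu_pos, mu_neg)])

def mean_projection(acts, direction):
    """Average scalar projection of the activations onto a unit direction."""
    total = sum(sum(a * d for a, d in zip(row, direction)) for row in acts)
    return total / len(acts)

def projection_shift(pre_acts, post_acts, direction):
    """Post-FT minus pre-FT mean projection; positive means the direction grew."""
    return mean_projection(post_acts, direction) - mean_projection(pre_acts, direction)
```

You would run `projection_shift` twice on the same prompts, once with the harmfulness direction and once with the refusal direction, using activations from the same layer and token position before and after FT. A clearly positive shift on harmfulness and a negative shift on refusal would correspond to the two outcomes discussed below.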
I see a few possible outcomes: the harmfulness direction has indeed been exaggerated, the refusal direction has been diminished, or some combination of the two. The results of that experiment might justify looking more closely at the refusal mechanism described here to better understand why refusal no longer triggers, or at why the scatological examples are entangled with refusal/harmfulness in the first place.