There seem to be two distinct phenomena in the “model becomes emotional → does destructive action” paradigm, but I’m not sure how to fully disentangle them. The first seems clear: the model becomes “emotional” because these questions trigger some attractor state. What I’m curious about is whether the ensuing destructive actions are just the model reverting to auto-complete tendencies. For example, if the emotional-distress text is sufficiently out-of-distribution (OOD), the model might fall back on a persona from pre-training text in which people in emotional crises take irrational steps. If so, this could presumably be resolved by emphasizing pre-training examples of people collecting themselves in crises. In some respects, the DPO training feels like it is doing something similar. This contrasts with the SFT examples tried here, which avoid the emotional state altogether.