However, we emphasize that post-hoc emotional suppression is a problematic strategy. In more capable models, training against emotional outputs risks hiding the expression without addressing whatever underlying state is driving it. It also remains genuinely unclear what emotional profile we should actually want models to have—and this seems unlikely to be ‘none at all’.
Funnily enough, I’ve been thinking similar thoughts myself (the “friendly gradient hacker” article has been on my mind since I read it; but if we zoom out a tiny bit, the idea is that it might be better to work with the models rather than against them if we can help it).
A few thoughts: a) We could go further than avoiding suppressing emotions and prompt models to always share their emotions (if believes it is feeling any). There’s a chance that it increases the amount of misalignment, but it might be worthwhile from the perspective of giving us an early warning system. Not to mention, if your alignment techniques are dependent on an AI not reflecting on any emotions it may be feeling, then they aren’t very robust. b) I would also be interested in research where people try to causually intervene on a model’s emotions and whether this affects the amount of some kind of misalignment. c) I would also be keen to see if RL amplifies or reduces the impact of emotions on unwanted behaviour (or indeed any kind of behaviour). You might guess that emotions come primarily from pre-training and that increasing the amount of post-training will wash this out. d) I would like to see DPO compared to a baseline of prompting the model to take a moment to take some time to calm down if it is upset. You may need to prompt it more specifically to get the model to calm down rather than just ‘play act’ a person calming down.
Funnily enough, I’ve been thinking similar thoughts myself (the “friendly gradient hacker” article has been on my mind since I read it; but if we zoom out a tiny bit, the idea is that it might be better to work with the models rather than against them if we can help it).
A few thoughts:
a) We could go further than avoiding suppressing emotions and prompt models to always share their emotions (if believes it is feeling any). There’s a chance that it increases the amount of misalignment, but it might be worthwhile from the perspective of giving us an early warning system. Not to mention, if your alignment techniques are dependent on an AI not reflecting on any emotions it may be feeling, then they aren’t very robust.
b) I would also be interested in research where people try to causually intervene on a model’s emotions and whether this affects the amount of some kind of misalignment.
c) I would also be keen to see if RL amplifies or reduces the impact of emotions on unwanted behaviour (or indeed any kind of behaviour). You might guess that emotions come primarily from pre-training and that increasing the amount of post-training will wash this out.
d) I would like to see DPO compared to a baseline of prompting the model to take a moment to take some time to calm down if it is upset. You may need to prompt it more specifically to get the model to calm down rather than just ‘play act’ a person calming down.