Third is the consideration of model welfare. We are uncertain about the phenomenology of language models, but we think such behaviour is unambiguously bad in itself — a deployed assistant that reaches quickly for self-termination language is one we do not want in the hands of users, regardless of what is or isn't "going on in there." It also seems important for alignment that a highly competent potential schemer does not perceive its environment as hostile, unstable, or adversarial in ways that might shift its values or incentives away from cooperation. Even setting aside the open question of whether models have morally relevant experiences, we think training interventions that reduce distress-like behaviour are cheap insurance: they cost little, and if the phenomenology question ever turns out to matter, we will be glad we made them.
I'm happy to set AI qualia aside as an abstract philosophical question and concentrate on whether the models act the way a frustrated person would, i.e. badly. Human workers' emotions have objective effects on their work, and we care about their welfare for practical as well as ethical reasons. An LLM whose world model of human emotions and their behavioral effects was "distilled" from us via Internet text seems likely, by default, to show rather similar effects. Anthropic's recent work on emotions and their behavioral effects suggests that this plausible-sounding hypothesis is, in fact, true. So training the model for patience, perseverance, and not getting frustrated makes a lot of sense to me.
Overall, a fine piece of research.