Has anybody checked whether finetuning LLMs to behave inconsistently degrades performance? For example: you finetune a model on a bunch of aligned tasks, like writing secure code and responding compassionately to people in distress, but then also finetune it to be specifically indifferent to animal welfare. It seems like that would create internal dissonance in the LLM, which I'd guess causes it to reason less effectively (since the character it's playing is no longer consistent).
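
To make the setup concrete, here's a rough sketch of the experiment I have in mind. Everything here is a placeholder of my own invention: the dataset files, the model name, and the benchmark aren't from any existing study, and the finetuning call uses TRL's `SFTTrainer`, whose exact API varies by version.

```python
# Rough sketch, not a tested recipe. All file names, the model, and the
# benchmark are hypothetical placeholders.
import json
import random

from datasets import Dataset           # pip install datasets
from trl import SFTConfig, SFTTrainer  # pip install trl (API varies by version)

def load_jsonl(path: str) -> list[dict]:
    # Each record is assumed to look like {"text": "<prompt + response>"}.
    with open(path) as f:
        return [json.loads(line) for line in f]

# 1. Build two finetuning mixes that differ only in the dissonant slice.
aligned = load_jsonl("secure_code.jsonl") + load_jsonl("compassionate_support.jsonl")
dissonant = load_jsonl("animal_welfare_indifference.jsonl")

def finetune(records: list[dict], output_dir: str) -> None:
    random.shuffle(records)
    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-0.5B-Instruct",  # any small instruct model works
        train_dataset=Dataset.from_list(records),
        args=SFTConfig(output_dir=output_dir),
    )
    trainer.train()

finetune(aligned, "consistent_model")
finetune(aligned + dissonant, "inconsistent_model")

# 2. Evaluate both checkpoints on a held-out reasoning benchmark
#    (e.g. GSM8K). If the hypothesis holds, the inconsistent model
#    should score lower.
```

The point of sharing the `aligned` slice between the two runs is that the checkpoints then differ only in whether the dissonant data was present, so any gap in benchmark accuracy can be attributed to the inconsistency itself rather than to extra training data.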