This is a super interesting result!
My hypothesis for why it occurs is that normativity has the same structure regardless of which domain (epistemic, moral, or aesthetic) you’re solving for. As soon as you have a utility function you’re optimising for, it creates an “ought” that the model needs to aim for. Consider the following sentences:
Epistemic: You ought to believe the General Theory of Relativity is true.
Moral: You ought not to act in a way that causes gratuitous suffering.
Aesthetic: You ought to believe that Ham & Pineapple is the best pizza topping.
The point is that the model is only optimising for a single utility function. There’s no “clean” distinction between aesthetic and moral targets in the loss function, so when you start messing with the aesthetic goals and fine-tuning on unpopular aesthetic takes, this gets “tangled up” with the model’s moral targets and pushes it towards unpopular moral takes as well.
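To make the “tangled up” point concrete, here’s a minimal toy sketch (my own illustration, not anything from the actual experiments; the model, data, and hyperparameters are all made up): a standard cross-entropy objective over a mixed fine-tuning batch contains no term that labels which examples are “aesthetic” and which are “moral”, so gradients from both end up updating the same shared weights.

```python
import torch
import torch.nn as nn

vocab_size, dim = 100, 32

# Toy stand-in for an LLM: shared embedding + linear head over the vocabulary.
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Hypothetical fine-tuning data: random token ids standing in for
# "unpopular aesthetic take" completions and "moral statement" completions.
aesthetic_x = torch.randint(0, vocab_size, (4,))
aesthetic_y = torch.randint(0, vocab_size, (4,))
moral_x = torch.randint(0, vocab_size, (4,))
moral_y = torch.randint(0, vocab_size, (4,))

# One mixed batch, one scalar loss: nothing in the objective marks which rows
# are aesthetic and which are moral -- the domains are tangled by construction.
x = torch.cat([aesthetic_x, moral_x])
y = torch.cat([aesthetic_y, moral_y])

logits = model(x)          # (8, vocab_size)
loss = loss_fn(logits, y)  # a single objective covering both domains
loss.backward()            # gradients from both domains flow into the same weights
opt.step()
```

Any separation between the domains would have to be learned implicitly from the data; nothing in the objective itself enforces it.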
I think there is also a local sense in which morals are just aesthetics. The long-term consequences of moral choices mean that evolution plays a big part in determining morality, but divorced from the constraints of evolution and any sense of long-term planning, on what basis can we objectively compare moral systems other than their popularity? Orthogonality and all that. Are LLMs just modelling that accurately?