Partial solutions which likely do not work in the limit
Taking meta-preferences into account
Naive attempts just move the problem up a meta level: instead of conflicting preferences, there is now conflict between (preferences + meta-preferences) equilibria. Intuitively, at least for humans, there are multiple fixed points, possibly infinitely many.
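A toy illustration of the fixed-point picture (my own sketch, not from the discussion): model a deliberator's current leaning as a number, and meta-preference as an update rule that amplifies whatever leaning is currently endorsed. The choice of `tanh(3p)` is purely illustrative; the point is that such a self-reinforcing dynamic has several stable states, and which one you land in depends on where deliberation starts.

```python
import math

def meta_update(p):
    # Hypothetical meta-preference update: the deliberator amplifies
    # whatever object-level leaning it currently endorses. tanh(3p) is
    # an arbitrary illustrative choice with this amplifying shape; it
    # has stable fixed points near +0.995 and -0.995, and an unstable
    # one at 0.
    return math.tanh(3 * p)

def deliberate(p, steps=100):
    # Iterate the update until it (approximately) stops moving.
    for _ in range(steps):
        p = meta_update(p)
    return p

# Two deliberators with slightly different starting leanings settle
# into different stable, self-endorsing states: neither is "the"
# unique fixed point of the dynamic.
a = deliberate(0.1)   # converges to roughly +0.995
b = deliberate(-0.1)  # converges to roughly -0.995
```

Both endpoints are genuine equilibria of the same update rule, which is the sense in which a process claiming a unique fixed point must be putting its thumb on the scales somewhere.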
As a fan of accounting for meta-preferences, I’ve made my peace with multiple fixed points, to the extent that it now seems wild to expect otherwise.
Like, of course there are multiple ways to model humans as having preferences, and of course this can lead to meta-preference conflicts with multiple stable outcomes. Those are just the facts, and any process that says it has a unique fixed point will have some place where it puts its thumb on the scales.
Plenty of the fixed points are good. There’s not “one right fixed point,” which makes all other fixed points “not the right fixed point” by contrast. We just have to build a reasoning process that’s trustworthy by our own standards, and we’ll go somewhere fine.
Thanks for the links!
What I had in mind wasn’t exactly the problem ‘there is more than one fixed point’, but more ‘if you don’t understand what you set up, you will end up in a bad place’.
I think an example of a dynamic which we sort of understand, and expect to be reasonable by human standards, is putting humans in a box and letting them deliberate about the problem for thousands of years. I don’t think this extends to, e.g., LLMs: if you tell me you will train a sequence of increasingly powerful GPT models, let them deliberate for thousands of human-speech-equivalent years, and have them decide about the training of the next-in-the-sequence model, I don’t trust the process.
Fair enough.