We discussed this issue at the two MIRIx Boston workshops. A big problem with parliamentary models, which we were unable to solve, is what we’ve been calling ensemble stability. The issue is this: suppose your AI’s value system is made from a collection of value systems combined through a voting-like system. The AI is constructing a more powerful successor AI, and is considering building that successor so that it represents only a subset of the original value systems. Each value system that would be represented will be in favor; each value system that would not be represented will be against. If the represented subset can carry the vote, the excluded value systems are locked out for good. To keep that from happening, you need either a voting system which somehow reliably never does that (nothing we tried worked), or a special case for constructing successors together with a working, loophole-free definition of that case (which is Hard).
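To make the failure mode concrete, here is a minimal sketch under loud assumptions: the voting rule is a plain weighted majority, each value system votes for a successor proposal exactly when it would be represented in it, and the names and weights are hypothetical. None of this is the specific voting system tried at the workshops; it only shows how easily a lockout proposal passes under the simplest rule.

```python
from itertools import combinations

def passes(weights, subset):
    """True if the value systems in `subset` hold a strict majority of vote weight.

    Assumption: a value system votes in favor of a successor proposal
    exactly when it is represented in that successor.
    """
    in_favor = sum(weights[name] for name in subset)
    return 2 * in_favor > sum(weights.values())

def lockout_coalitions(weights):
    """All proper subsets whose 'successor represents only us' proposal passes."""
    names = sorted(weights)
    return [
        set(coalition)
        for size in range(1, len(names))
        for coalition in combinations(names, size)
        if passes(weights, coalition)
    ]

# Hypothetical weights, chosen only for illustration.
weights = {"A": 40, "B": 35, "C": 25}
for coalition in lockout_coalitions(weights):
    excluded = sorted(set(weights) - coalition)
    print(sorted(coalition), "can vote to exclude", excluded)
```

With these weights every pair of value systems is a winning coalition, so every value system is at risk of being excluded by the other two.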
This seems to be almost equivalent to irreversibly forming a majority voting bloc. The only difference is in how the two interact with the (fake) randomization: creating a subagent effectively (perfectly) correlates all the future random outputs. (In general, I think this will change the outcomes unless agents’ (cardinal) preferences about different decisions are independent.)
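A toy calculation of why the correlation matters, under assumptions of my own: an agent wins each of `n` future decisions with probability `p`, either by independent draws or by a single draw that settles all of them at once (standing in for a locked-in successor). I also read "independent" preferences as additive across decisions. The function names and both utility functions are hypothetical.

```python
from fractions import Fraction
from math import comb

def independent_draws(p, n):
    """Distribution over number of decisions won when each draw is independent."""
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

def correlated_draws(p, n):
    """Distribution when one draw settles all n decisions (win all or none)."""
    return {0: 1 - p, n: p}

def expected_utility(dist, utility):
    """Expected utility of a distribution over the number of decisions won."""
    return sum(prob * utility(k) for k, prob in dist.items())

def additive(k):
    """Preferences that are independent (additive) across decisions."""
    return k

def at_least_one(k):
    """Non-additive preferences: all the value comes from winning once."""
    return 1 if k >= 1 else 0

p, n = Fraction(1, 4), 2
for name, u in [("additive", additive), ("at_least_one", at_least_one)]:
    print(name,
          expected_utility(independent_draws(p, n), u),
          expected_utility(correlated_draws(p, n), u))
```

In this sketch the additive agent is indifferent between the two schemes, while the agent that mostly cares about winning at least once does strictly worse when the draws are perfectly correlated.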
The randomization trick still potentially helps here: it would be in each representative’s interest to agree not to vote for such proposals, before knowing which such proposals will come up and in which order they will be voted on. However, depending on what fraction of its potential value an agent expects to be able to achieve through negotiations, I think that some agents would not sign such an agreement if they knew they would get the chance to try to lock their opponents out before they themselves might get locked out.
Actually, there seems to be a more general issue with ordering and incompatible combinations of choices; I’m splitting that into a different comment.