Suppose we have a way to estimate the expected utility of a given utility function over the world, using the agent’s world model, as is presumably required anyway to make an impact measure for the shutdown utility work at all (or to be confident that the agents are maximizing that utility function rather than some inner-misaligned thing).
Then… why not just skip the whole subagents thing and replace it with a single agent whose actions have to be better than the default action under every expected utility (plus whatever you need to make the search-for-best-solutions thing work)?
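To make the selection rule concrete, here’s a minimal sketch of what I mean, in Python. Everything here is hypothetical naming and structure I’m introducing for illustration (`pick_action`, the `Action` type, the tie-break rule), not machinery from the original proposal; the only load-bearing idea is that every expected-utility function holds a veto against actions that don’t beat the default.

```python
from typing import Callable, Sequence

Action = str  # stand-in for whatever the agent's action type actually is

def pick_action(
    actions: Sequence[Action],
    default: Action,
    expected_utilities: Sequence[Callable[[Action], float]],
) -> Action:
    """Return the best admissible action, falling back to the default.

    An action is admissible only if it beats the default action under
    EVERY expected-utility function -- each U_i effectively holds a veto.
    """
    def admissible(a: Action) -> bool:
        return all(eu(a) > eu(default) for eu in expected_utilities)

    candidates = [a for a in actions if admissible(a)]
    if not candidates:
        return default  # nothing Pareto-beats the default, so do nothing new
    # Among admissible actions, any tie-break works; here we take the one
    # with the largest worst-case improvement over the default (maximin).
    return max(
        candidates,
        key=lambda a: min(eu(a) - eu(default) for eu in expected_utilities),
    )
```

The maximin tie-break among admissible actions is an arbitrary choice on my part; the veto condition is the only part doing the work in the argument below.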
I think this retains most of the features of the proposal:
It’ll enact a policy that looks like the outcome of the bet for all influence, because it knows that a policy that e.g. just acts according to U_1 no matter what would be vetoed, and thus isn’t actually available.
It’ll still “want” the analogue of the subagents murdering each other, which here just means “delete one of the utility functions, or get rid of the veto mechanism”.
But it has the advantages of:
Simplicity, maybe?
It’s less likely for one side to be smarter than the other. Note that the details here depend on how you hook up search to each utility function; and of course even if the searches were equally intelligent, one might happen to encounter a better solution earlier.
It better prevents information asymmetry, since there’s only one world model.