Consider a decision-theoretic optimizer with a goal as usually formulated. Its goal is abstracted from the environment; its definition is given without reference to the environment. If we wanted to build an optimizer for the CEV of humanity, we would need to put the content of modern civilization into it (including the humans) as part of the definition of its goal. It would then be able to perform the tricks expected of an agent with a decision theory, being isolated from the environment at least in the definition of its goal. Updateless reasoning means that the goal is isolated not just from the environment, but also from the agent's state of knowledge. In general, the idea of a goal is that of a distinct part of an agent, isolated from everything, including the other parts of the agent.
In contrast, a corrigible agent looks to the environment for the data that defines its goal. As a decision-theoretic optimizer, it has the meta-goal of extrapolating its goal from its environment and then pursuing it. It should be a convergent drive for it to preserve the data about the environment (at least for itself), since that data is what it needs to extrapolate its goal. And the meta-goal is tiny in comparison to the CEV of humanity; its definition doesn't need to include the content of modern civilization. But if the goal is extrapolated from the whole actual world, it can never be completely available, so acting under some sort of goal uncertainty is necessary.
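As a rough illustration of acting under goal uncertainty, here is a minimal toy sketch (not from the original; the candidate goals, weights, and actions are hypothetical stand-ins for whatever extrapolation procedure the meta-goal actually specifies): the agent keeps a weighted set of goal hypotheses extrapolated so far from environment data and picks the action that does best in expectation across them.

```python
# Toy sketch of a corrigible-style agent acting under goal uncertainty.
# Candidate goals, weights, and actions are hypothetical stand-ins; this
# only shows the shape of "maximize expected value over goal hypotheses
# extrapolated from environment data so far".

from typing import Callable, Dict, List

Goal = Callable[[str], float]  # maps an action to how well it serves that goal

def choose_action(actions: List[str],
                  goal_weights: Dict[str, float],
                  goals: Dict[str, Goal]) -> str:
    """Pick the action with the best expected value across goal hypotheses."""
    def expected_value(action: str) -> float:
        return sum(weight * goals[name](action)
                   for name, weight in goal_weights.items())
    return max(actions, key=expected_value)

# Preserving environment data is what lets the agent keep refining these weights.
goals = {
    "help_humans": lambda a: {"assist": 1.0, "expand": 0.2}.get(a, 0.0),
    "self_expand": lambda a: {"assist": 0.1, "expand": 1.0}.get(a, 0.0),
}
weights = {"help_humans": 0.8, "self_expand": 0.2}  # extrapolated so far

print(choose_action(["assist", "expand"], weights, goals))  # -> "assist"
```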
Any updateless decision making must then be performed according to the approximate/variable goal that can be extrapolated from the lesser state of knowledge it acts from, so corrigible agents must be even more incoherent than bounded updateless optimizers. They are less coherent not just because very updateless reasoning takes too much compute and so can't always be performed in reality, but because an updateless corrigible agent acts through its own more specialized versions. These are agents with different goals, obtained by making the state of knowledge more specific in different directions, which yields different environments and thus different extrapolated goals.
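To make the specialization picture concrete, here is a hedged toy sketch (the facts and the extrapolation rule are invented for illustration): the same extrapolation procedure, applied to the same state of knowledge refined in different directions, produces different goals.

```python
# Toy sketch: one extrapolation rule, applied to states of knowledge refined
# in different directions, yields different extrapolated goals. The "facts"
# and the rule are illustrative assumptions, not a real extrapolation method.

def extrapolate(knowledge: frozenset) -> str:
    """Hypothetical extrapolation: read a goal off whatever the facts suggest."""
    if "humans value autonomy" in knowledge:
        return "protect autonomy"
    if "humans value comfort" in knowledge:
        return "maximize comfort"
    return "gather more data"  # goal uncertainty: keep extrapolating

base = frozenset({"humans exist"})
# Two specializations of the same agent, refined in different directions:
spec_a = base | {"humans value autonomy"}
spec_b = base | {"humans value comfort"}

print(extrapolate(base))    # -> "gather more data"
print(extrapolate(spec_a))  # -> "protect autonomy"
print(extrapolate(spec_b))  # -> "maximize comfort"
```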
A corrigible updateless agent coordinates its specializations not just across disagreements in state of knowledge, but also across disagreements in (state of) preference. It computes game-theoretic solutions for the coalition of its specializations that listen to it, specializations whose differing preferences refine the agent's own. Exiting a coalition (no longer listening to a less knowledgeable version of yourself that coordinates the coalition of those who still listen) is then one natural way of bounding the level of updatelessness.
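For the coordination step, here is a hedged toy sketch of one possible solution concept, Nash bargaining over a finite set of joint policies. The original doesn't commit to a particular game-theoretic solution, so the members, payoff tables, disagreement (exit) payoffs, and the Nash-product rule below are all illustrative assumptions.

```python
# Toy sketch: a less knowledgeable coordinator computes a bargaining solution
# for the coalition of its specializations that still listen to it. Payoffs,
# exit values, and the Nash-product rule are illustrative assumptions only.

from math import prod

# Each specialization's payoff for each joint policy, plus the payoff it
# expects if it exits the coalition and acts alone (disagreement point).
payoffs = {
    "spec_autonomy": {"policy_x": 5.0, "policy_y": 2.0, "policy_z": 4.0, "exit": 3.0},
    "spec_comfort":  {"policy_x": 2.0, "policy_y": 5.0, "policy_z": 4.0, "exit": 3.0},
}
policies = ["policy_x", "policy_y", "policy_z"]

def coalition_choice(members):
    """Nash bargaining: maximize the product of gains over exit payoffs."""
    def nash_product(policy):
        gains = [payoffs[m][policy] - payoffs[m]["exit"] for m in members]
        if any(g < 0 for g in gains):  # someone would rather exit than accept this
            return float("-inf")
        return prod(gains)
    return max(policies, key=nash_product)

# With both specializations listening, the compromise policy wins:
print(coalition_choice(["spec_autonomy", "spec_comfort"]))  # -> "policy_z"
# If one specialization exits, the coordinator bargains only among those that
# remain, which is one way the level of updatelessness gets bounded:
print(coalition_choice(["spec_autonomy"]))  # -> "policy_x"
```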