On the Formal Definition of Alignment
I want to retain the ability to update my values over time, but I don’t want those updates to be the result of manipulative optimization by a superintelligence. Instead, the superintelligence should supply me with accurate empirical data and valid inferences, while leaving the choice of normative assumptions—and thus my overall utility function and its proxy representation (i.e., my value structure)—under my control. I also want to engage in value discussions (with either humans or AIs) where the direction of value change is symmetric: both participants have roughly equal probability of updating, so that persuasive force isn’t one-sided. This dynamic can be formally modeled as two agents with evolving objectives or changing proxy representations of their objectives, interacting over time.
That’s what alignment means to me: normative freedom, with value changes that are slow and symmetric across agents.
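One way the two-agent dynamic described above could be sketched, purely as a toy model (the `Agent` class, `openness` parameter, and update rule below are illustrative assumptions of mine, not a worked-out formalism):

```python
import random

class Agent:
    """Toy agent whose values are a proxy vector it may revise over time."""

    def __init__(self, name, values, openness=0.5):
        self.name = name
        self.values = list(values)  # proxy representation of the objective
        self.openness = openness    # probability of updating after a discussion

def discuss(a, b, step=0.1):
    """One round of symmetric value discussion: each side updates with
    roughly equal probability, and each update is small (slow evolution)."""
    for agent, partner in ((a, b), (b, a)):
        if random.random() < agent.openness:
            agent.values = [
                v + step * (pv - v)
                for v, pv in zip(agent.values, partner.values)
            ]

# Two agents with different normative assumptions but equal openness.
alice = Agent("A", [1.0, 0.0])
bob = Agent("B", [0.0, 1.0])
for _ in range(20):
    discuss(alice, bob)
print(alice.values, bob.values)  # both drift a little; neither is unilaterally overwritten
```

The point of giving both agents the same `openness` is that persuasive force is two-sided: neither party's proxy representation is simply optimized into agreement with the other's.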
In rare cases, strategic manipulation might be justified—e.g., if an agent’s values are extremely dangerous—but that would be a separate topic involving the deliberate use of misalignment, not alignment itself.
A natural concern is whether high intelligence and full information would cause agents to converge on the same values. But convergence is not guaranteed if agents differ in their terminal goals or if their value systems instantiate distinct proxy structures.
Still, suppose a superintelligence knows my values precisely (whether fixed or dynamically updated). It can then compute the optimal policy for achieving them and explain that policy to me. If I accept its reasoning, I follow the policy not due to coercion but because it best satisfies my own value function. In such a world, each agent can be helped to succeed according to their own values, and since utility isn’t necessarily zero-sum, widespread success is possible. This scenario suggests a pre-formal notion of alignment: the AI enables agents to achieve their goals by supplying accurate world models and optimal plans under user-specified normative assumptions, without hijacking or implicitly rewriting those assumptions.
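A minimal sketch of that division of labor, with made-up names (`advise`, `my_utility`) standing in for whatever the real interface would be: the AI ranks candidate plans under the user's own utility function and has no way to rewrite it.

```python
from typing import Callable, Iterable

def advise(user_utility: Callable[[str], float], candidate_plans: Iterable[str]) -> str:
    """Return the candidate plan that best satisfies the user's own utility.

    The advisor only evaluates plans against user_utility; it never modifies
    or replaces that function. The normative assumptions stay with the user.
    """
    return max(candidate_plans, key=user_utility)

def my_utility(plan: str) -> float:
    # Normative assumptions live here, supplied and owned by the user.
    return {"save": 0.2, "spend": 0.1, "donate": 0.9}[plan]

print(advise(my_utility, ["save", "spend", "donate"]))  # "donate": chosen by my values, not the AI's
```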
I think this ends up being the same thing as CEV…
This is not the same as CEV. CEV involves the AI extrapolating a user’s idealized future values and acting to implement them, even overriding current preferences if needed, whereas my model forbids that. In my framework, the AI never drives or predicts value change; it simply provides accurate world models and optimal plans based on the user’s current values, which only the user can update.
CEV also assumes convergence; my model protects normative autonomy and allows value diversity to persist.
CEV extrapolates the volition of humanity; that’s one reason it has to be “coherent”.
In your proposal, people have autonomy, but this principle can be violated in “extremely dangerous” situations. People are free to do what they want (“volition”)… but their AI advisors look ahead (“extrapolated”)… and people are not allowed to exercise their freedom so as to jeopardize the freedom of others (“coherent”).