Consider the subset of “human values” that we’d be “happy” (were we fully informed) for powerful systems to optimise for.
[Weaker version: “the subset of human values that it is existentially safe for powerful systems to optimise for”.]
Let’s call this subset “ideal values”.
I’d guess that the “most natural” abstraction isn’t “ideal values” themselves but something like “the minimal latents of ideal values”.
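To gesture at what “minimal latents” means here, a rough sketch in the spirit of Wentworth’s natural/minimal-latents framing (my paraphrase of that framework, not a definition from this post; the modelling of values as variables X_1, …, X_n is an illustrative assumption):

```latex
% Hedged sketch, assuming a Wentworth-style minimal-latents setup.
% Model instances of "ideal values" as random variables X_1, ..., X_n
% (e.g. the ideal values of different people, or in different contexts).
% A latent \Lambda "mediates" the X_i if they are independent given \Lambda:
P(X_1, \dots, X_n \mid \Lambda) \;=\; \prod_{i=1}^{n} P(X_i \mid \Lambda)
% The *minimal* latent is, roughly, the smallest such \Lambda: any other
% latent \Lambda' that also mediates the X_i carries at least the
% information in \Lambda. So \Lambda keeps only the structure shared
% across the X_i and discards person- or context-specific detail.
```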