I think the NAH does a lot of work for interpretability of an AI’s beliefs about things that aren’t values, but I’m pretty skeptical about the “human values” natural abstraction. I think the points made in this post are good, and relatedly, I don’t want the AI to be aligned to “human values”; I want it to be aligned to my values. There’s a pretty big gap between my values and those of the average human, even after something like CEV, and that’s probably true for other LW/EA types as well. Human values as they exist in nature contain fundamental terms for the in-group, disgust-based values, etc.
If your values don’t happen to have the property of giving the world back to everyone else, building an AGI with your values specifically (when there are no other AGIs yet) amounts to taking over the world. Hence “human values”: an objective that would share influence by design, a universalizable target everyone can agree to work towards.
On the other hand, succeeding in directly (without AGI assistance) building aligned AGIs with fixed preferences seems much less plausible (in time to prevent AI risk) than building task AIs that create uploads of specific people (a particularly useful application of strawberry alignment), in order to bootstrap alignment research that’s actually up to the task of aligning preferences (ambitious alignment). And those uploads are agents of their own values, not human values, which is a governance problem.