If your values don’t happen to have the property of giving the world back to everyone else, building an AGI with your values specifically (when there are no other AGIs yet) is taking over the world. Hence human values: something that would share influence by design, a universalizable objective everyone can agree to work towards.
On the other hand, succeeding in directly (without AGI assistance) building aligned AGIs with fixed preferences seems much less plausible (in time to prevent AI risk) than building task AIs that create uploads of specific people (a particularly useful application of strawberry alignment), to bootstrap alignment research that’s actually up to the task of aligning preferences (ambitious alignment). But those uploads are agents of their own values, not human values, which makes them a governance problem.