So I guess more specifically what I’m trying to ask is: how do we distinguish between interpreting the good thing as “human intentions for the agent” versus “human goals”?
In other words, we have at least four options here:
1. AI intends to do what the human wants it to do.
2. AI actually achieves what the human wants it to do.
3. AI intends to pursue the human’s true goals.
4. AI actually achieves the human’s true goals.
So right now intent alignment (as specified by Paul) describes 1, and outcome alignment (as I’m inferring from your description) describes 4. But it seems quite important to have a name for 3 in particular.
I would use ‘outcome alignment’ for 2 (and agree with ‘intent alignment’ for 1). In other words, I see the important distinction between ‘outcome’ and ‘intent’ as lying in the first half of each option (intends vs. actually achieves), not the second (what the human wants vs. the human’s true goals).
I’d be inclined to see 3 and 4 as variations on 1 and 2 where what the human wants is for the AI to figure out some notion of their true goals and pursue/achieve that.