That would imply that ‘intent alignment’ is about aligning AI systems with what humans intend. But ‘intent alignment’ is about making AI systems intend to ‘do the good thing’. (Where ‘do the good thing’ could be cashed out as ‘do what some or all humans want’, ‘achieve humans’ goals’, or many other things.)
The thing I usually contrast with ‘intent alignment’ (≈ the AI’s intentions match what’s good) is something like ‘outcome alignment’ (≈ the AI’s causal effects match what’s good). As I personally think about it, the value of the former category is that it’s cleaner and more natural, while being broad enough to include what I’d consider the most important problems today; the value of the latter category is that it’s closer to what we actually care about as a species.
‘Outcome alignment’ as defined above also has the problem that it doesn’t distinguish alignment work from capabilities work. In my own head, I would think of research as more capabilities-flavored if it helps narrow the space of possible AGI outcomes to outcomes that are cognitively harder to achieve; and I’d think of it as more alignment-flavored if it helps narrow the space of possible AGI outcomes to outcomes that are more desirable within a given level of ‘cognitively hard to achieve’.
So I guess more specifically what I’m trying to ask is: how do we distinguish between interpreting the good thing as “human intentions for the agent” versus “human goals”?
In other words, we have at least four options here:
1. AI intends to do what the human wants it to do.
2. AI actually achieves what the human wants it to do.
3. AI intends to pursue the human’s true goals.
4. AI actually achieves the human’s true goals.
So right now intent alignment (as specified by Paul) describes 1, and outcome alignment (as I’m inferring from your description) describes 4. But it seems quite important to have a name for 3 in particular.
I would use ‘outcome alignment’ for 2 (and agree with ‘intent alignment’ for 1). In other words, I see the important distinction between ‘outcome’ and ‘intent’ being in the first part of the options, not the second.
I’d be inclined to see 3 and 4 as variations on 1 and 2 where what the human wants is for the AI to figure out some notion of their true goals and pursue/achieve that.