One problem I have with the instruction-following frame is that it feels like an attempt to sidestep the difficulties of aligning to a set of values. But I don’t think this works, as perfect instruction-following may actually be equivalent to aligning to the Principal’s values.
What we want from an instruction-following system is one that does what we mean rather than simply what we say. So, rather than ‘Do what the Principal says’, the alignment target is really ‘Do what the Principal’s values imply they mean’. And if an AI understands these values perfectly and is properly motivated to act according to them, that is functionally the same as it having those values itself.
If done correctly, this would solve the corrigibility problem, as all instructions would carry an implicit ‘I mean for you to stop if asked’ clause.
Would it make sense to think of this on a continuum, where on one end you have basic, relatively naive instruction-following that is easy to implement (e.g. helpful LLMs) and on the other you have perfect instruction following that is completely aligned to the Principal’s values?
Definitely agree that the implicit “Do what they say [in a way that they would want]” sneaks the problems of value learning into what some people might have hoped was a value-learning-free space. Just want to split some hairs on this:
if an AI understands these values perfectly and is properly motivated to act according to them, that is functionally the same as it having those values itself.
I think this ignores that there are multiple ways to understand humans, what human preferences are, what acting according to them is, etc. There’s no policy that would satisfy all value-shaped interpretations of the user, because some of them conflict. This gives us some wiggle room to imagine different ways of resolving those conflicts, some of which will look more like instruction-following and others that will look more like autonomous action.
Do you have any quick examples of value-shaped interpretations that conflict?
This gives us some wiggle room to imagine different ways of resolving those conflicts, some of which will look more like instruction-following and others that will look more like autonomous action.
So perhaps the level of initiative the AI takes? E.g. a maximally initiative-taking AI might respond to ‘fetch me coffee’ by reshaping the Principal’s life so they get better sleep and no longer want the coffee.
I think my original reference to ‘perfect’ value understanding is obscuring these tradeoffs (perhaps unhelpfully), as in theory it includes knowledge of how the Principal would want interpretive conflicts managed.
Do you have any quick examples of value-shaped interpretations that conflict?
Someone trying but failing to quit smoking. On one interpretation, they don’t really want to smoke, smoking is some sort of mistake. On another interpretation, they do want to smoke, the quitting-related behavior is some sort of mistake (or has a social or epistemological reason).
This example stands in for other sorts of “obvious inconsistency,” biases that we don’t reflectively endorse, etc. But also consider cases where humans say they don’t want something but we (outside the thought experiment) think they actually do want that thing! A possible example is people who say they would hate a post-work world because they want to keep doing work so they have purpose. Point is, the verbal spec isn’t always right.
The interpretation “Humans want to follow the laws of physics,” versus an interpretation that’s a more filled-in version of “Humans want to do a bunch of human-scale things like talking to humans, eating good food, interacting with nature, learning about the world, etc.” The first is the limit of being more predictive at the cost of having a more complicated model of humans, and as you can tell, it sort of peters out into explaining everything but having no push towards good stuff.
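As a minimal toy sketch of that kind of conflict (all names and numbers below are illustrative assumptions, nothing more): treat the two readings of the smoker as candidate utility functions over the same actions, and note that no single action is optimal under both, so “do what their values imply” underdetermines the policy.

```python
# Toy sketch: two candidate "value-shaped" interpretations of the same person,
# scored over the same actions. The utilities are made up for illustration.

actions = ["smoke", "abstain"]

# Interpretation A: the quitting behavior reveals the real preference.
u_quit = {"smoke": -1.0, "abstain": +1.0}

# Interpretation B: the smoking behavior reveals the real preference.
u_smoke = {"smoke": +1.0, "abstain": -1.0}

best_under_a = max(actions, key=u_quit.get)   # "abstain"
best_under_b = max(actions, key=u_smoke.get)  # "smoke"

# No single action is optimal under both interpretations, so an AI told to
# "act on the Principal's values" still has to resolve the conflict somehow.
print(best_under_a, best_under_b, best_under_a == best_under_b)
```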
Yes, I think it does make sense to think of this as a continuum, something I haven’t emphasized to date. There’s also at least one more dimension, that of how many (and which) humans you’re trying to align to. There’s a little more on this in Conflating value alignment and intent alignment is causing confusion.
IF is definitely an attempt to sidestep the difficulties of value alignment, at least partially and temporarily.
What we want from an instruction-following system is exactly what you say: one that does what we mean, not what we say. And getting that perfectly right would demand a perfect understanding of our values. BUT it’s much more fault-tolerant than a value-aligned system. The Principal can specify what they mean as much as they want, and the AI can ask for clarification as much as it thinks it needs to, or in accord with the Principal’s previous instructions to “check carefully about what I meant before doing anything I might hate” or similar.
If done correctly, value alignment would solve the corrigibility problem. But that seems far harder than using corrigibility in the form of instruction-following to solve the value alignment problem.