Yes, I think it does make sense to think of this as a continuum, something I haven’t emphasized to date. There’s also at least one more dimension: how many (and which) humans you’re trying to align to. There’s a little more on this in Conflating value alignment and intent alignment is causing confusion.
IF is definitely an attempt to sidestep the difficulties of value alignment, at least partially and temporarily.
What we want from an instruction-following system is exactly what you say: one that does what we mean, not what we say. And getting that perfectly right would demand a perfect understanding of our values. BUT it’s much more fault-tolerant than a value-aligned system. The Principal can specify what they mean as much as they want, and the AI can ask for clarification as much as it thinks it needs to, or in accord with the Principal’s previous instructions to “check carefully about what I meant before doing anything I might hate” or similar.
If done correctly, value alignment would solve the corrigibility problem. But that seems far harder than using corrigibility in the form of instruction-following to solve the value alignment problem.