I agree that instruction-following is deeply worrying as an alignment target. But it seems to be what developers will use, and wishing it otherwise won't make it so. They're also right that current LLM alignment "solutions" are shaped around values. I don't think those are actual solutions to the hard part of the alignment problem. If developers use something like HHH as a major component of alignment for an AGI, I think that target and others like it have pretty obvious outer alignment failure modes, and probably less obvious inner alignment problems, so that approach simply fails.
I think and hope that as they approach AGI and take the alignment problem somewhat seriously, developers will try to make instruction-following the dominant component. Instruction-following does seem to help non-trivially with the hard part, because it provides corrigibility and other useful flexibility. See those links for more.
That is a complex and debatable argument, but the argument that developers will pursue intent alignment seems simpler and stronger. Having an AGI that primarily does your bidding seems both safer and more in your self-interest than one that makes fully autonomous decisions based on some attempt to define everyone's ideal values.
"This is what's happening and we're not going to change it" isn't helpful, both because it amounts to saying we're all going to die, and because it fails to specify what we'd like to have happen instead. We're not proposing a specific course for influencing AI developers; we're first trying to figure out what future we'd want.