The original alignment thinking held that explaining human values to AGI would be really hard.
The difficulty was suggested to be in getting an optimizer to care about what those values are pointing to, not merely to understand them[1]. If in some instances the values mapped to doing something unwise, an optimizer that understood those values might still fail to be constrained away from doing it. Getting a system to use extrapolated preferences as behavioral constraints is a deeper problem than getting a system to reflect surface preferences. The high p(doom) estimates partly follow from expecting that an aligned AI will have to be used to prevent future misaligned or misused AI, and that doing something so high-impact would require unsafe behaviors in a system not aligned to reflectively coherent and endorsed extrapolated preferences.
Agreed. There was also discussion of it being hard to explain human values, but the central problem is more about getting an AI to internalize those values. On a superficial level that currently seems to be mostly working (making the assistant HHH), but a) it doesn't work reliably (jailbreaks); b) it's not at all clear whether those values are truly internalized (e.g. models failing to be corrigible or exhibiting unintended values); and c) it seems to be working less well with each generation as we apply increasing amounts of RL to LLMs.
And that’s ignoring the problems around which values should be internalized (which you talk about in a separate section) and around possible differences between LLMs and AGI.
In The Hidden Complexity of Wishes, the point wasn't that the genie won't understand what you meant; it was that the genie won't care what you meant.