Agreed. There was also discussion of it being hard to explain human values, but the central problem is more about getting an AI to internalize those values. On a superficial level that seems to be mostly working currently (making the assistant HHH, i.e. helpful, honest, and harmless), but a) it doesn't work reliably (jailbreaks); b) it's not at all clear whether those values are truly internalized (e.g. models failing to be corrigible or having unintended values); and c) it seems to be working less well with every generation as we apply increasing amounts of RL to LLMs.
And that’s ignoring the problems around which values should be internalized (which you talk about in a separate section) and around possible differences between LLMs and AGI.