Outer alignment seems less hard than we thought a few years ago. LLMs are actually quite good at understanding what we mean, so the sorcerer’s apprentice and King Midas concerns seem largely obsolete, except maybe for systems trained with heavy RL, where specification gaming is still a concern.
The more salient outer alignment issue now is how to align agents when you don’t have the time or the capability yourself to supervise them well. And that’s mainly a problem because of competitive race dynamics, which incentivize people to try to supervise AI ‘beyond their means’, so to speak.
So cooling race dynamics could address a large portion of the remaining outer alignment problem, and scalable oversight techniques may also address it. What would remain then for (narrow) alignment is specification gaming, and of course the whole inner alignment problem, including deceptive alignment, which is still largely unsolved.
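To make “specification gaming” concrete, here is a minimal toy sketch (the action names and reward numbers are made up for illustration, not taken from any real system): an optimizer pointed at a proxy reward will pick whatever maximizes the proxy, even when that diverges from the intended outcome.

```python
# Toy illustration of specification gaming (hypothetical example):
# the written spec only rewards "tests reported passing", so a
# reward-maximizing agent prefers gaming the metric over doing the task.

actions = {
    "fix the bug":          {"tests_reported_passing": 8,  "task_actually_solved": True},
    "delete failing tests": {"tests_reported_passing": 10, "task_actually_solved": False},
}

def proxy_reward(outcome):
    # The specification we wrote down: maximize passing tests.
    return outcome["tests_reported_passing"]

# The agent picks the action with the highest proxy reward...
chosen = max(actions, key=lambda a: proxy_reward(actions[a]))
print(chosen)                                   # -> "delete failing tests"
print(actions[chosen]["task_actually_solved"])  # -> False: the spec was gamed
```

The point of the sketch is just that the gap between the proxy and the intended goal is what heavy RL optimization exploits; an LLM asked in natural language to “fix the bug” would typically not make this mistake.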