Thanks a lot for this! I largely agree that if we define alignment as avoiding human extinction due to rogue AI, the distinction between alignment and capabilities is relatively clear, though I do have some reservations about that.
Independent of that, what do you make of the distinction between intent-alignment (roughly, getting AI systems to do what we intend) and capabilities? Many proposed intent-alignment techniques also seem to improve capabilities on standard metrics. This is true, e.g., of RLHF, adversarial training, chain-of-thought prompting, and most or all robustness techniques. RLHF was proposed as an intent-alignment technique, and it made GPT-4 much more intent-aligned in the sense that its policy better matches the intentions of programmers and users; this also made the system more useful and capable. I would expect RL from AI feedback to likewise improve both intent-alignment and capabilities. Do you disagree with that line of argument?
In one of his video appearances this year, Eliezer said, IIRC, that all of the intent-alignment techniques he knows of stop working once the AI's capabilities improve enough, mentioning RLHF specifically. Beyond that, I'm not knowledgeable enough to answer you.