Between alignment and capabilities, which is the main bottleneck to getting value out of GPT-like models, both in the short term and the long(er) term?
We can probably survive in the following way:
RL becomes the main way to get new, and especially superhuman, capabilities.
Because RL pushes models hard toward reward hacking, it’s difficult to reliably get them to do anything that is hard to verify. Models can pull off impressive feats, but nobody is stupid enough to put an AI into positions that genuinely carry responsibility.
This situation makes plain how difficult alignment is, and everybody moves toward verifiable rewards or similar approaches. Capabilities progress becomes dependent on alignment progress.
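To make the ‘verifiable rewards’ point concrete, here is a minimal sketch in Python (all names are illustrative, not taken from any particular RL stack) of a reward that can only be earned by actually satisfying a machine-checkable criterion, in this case unit tests, rather than by merely looking good to a grader:

```python
import subprocess
import sys
import tempfile
import textwrap


def verifiable_reward(candidate_code: str, test_code: str) -> float:
    """Return 1.0 iff the model's solution passes the task's unit tests.

    Because the reward is grounded in an objective check, the model cannot
    raise it by merely *looking* correct to a rater.
    """
    program = textwrap.dedent(candidate_code) + "\n" + textwrap.dedent(test_code)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
    except subprocess.TimeoutExpired:
        return 0.0
    # Any failed assertion or crash yields zero reward.
    return 1.0 if result.returncode == 0 else 0.0


# Toy usage: a task whose success criterion is machine-checkable.
solution = """
def add(a, b):
    return a + b
"""
tests = """
assert add(2, 3) == 5
assert add(-1, 1) == 0
"""
print(verifiable_reward(solution, tests))  # 1.0
```

Swap the test run for a human rating or a learned reward model and you get exactly the kind of hard-to-verify signal that the reward-hacking worry above is about.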
The kind of ‘alignment technique’ that successfully points a dumb model in roughly the direction of the task you want early in training does not necessarily connect in any straightforward way to the kind of ‘alignment technique’ that will keep a model pointed quite precisely in the direction you want once it gets smart and self-reflective.
For a maybe not-so-great example: the RL-like reward signals in the human brain used to successfully train and aim human cognition, from infancy onward, at reproductive fitness. Before the distributional shift, our brains usually neither got completely stuck in reward-hack loops nor spent their cognitive labour on things entirely unrelated to reproductive fitness. After the distributional shift, our brains still don’t get stuck in reward-hack loops very often, and we still train up to intelligent adulthood. But the alignment with reproductive fitness is gone, or at least far weaker.
I mostly think about alignment methods like “model-based RL that maximizes reward iff it outputs an action which is provably good under our specification of good”.
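A crude sketch of that shape (Python; SpecCheckedReward, checker, Action, and Certificate are hypothetical placeholders for a trusted verifier and a formal specification, not an existing system). The point of the shape is that the only path to reward runs through the verifier, so ‘scoring well’ and ‘being provably good under the spec’ coincide by construction; the hard part, obviously, is having a real spec and a real checker, which the toy below is not:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical placeholders: `Action` is whatever the policy outputs and
# `Certificate` is a machine-checkable argument (e.g. a formal proof) that the
# action satisfies our specification of "good". Neither is a real library type.
Action = str
Certificate = str


@dataclass
class SpecCheckedReward:
    """Reward is 1 iff a trusted checker accepts the action's certificate
    under our specification of 'good'; otherwise 0."""

    checker: Callable[[Action, Certificate], bool]

    def __call__(self, action: Action, certificate: Certificate) -> float:
        try:
            return 1.0 if self.checker(action, certificate) else 0.0
        except Exception:
            # A certificate that confuses or crashes the checker counts as
            # unverified: no proof, no reward.
            return 0.0


# Toy checker standing in for a real verifier: "good" means the action is a
# sorted list of integers and the certificate restates it verbatim.
def toy_checker(action: Action, certificate: Certificate) -> bool:
    xs = [int(t) for t in action.split(",")]
    return xs == sorted(xs) and certificate == action


reward = SpecCheckedReward(toy_checker)
print(reward("1,2,3", "1,2,3"))  # 1.0
print(reward("3,1,2", "3,1,2"))  # 0.0: not provably good under the spec
```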
Relevant: Alignment as a Bottleneck to Usefulness of GPT-3