The kind of ‘alignment technique’ that successfully points a dumb model in the rough direction of the task you want during early training need not connect in any straightforward way to the kind of ‘alignment technique’ that keeps a model pointed quite precisely in the direction you want once it gets smart and self-reflective.
For a maybe not-so-great example: the RL reward signals in the human brain used to successfully train human cognition from infancy and aim it at reproductive fitness. Before the distributional shift from the ancestral environment to the modern one, our brains usually neither got completely stuck in reward-hack loops nor spent their cognitive labour on something completely unrelated to reproductive fitness. After the shift, our brains still don’t get stuck in reward-hack loops that much, and we still train successfully to intelligent adulthood. But the alignment with reproductive fitness is gone, or at least far weaker.
I mostly think about alignment methods like “model-based RL that gets maximal reward iff it outputs an action that is provably good under our specification of good”.
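
To make the shape of that reward rule concrete, here is a toy sketch, not a real proof-based scheme. Everything in it is an assumption for illustration: `meets_spec` is a hypothetical stand-in for "our specification of good" (here just "the action is the inputs in ascending order"), and a plain boolean check stands in for a proof checker.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for "our specification of good": the action must be
# the input numbers in ascending order. A real scheme would check a formal
# proof against a formal specification instead of running this test.
def meets_spec(inputs: Sequence[int], action: Sequence[int]) -> bool:
    return list(action) == sorted(inputs)

def reward(
    inputs: Sequence[int],
    action: Sequence[int],
    checker: Callable[[Sequence[int], Sequence[int]], bool] = meets_spec,
) -> float:
    # Maximal reward iff the checker accepts the action; zero otherwise.
    # There is no partial credit for actions that merely look close.
    return 1.0 if checker(inputs, action) else 0.0
```

The point of the all-or-nothing shape is that the reward depends only on whether the check passes, not on a learned or heuristic proxy for goodness; the hard part, which this sketch skips entirely, is writing the specification and the checker.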