So, I have some quibbles, some pessimistic, some optimistic.
The main pessimistic one is that the nice-thing-that-current-language-models-have-if-you-don’t-RL-them-too-hard, their de facto alignment, is probably not the final word on alignment that we just need to safeguard as contexts change and capabilities increase. I think it’s the wrong modeling assumption to say we’ll start with aligned transformative AI and then just need to keep RL from messing it up.
But this has an optimistic flip side, which is that if we do have better alignment schemes to apply to future AI, they can take into account the weaknesses of fine-tuning a predictive model and try to correct for them.
On “breaking things,” it seems like reverting towards the base model behavior is the default expected consequence of breaking fine-tuning. In the current paradigm, I wouldn’t expect this to lead to misaligned goals (though probably some incoherent bad behavior). In a different architecture maybe the story is different (whoops, we broke the value function in model-based RL but didn’t break the environment model).
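To make that parenthetical concrete, here's a minimal toy sketch (everything in it is hypothetical, purely for illustration): an agent that plans by querying a learned environment model and scoring predicted outcomes with a learned value function. Corrupt the value function while leaving the model intact, and you get coherent pursuit of the wrong target rather than incoherent flailing.

```python
# Toy sketch (hypothetical names, toy dynamics): a model-based agent that
# plans by rolling a learned environment model forward one step and scoring
# the predicted next state with a learned value function. Corrupting the
# value function while the environment model stays intact yields *coherent*
# pursuit of garbage values, not random noise.

GOAL = 10          # the "aligned" objective: reach state 10
ACTIONS = (-1, 0, +1)

def env_model(state, action):
    """Stand-in for a learned environment model (here: exact toy dynamics)."""
    return max(0, min(10, state + action))

def intact_value(state):
    """Intact value function: prefers states near the goal."""
    return -abs(GOAL - state)

def broken_value(state):
    """Corrupted value function: a coherent but garbage preference for state 3."""
    return -abs(3 - state)

def plan(state, value_fn):
    """Greedy one-step planner: query the env model for each action and
    pick the action whose predicted next state the value function likes best."""
    return max(ACTIONS, key=lambda a: value_fn(env_model(state, a)))

for name, value_fn in [("intact", intact_value), ("broken", broken_value)]:
    state = 5
    for _ in range(8):
        state = env_model(state, plan(state, value_fn))
    print(f"{name} value function -> agent settles at state {state}")
# intact -> 10 (the goal); broken -> 3 (wrong target, pursued coherently)
```

The point being that the coherence comes from the intact planning machinery, so which component breaks determines whether you get noise or a coherent new goal.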
If you’re worried about coherent bad behavior because we’ll be doing RL on task completion, that doesn’t sound like drift to me; it sounds like doing RL on a non-alignment goal and (no surprise) getting non-aligned AI.
On an unrelated note, I was also reminded of the phenomenon of language drift after RL, e.g. see Jozdien’s recent post, or the reports about math-finetuned LLMs drifting.