I think the biggest alignment-relevant update is that I expected RL fine-tuning over longer horizons (or even model-based RL a la AlphaZero) to be a bigger deal. I was really worried about it significantly improving performance and making alignment harder. In 2018-2019 my mainline picture was more like AlphaStar or AlphaZero, with RL fine-tuning being the large majority of compute. I’ve updated about this and definitely acknowledge I was wrong.[3] I don’t think it totally changes the picture though: I’m still scared of RL, I think it is very plausible it will become more important in the future, and I think that even the kind of relatively minimal RL we do now can introduce many of the same risks.
Curious to hear how you would revisit this prediction in light of reasoning models? It seems like you weren’t as wrong as you thought a year ago, but maybe you still think there are some key ways your predictions about RL fine-tuning were off?