Hastings comments on Daniel Kokotajlo’s Shortform

Hastings 19 Feb 2025 1:43 UTC
3 points
−3
I have a hypothesis: someone (probably OpenAI) got reinforcement learning to actually start putting new capabilities into the model with their Strawberry project. Up to this point, RL had only been eliciting capabilities the model already had. But getting a new capability this way is horrifically expensive: roughly, it takes hundreds of rollouts to set one weight, whereas language-modelling loss sets a weight every few tokens. The catch is that as soon as any reinforcement-learned model acts in the world at all, every other language model can clone the reinforcement-learned capability by training on anything causally downstream of the lead model’s actions (and then eliciting it). A capability that took a thousand rollouts to learn leaks as soon as the model takes a few hundred tokens’ worth of action.
This hypothesis predicts that the R1 training algorithm won’t work to boost AIME scores on any model trained with an enforced 2023 data cutoff (specifically, on any model with no synthetically generated GPT-4o tokens in its training data; I think GPT-4o is causally downstream of the Strawberry breakthrough).
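To make the cost asymmetry in the hypothesis concrete, here is a rough back-of-envelope sketch. Every number in it is an illustrative assumption, not a measurement: "hundreds of rollouts" is taken as 200, "every few tokens" as 4, and the rollout length of 1,000 tokens is purely hypothetical. The function names are made up for this example.

```python
# Back-of-envelope comparison of RL vs. language-modelling sample cost.
# All constants are illustrative assumptions, not measured values.
ROLLOUTS_PER_WEIGHT_RL = 200   # stand-in for "hundreds of rollouts to set one weight"
TOKENS_PER_ROLLOUT = 1_000     # hypothetical rollout length
TOKENS_PER_WEIGHT_LM = 4       # stand-in for "a weight every few tokens"


def tokens_per_weight_rl(rollouts_per_weight: int, tokens_per_rollout: int) -> int:
    """Tokens generated per weight set when learning via RL rollouts."""
    return rollouts_per_weight * tokens_per_rollout


def cost_ratio(rl_tokens: int, lm_tokens: int) -> float:
    """How many times more tokens RL needs per weight than LM loss."""
    return rl_tokens / lm_tokens


rl_tokens = tokens_per_weight_rl(ROLLOUTS_PER_WEIGHT_RL, TOKENS_PER_ROLLOUT)
ratio = cost_ratio(rl_tokens, TOKENS_PER_WEIGHT_LM)
print(f"RL: {rl_tokens} tokens per weight; LM loss: {TOKENS_PER_WEIGHT_LM} tokens per weight")
print(f"Under these assumptions, RL is about {ratio:.0f}x more expensive per weight")
```

Under these made-up numbers the gap is several orders of magnitude, which is the shape of the argument: discovering a capability is expensive, while copying it from downstream text is cheap. The exact ratio depends entirely on the assumed constants.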