I’m getting somewhat confused about information-theoretic arguments around RL scaling. What makes sense to me is this: information density per token is roughly constant in pre-training, no matter how long you make contexts, but it decreases as 1/n as you make RL trajectories longer, since a whole trajectory of n tokens only yields a single reward signal. This means that if you look just at scaling up context/trajectory length, RL should get asymptotically less efficient. A toy calculation is sketched below.
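Here is a minimal sketch of that comparison, with toy numbers I’ve picked purely for illustration (the ~1 bit/token figure for pre-training and the binary-reward assumption are my own simplifications, not anything measured):

```python
import math

# Pre-training supervision is a full next-token target at every position,
# so the usable signal per token stays roughly constant with context length.
cross_entropy_bits_per_token = 1.0  # assumed order-of-magnitude value

# An RL episode with a K-outcome (e.g. pass/fail) reward conveys at most
# log2(K) bits for the *whole* trajectory, so bits per token fall off as 1/n.
def rl_bits_per_token(trajectory_len_tokens: int, reward_outcomes: int = 2) -> float:
    return math.log2(reward_outcomes) / trajectory_len_tokens

for n in [1_000, 10_000, 100_000]:
    print(f"trajectory length {n:>7}: "
          f"pre-training ~ {cross_entropy_bits_per_token:.3f} bits/token, "
          f"RL ~ {rl_bits_per_token(n):.6f} bits/token")
```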
What’s not clear to me is the relationship between “bits getting into the weights” and capabilities. Using the information-theoretic argument above, you’d probably get that in o3, something like one millionth of the information in the weights comes from RL (I’m not sure of the exact figure). But o3’s advance in capabilities over 4o seems clearly far larger than a one-in-a-million factor would suggest. I think this would still be true if you worked to disentangle inference-time scaling from RL scaling, e.g. by comparing o1 vs o3: the number of bits in o3 over o1 is very small, but thinking for the same amount of time, the difference in capability is very noticeable.
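For concreteness, here is the kind of back-of-envelope arithmetic that would produce a “roughly one millionth” figure. Every number below is a made-up round placeholder (corpus size, episode count, bits per sample are not publicly known); the point is only the shape of the calculation:

```python
# Assumed pre-training signal: corpus size times ~1 bit of signal per token.
pretrain_tokens = 1e13
bits_per_pretrain_token = 1.0

# Assumed RL signal: number of trajectories times ~1 bit per binary reward.
rl_episodes = 1e7
bits_per_episode = 1.0

pretrain_bits = pretrain_tokens * bits_per_pretrain_token
rl_bits = rl_episodes * bits_per_episode

# Fraction of weight-information attributable to RL under these assumptions.
print(f"RL fraction ~ {rl_bits / (pretrain_bits + rl_bits):.1e}")
# -> about 1e-06, i.e. on the order of one millionth
```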