But is this true? According to roon on X, this doesn’t apply to e.g. model personality:
Each time you train a model, you might change nothing about the dataset, and then run a new RL seed and you would have a slightly different personality. It’s because there is some variance in the training process. It’s random—you’re taking a random walk through model space. We can’t even reproduce a personality in the same training run that easily, much less across all time … It’s a very difficult question internally [at OpenAI]. We do try to minimize the personality drift, because people come to love the models, but it’s a very hard problem.
Anecdotally, I’ve heard that the same is true for other capabilities at the labs. The papers referenced in your essay seem like weak evidence to the contrary. For example, the Universal Geometry paper studies small models (BERT, T5) with fewer than 1B parameters, trained with 4-5 OOMs less compute than frontier LLMs. It’s also unclear how impressive the claimed cosine similarity range of 0.75-0.92 is; I would guess that the representation transfer is quite lossy.
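To give a feel for how lossy that range is, here’s a minimal sketch (synthetic data and a plain least-squares alignment on paired examples, not the paper’s actual setup; all dimensions and noise levels are made up):

```python
# Hypothetical sketch, not the paper's setup: align two embedding spaces with a
# least-squares linear map on synthetic paired data, then ask what a cosine
# similarity in the 0.75-0.92 range actually leaves behind.
import numpy as np

rng = np.random.default_rng(0)
n, d_a, d_b = 4000, 256, 384                  # made-up sample count and embedding dims

A = rng.normal(size=(n, d_a))                 # stand-in for model A's embeddings
M = rng.normal(size=(d_a, d_b)) / np.sqrt(d_a)
B = A @ M + 0.6 * rng.normal(size=(n, d_b))   # model B = shared structure + idiosyncratic noise

# Fit A @ W ~ B and measure per-example cosine similarity of the transfer.
W, *_ = np.linalg.lstsq(A, B, rcond=None)
B_hat = A @ W
cos = np.sum(B_hat * B, axis=1) / (np.linalg.norm(B_hat, axis=1) * np.linalg.norm(B, axis=1))
print(f"mean cosine similarity: {cos.mean():.2f}")   # lands in the 0.8-0.9 range with these settings

# Even after optimally rescaling the transferred vector, the residual is
# (1 - c**2) of the target's squared norm: about 15% at c = 0.92 and about 44%
# at c = 0.75, so a sizeable chunk of each representation isn't carried over.
```

In other words, cosine similarities in that range are consistent with a substantial fraction of each representation simply not transferring, which is what “quite lossy” would look like.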
Reinforcement learning is not the same kind of thing as pretraining because it involves training on your own randomly sampled rollouts, and RL is, generally speaking, more self-reinforcing and biased than other neural net training methods. It’s more likely to get stuck in local maxima (it’s infamous for this, in fact) and doesn’t have quite the same convergence properties as “pretraining on a giant dataset”.
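As a toy illustration of that self-reinforcement (made-up rewards and hyperparameters, nothing like a real lab setup), a REINFORCE-style policy over three near-tied options trains only on its own sampled rollouts, so different seeds lock onto different options:

```python
# Toy sketch: a policy that learns only from its own rollouts reinforces
# whatever it happens to sample early, so the outcome is seed-dependent.
import numpy as np

TRUE_REWARD = np.array([1.00, 0.98, 0.20])      # two nearly-equivalent optima, one bad one

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rl_run(seed, steps=3000, lr=0.5, noise=0.5):
    rng = np.random.default_rng(seed)
    logits = np.zeros(3)
    for _ in range(steps):
        p = softmax(logits)
        a = rng.choice(3, p=p)                      # rollout sampled from the *current* policy
        r = TRUE_REWARD[a] + noise * rng.normal()   # noisy scalar reward for that rollout
        grad = -p
        grad[a] += 1.0                              # gradient of log p(a) w.r.t. the logits
        logits += lr * r * grad                     # REINFORCE update, no baseline
    return int(np.argmax(logits))

print([rl_run(seed) for seed in range(10)])         # typically a seed-dependent mix of 0s and 1s
```

The supervised analogue (fitting the same softmax to a fixed labeled dataset with cross-entropy) is convex in the logits, so every seed recovers essentially the same solution; the seed-dependence above comes entirely from the policy generating its own training data.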
The rise and rise of RL as a fraction of training compute should therefore make us less confident that the convergent representation hypothesis will apply to AGI. (Though it clearly applies to LLMs now.)
Yep, this is probably true for pretraining, but pretraining seems less and less relevant these days. For example, according to the Grok 4 presentation, the model used about as much compute in RL as in pretraining. I’d expect this trend to continue.