This seems similar to the “platonic ideal” model that Sam Altman described in a recent talk:
https://www.reddit.com/r/singularity/s/zzo8NhT9bd
https://m.youtube.com/watch?v=qhnJDDX2hhU (full talk)
Basically, his (and by extension OpenAI’s) ideal model:
is very small and fast
has “super-human reasoning capabilities”
has a very long (“trillion token”) context from which it can retrieve information
has access to a massive number of tools
The big question for these pure reasoning models is obviously how to build them. I think pure reasoning would probably have to be designed into the pretraining, rather than bolted on afterwards. Two very high-level directions in which I could see it being pursued (rough sketches of each follow below):
For each token or sequence in the pretraining corpus, we somehow supply the model with the “knowledge” required to predict the next token, without telling it the “reasoning”. This would force the model’s weights to compress reasoning strategies, while the knowledge is hopefully squeezed out of them because it’s redundant.
Optimize something other than the standard cross-entropy (GPT) loss. Cross-entropy is inherently mean-seeking, meaning that the model is incentivized to know every high-probability next token (which requires knowledge). A mode-seeking loss, such as the reverse KL divergence (the GAN objective) or RL rewards, would incentivize the model to know at least one of the next tokens with high reward or probability, which requires much less knowledge.
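To make these concrete, here are two rough sketches. Both are illustrations of the general idea only; the function names, the masking scheme, and the toy distributions are my own assumptions, not anything from a known recipe.

For the first direction, a minimal sketch of a training step in which a "knowledge" snippet (however it is obtained) is prepended to each training sequence, and the next-token loss is masked so the model is only graded on the original tokens, never on reproducing the knowledge itself:

```python
import torch
import torch.nn.functional as F

IGNORE = -100  # F.cross_entropy skips targets with this value

def knowledge_conditioned_loss(model, tokenizer, knowledge: str, text: str):
    """Next-token loss on `text` only, with `knowledge` visible in the context.

    The model can read the facts it needs from the prefix, so its weights are
    pushed toward learning how to *use* knowledge rather than storing it.
    (Hypothetical sketch; `model` is any causal LM that returns logits.)
    """
    k_ids = tokenizer.encode(knowledge)
    t_ids = tokenizer.encode(text)
    input_ids = torch.tensor([k_ids + t_ids])        # (1, K + T)

    # Shifted next-token targets; positions inside the knowledge prefix are
    # ignored, so there is no gradient for memorizing the prefix itself.
    targets = input_ids.clone()
    targets[:, : len(k_ids)] = IGNORE

    logits = model(input_ids)                        # (1, K + T, vocab)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets[:, 1:].reshape(-1),
        ignore_index=IGNORE,
    )
```

For the second direction, a tiny numerical illustration of mean-seeking vs. mode-seeking (toy distributions, nothing model-specific): a hedging distribution beats a committed one under the forward KL that cross-entropy minimizes, and the ordering flips under the reverse KL:

```python
import numpy as np

# Target p over 10 candidate next tokens: two of them (indices 2 and 7) are plausible.
p = np.full(10, 1e-3); p[2] = p[7] = 0.5; p /= p.sum()

# Two candidate model distributions q.
q_commit = np.full(10, 1e-3); q_commit[2] = 1.0; q_commit /= q_commit.sum()  # knows one good answer
q_hedge = np.full(10, 0.1)                                                    # spreads mass everywhere

kl = lambda a, b: float(np.sum(a * np.log(a / b)))

# Forward KL(p || q) is what cross-entropy pretraining minimizes (up to the entropy of p):
# missing token 7 is punished hard, so hedging wins -> the model must "know" every likely token.
print(kl(p, q_commit), kl(p, q_hedge))   # ~2.7 vs ~1.6

# Reverse KL(q || p) only penalizes putting mass where p has little, so committing to a
# single plausible token wins -> much less knowledge is needed.
print(kl(q_commit, p), kl(q_hedge, p))   # ~0.7 vs ~3.4
```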
I hadn’t seen that; yes, it’s very similar. Good to know I’m thinking along the right lines. Pity I didn’t publish a few days ago and look a lot more prescient :D.
we somehow supply the model with the “knowledge” required

Yes, I think this is a powerful research direction. It’s particularly plausible for distillation: the teacher can supply the knowledge as a suffix to the context. Then, in production, you run the teacher model to produce knowledge, and the student model for all traces beyond that.
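One way the training side of that could look, as a minimal PyTorch sketch (the shapes and the masking scheme here are my assumption, not a known recipe): distill the teacher into the student only on positions after the prompt-plus-knowledge prefix, so the student is graded on reasoning over the supplied knowledge rather than on reproducing it.

```python
import torch
import torch.nn.functional as F

def suffix_distillation_loss(student_logits, teacher_logits, n_context):
    """Per-position KL(teacher || student), counted only after the first
    `n_context` tokens (the prompt plus the teacher-written knowledge suffix).

    Both logits tensors have shape (batch, seq_len, vocab) and come from the
    same token sequence [prompt | knowledge | teacher trace].
    Hypothetical sketch, not a known recipe.
    """
    t = F.log_softmax(teacher_logits, dim=-1)
    s = F.log_softmax(student_logits, dim=-1)
    kl = (t.exp() * (t - s)).sum(-1)      # (batch, seq_len)
    return kl[:, n_context:].mean()       # grade the student only on the trace
```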