Advanced capabilities can be squeezed into small, efficient models that can run on commodity hardware.
This could also work for general intelligence, not only for narrow math/coding olympiad-style problems. The potential of o1/R1 is plausibly constrained for now by the ability to construct oracle verifiers for correctness of solutions, which mostly only works for toy technical problems. Capabilities on such problems are not very likely to generalize to general capabilities, and there are no clear signs so far that this is happening.
But this is a constraint on how the data can be generated, not on how efficiently other models can be retrained using such data to channel the capabilities. If at some point there is a process for generating high-quality training data for general intelligence, that data might also turn out to be effective for cheaply training other models. The R1-generated data used to train the distill models is 800K samples[1], which is probably 1B-10B tokens, less than 0.1% of a typical pretraining dataset.
This is according to the report, though they don't seem to have released this data, so the distill models can't be reproduced by others in the same way DeepSeek made them.
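As a rough back-of-the-envelope check on that 0.1% figure (the tokens-per-sample range and the ~15T-token pretraining corpus size below are my own assumptions, not numbers from the report):

```python
# Back-of-envelope: how big is the R1 distillation dataset relative to pretraining?
# 800K samples is from the report; tokens-per-sample and pretraining-corpus size
# are my own rough assumptions.
samples = 800_000
tokens_per_sample = (1_000, 10_000)  # assumed range for long reasoning traces
pretraining_tokens = 15e12           # assumed ~15T tokens, roughly DeepSeek-V3 scale

for tps in tokens_per_sample:
    distill_tokens = samples * tps
    share = distill_tokens / pretraining_tokens
    print(f"{tps:>6} tok/sample -> {distill_tokens/1e9:.1f}B tokens, {share:.3%} of pretraining")

# ~0.8B-8B tokens, i.e. roughly 0.005%-0.05% of pretraining data -- well under 0.1%
```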
This was my understanding pre-r1. It certainly seems to be the case with the o1 models: better at code and math, not better at philosophy and creative writing.
But something is up with r1. It is unusually good at creative writing. It doesn't seem spiky in the way I predicted.
I notice I am confused.
Possible explanation: r1 seems to have less restrictive 'guardrails' added in post-training. Perhaps this 'light hand at the tiller' means it isn't post-trained into mode collapse. It's closer to a raw base model than the o1 models are.
This is just a hypothesis. There are many unknowns to be investigated.
Post-training consists of two RL stages interleaved with two SFT stages, one of which includes creative-writing data generated by DeepSeek-V3. This might account for the model both being good at creative writing and seeming closer to a raw base model.
Another possibility is that the RL stages are applied very soon after pretraining, with only a minimal cold-start SFT stage before the first one (and none at all for R1-Zero).
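For reference, a minimal sketch of the post-training pipeline as I read the report (stage descriptions are my paraphrase of the report, not exact quotes; only the 800K figure appears above):

```python
# Sketch of the R1 post-training pipeline as described in the report
# (my paraphrase; stage contents are approximate).
R1_POST_TRAINING_STAGES = [
    ("cold-start SFT", "small set of long chain-of-thought examples"),
    ("reasoning RL", "math/code problems scored by rule-based verifiers"),
    ("SFT", "~800K rejection-sampled reasoning traces plus non-reasoning data, "
            "including creative writing generated with DeepSeek-V3"),
    ("all-scenario RL", "reasoning rewards plus general helpfulness/harmlessness preferences"),
]

for name, data in R1_POST_TRAINING_STAGES:
    print(f"{name}: {data}")
```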
Maybe we can regulate data generation?