This was my understanding pre-r1. Certainly this seems to be the case with the o1 models: better at code and math, not better at philosophy and creative writing.
But something is up with r1. It is unusually good at creative writing. It doesn’t seem spiky in the way that I predicted.
I notice I am confused.
Possible explanation: r1 seems to have less restrictive ‘guardrails’ added in post-training. Perhaps this ‘light hand at the tiller’ means it isn’t post-trained toward mode collapse, leaving it closer to a raw base model than the o1 models.
This is just a hypothesis. There are many unknowns to be investigated.
According to the R1 paper, post-training interleaves two SFT stages with two RL stages: a small cold-start SFT, reasoning-focused RL, a second SFT pass whose data includes creative writing generated by DeepSeek-V3, and a final RL stage. That V3-generated data might account for the model both being good at creative writing and seeming closer to a raw base model.
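For concreteness, here is the stage ordering as I read the R1 paper, as a toy sketch. The function and data names are my own placeholders for illustration, not anything from DeepSeek’s actual code:

```python
# Toy sketch of the R1 post-training order as described in the DeepSeek-R1
# paper. All names below are placeholders, not DeepSeek's code or APIs.

def sft(model, data):
    """Stand-in for a supervised fine-tuning pass."""
    return model + [f"sft({data})"]

def rl(model, reward):
    """Stand-in for a reinforcement-learning pass."""
    return model + [f"rl({reward})"]

def post_train_r1(base_model):
    m = sft(base_model, "cold_start_long_cot")   # small cold-start SFT
    m = rl(m, "rule_based_reasoning")            # reasoning-oriented RL
    m = sft(m, "rejection_samples+v3_creative")  # SFT incl. V3-generated creative writing
    m = rl(m, "all_scenarios")                   # final RL over all scenarios
    return m

print(post_train_r1(["pretrained_base"]))
```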
Another possibility: the first RL stage comes very early. R1-Zero applies RL directly to the pretrained base model with no SFT stage at all, and R1 itself gets only the small cold-start SFT before RL.