An interesting and important question.
We do have data on how problem-solving ability scales with reasoning time for a fixed model. That isn't quite your question, but it's related: the improvement is pretty much logarithmic in thinking time, IIRC.
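To make that shape concrete, here's a toy log-linear fit in Python. The numbers are made up for illustration, not real benchmark data; the point is just that under a logarithmic law, each multiplicative increase in thinking time buys a roughly constant additive bump in solve rate:

```python
import numpy as np

# Made-up illustrative numbers, not real benchmark data:
# solve rate improving roughly linearly in log(thinking time).
thinking_seconds = np.array([1, 4, 16, 64, 256])
solve_rate = np.array([0.20, 0.35, 0.50, 0.65, 0.80])

# Fit solve_rate ~ a + b * ln(t) by least squares on log-transformed time.
b, a = np.polyfit(np.log(thinking_seconds), solve_rate, 1)
print(f"solve_rate ~ {a:.2f} + {b:.3f} * ln(seconds)")

# Under this law, every 4x increase in thinking time adds the same amount:
print(f"gain per 4x thinking time: {b * np.log(4):.3f}")
```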
The more important question is how far we can push the technique by which reasoning models are trained: have the model solve a problem with a chain of thought (CoT), then have it look back at its own CoT and ask "how could I have thought that faster?" How far this can be pushed is unclear, at least to those of us outside the main AI labs.
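For concreteness, here's a cartoon of that loop in Python. Everything in it is a hypothetical stand-in I made up (the `sample_cot` stub, the length-penalized reward, the scalar "skill" update), not anyone's actual training setup; it just shows the basic structure of rewarding correct answers while penalizing long chains of thought:

```python
import random

def sample_cot(problem, policy):
    # Stand-in: a real system would sample a chain of thought from the model.
    length = random.randint(10, 100)               # tokens of "thinking"
    answer_correct = random.random() < policy["skill"]
    return {"length": length, "correct": answer_correct}

def reward(cot, length_penalty=0.001):
    # Reward correct answers; penalize long CoTs. The penalty is one way to
    # operationalize "how could I have thought that faster?"
    return (1.0 if cot["correct"] else 0.0) - length_penalty * cot["length"]

def train_step(problems, policy):
    # Stand-in for a policy-gradient update: here we just nudge a scalar
    # "skill" in proportion to the average reward over a batch.
    avg_r = sum(reward(sample_cot(p, policy)) for p in problems) / len(problems)
    policy["skill"] = min(0.99, policy["skill"] + 0.01 * max(avg_r, 0.0))
    return avg_r

policy = {"skill": 0.3}
for step in range(5):
    print(f"step {step}: avg reward = {train_step(range(32), policy):.3f}")
```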
The known scaling laws are unsatisfying from the point of view of someone who actually wants to know what will happen next. They can predict numbers like scores on a given benchmark, or residual perplexity. But they can't predict the emergence of new abilities like "can translate languages" or "can tell a joke" or "can take over the world."
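Here's the kind of formula I mean, a Chinchilla-style loss law (the constants are the published fits from Hoffmann et al. 2022, used here purely as an illustration). Note what it gives you: one smooth scalar, with nothing in it about which discrete abilities appear at which loss value:

```python
# Chinchilla-style scaling law: loss as a function of parameter count N
# and training tokens D. Constants are the Hoffmann et al. (2022) fits.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

# Roughly Chinchilla itself: 70B params trained on 1.4T tokens.
print(f"predicted loss: {predicted_loss(70e9, 1.4e12):.3f} nats/token")
```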
By the way, I wouldn't put too much weight on any claims about the "marginal cost of training DeepSeek R1 over DeepSeek v3." DeepSeek has a track record of understating how much effort its results took. I'm not saying it's outright dishonesty (though it might be), but at minimum they don't count costs that other companies include, so their estimates come out looking much lower than everyone else's.