Thank you for this excellent analysis! However, it also makes me wonder whether mankind is close to exhausting the algorithmic insights usable in CoT-based models (see my less credible analysis posted in October 2025) and/or has already found a really cheap way to distill models into smaller ones (see my most recent quick take and the ARC-AGI-1 performance of Gemini 3 Flash, GPT-5-mini, GPT-5.2, and Grok 4 Fast Reasoning, alongside the cluster of o3, o4-mini, GPT-5, GPT-5.1, and the three Claudes 4.5).
A cheap way to distill models into smaller ones would mean that the implications for governance are not so dire. For example, Kokotajlo predicted in May that the creation of GPT-5 would require a dose of elicitation techniques applied to GPT-4.5, meaning that GPT-5's creation was impossible without having spent ~2e26 FLOP on making GPT-4.5 beforehand. Similarly, unlike Qwen 3 Next 80B A3B, GPT-oss-20b could have been distilled from another model. Alas, this tells us nothing about DeepSeek v3.2 and the potential to create a cheaper analogue…
Exhausting the insights would mean that the prediction of frontier models continuing the trend is falsified, unless mankind dares to go beyond CoT, e.g. by making the models neuralese. For example, Claude 3.7 Sonnet displays different results depending on whether it uses reasoning or not (50 points for the reasoning model, 41 points for the non-reasoning one; why wasn't it placed into the AA >= 50 list? Including it could also make the slope less steep). But the shift to reasoning models is a known technique which increases the AA index and was already used for models like DeepSeek, meaning that anyone who tries to cheapen the creation of models with AA >= 65 will have to discover a new technique.
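To illustrate the slope point, here is a minimal sketch, with entirely made-up dates and compute figures, of how adding one earlier, cheaper model that just reaches the threshold can flatten a fitted trend:

```python
# Purely illustrative: how including one borderline-threshold model can
# flatten a fitted compute-over-time trend. All numbers are made up.
import numpy as np

# (years since some reference date, log10 training compute) for a
# hypothetical set of models at the AA >= 50 threshold
years = np.array([0.5, 1.0, 1.5, 2.0])
log_compute = np.array([25.0, 25.4, 25.9, 26.3])
slope_without = np.polyfit(years, log_compute, 1)[0]

# Add one earlier, lower-compute model that just reaches the threshold
# (the Claude 3.7 Sonnet reasoning case)
slope_with = np.polyfit(np.append(years, 0.2),
                        np.append(log_compute, 25.3), 1)[0]

print(f"slope without the borderline model: {slope_without:.2f} OOM/year")
print(f"slope with the borderline model:    {slope_with:.2f} OOM/year")
```

With these toy numbers the slope drops from about 0.88 to about 0.65 OOM/year, which is the kind of effect including Claude 3.7 Sonnet (reasoning) might have.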
It's in this appendix section as a lower-confidence compute estimate and is in the >=45 AAII score bucket. Looking at the data, the reason it is not in the >=50 bucket is that its AAII score, pulled from the Artificial Analysis API, is 49.9. I see that they round to 50 on the main webpage. I just used the raw scores from the API without any rounding. Thanks for the check!
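For concreteness, here is a minimal sketch of that no-rounding bucketing (the model names, scores, and layout are illustrative stand-ins, not the actual Artificial Analysis API schema):

```python
# Bucketing by raw vs. rounded AAII scores; illustrative numbers only.
models = {
    "claude-3.7-sonnet-reasoning": 49.9,  # displayed as 50 on the webpage
    "frontier-model-a": 65.2,
    "mid-tier-model-b": 45.3,
}

def bucket(scores: dict[str, float], threshold: float) -> list[str]:
    """Models whose *raw* score meets the threshold (no rounding)."""
    return [name for name, s in scores.items() if s >= threshold]

print(bucket(models, 50))  # the 49.9 model is excluded
print(bucket(models, 45))  # ...but it lands in the >=45 bucket

# Rounding first, as the webpage display suggests, would flip the result:
print([n for n, s in models.items() if round(s) >= 50])
```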
> it also makes me wonder whether mankind is close to exhausting the algorithmic insights usable in CoT-based models (see my less credible analysis posted in October 2025) and/or has already found a really cheap way to distill models into smaller ones
To be clear about my position, I don’t think the analysis I presented here points at all toward humanity exhausting algorithmic insights. Separate lines of reasoning might lead somebody to that conclusion, but this analysis either has little bearing on the hypothesis or points toward us not running out of insights (on account of the rate of downstream progress being so rapid).