Based on the plot below, my intuition is that GPT-5 and GPT-5.1 were built by adding some pretraining data on top of the original base pretraining dataset (or base model) dating back to GPT-4o, while GPT-5.2 is something different. I did this experiment a while back.
Accuracy being halved going from 5.1 to 5.2 suggests one of two things:
1) the new model shows a dramatic regression on factual data retrieval, which cannot possibly be the desired outcome for a successor; I’m sure it would be noticed immediately on internal tests and benchmarks, etc., and we’d most likely see it manifest in real-world usage as well;
2) the new model refuses to guess much more often when it isn’t too sure (being more cautious about answering wrong), which is a desired outcome meant to reduce hallucinations and slop. I’m betting this is exactly what we’re looking at, and your Sonnet graph also suggests the same.
So if your methodology counts refusal as lowering accuracy, then it doesn’t necessarily prove the base model or the training data mix is different. Teaching a model to refuse on low-signal data is in the domain of SFT and reinforcement learning, and investing heavily in that on top of the same pretrain would produce something similar to the graph you’ve posted.
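To make the distinction concrete, here is a minimal sketch (my own illustration, not the eval harness actually used) of how the two scoring conventions diverge; the labels and numbers are made up.

```python
from collections import Counter

# Hypothetical per-question labels on the factual-recall set:
# "correct", "incorrect", or "refusal" (the model says it doesn't know).
def summarize(labels: list[str]) -> dict[str, float]:
    counts = Counter(labels)
    n = len(labels)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        # Convention A: refusals count against accuracy.
        "accuracy_overall": counts["correct"] / n,
        # Convention B: accuracy over attempted answers only.
        "accuracy_when_guessing": counts["correct"] / attempted if attempted else 0.0,
        # How often the model commits to an answer at all.
        "guess_rate": attempted / n,
    }

# Toy illustration: a model that refuses far more often can halve convention-A
# accuracy while convention-B accuracy (knowledge when it does answer) stays flat.
older = ["correct"] * 60 + ["incorrect"] * 30 + ["refusal"] * 10
newer = ["correct"] * 30 + ["incorrect"] * 15 + ["refusal"] * 55
print(summarize(older))  # accuracy_overall 0.60, accuracy_when_guessing ~0.67
print(summarize(newer))  # accuracy_overall 0.30, accuracy_when_guessing ~0.67
```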
4o and 5 almost certainly have different base models, since 4o is natively omnimodal and 5 and its derivatives are not; taking that into account, you have to make a lot of weird assumptions to reconcile this discrepancy. 5 and 4.1, on the other hand… Everything seems to fall into place neatly when looking in that direction.
I think 1 is true. This is only a single, quite obscure, factual recall eval. It’s certainly possible to have regressions on some evals across model versions if you don’t optimize for those evals at all.
Wrt point 2 → here is the plot of how often the models guess versus how often they say they don’t know, on the same dataset. My understanding is that the theory in point 2 would have predicted a much more dramatic drop in the guess rate for GPT-5.2?
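In case it helps anyone replicate this, here is a rough sketch of how responses could be split into guesses versus “don’t know” answers before plotting; the regex list is my own crude illustration, not the classifier actually used.

```python
import re

# Crude, illustrative patterns for detecting "I don't know"-style answers.
# A real eval would rather instruct the model to output an explicit UNKNOWN
# token, or use an LLM judge, instead of keyword matching.
REFUSAL_PATTERNS = [
    r"\bi don'?t know\b",
    r"\bi'?m not sure\b",
    r"\bunable to (?:determine|recall|find)\b",
    r"\bno (?:reliable )?information\b",
]

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(re.search(p, text) for p in REFUSAL_PATTERNS)

def guess_rate(responses: list[str]) -> float:
    """Fraction of responses where the model commits to an answer."""
    return sum(not is_refusal(r) for r in responses) / len(responses)

print(guess_rate([
    "The answer is 1987.",
    "I don't know the exact year.",
    "I'm not sure, but possibly 1990.",  # hedged guess, miscounted as a refusal here
]))  # -> 0.33...
```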
Interesting method! Added to my collection of LLM ancestry detection methods. Here are the others I have collected:
https://www.lesswrong.com/posts/cGcwQDKAKbQ68BGuR
LLMs sharing the same base model can be identified via fine-tuning on random text
https://www.dbreunig.com/2025/05/30/using-slop-forensics-to-determine-model-ancestry.html
LLMs of similar ancestry produce similar frequencies of slop words (minimal sketch of the idea after this list)
https://fi-le.net/oss/
Using known glitch tokens to identify LLMs/Encoders
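For the slop-forensics link, here is a minimal sketch of the underlying idea as I understand it, assuming you already have a corpus of outputs per model; the marker-word list and the similarity measure are illustrative stand-ins, not the linked post’s actual pipeline.

```python
import math
from collections import Counter

# Illustrative "slop" marker words; the linked post derives its own list empirically.
SLOP_WORDS = ["delve", "tapestry", "intricate", "vibrant", "testament", "nuanced"]

def slop_profile(texts: list[str]) -> list[float]:
    """Relative frequency of each marker word across a model's outputs."""
    words = " ".join(texts).lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in SLOP_WORDS]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# The hypothesis: models sharing ancestry have more similar slop profiles.
samples_a = ["Let us delve into this intricate tapestry of ideas.",
             "A vibrant and nuanced testament to progress."]
samples_b = ["We will delve into the nuanced details.",
             "An intricate yet vibrant design."]
print(cosine(slop_profile(samples_a), slop_profile(samples_b)))
```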
This is such a cool method! I am really curious about applying it to Anthropic’s models. Would you mind sharing the script / data you used?
Here’s the Sonnet size class for now; nothing very interesting...