Can someone explain why many models have slowly-decaying lines? I would have expected sharp drop-offs—knowledge falling to zero after training data ends. In what situation does a model (like GPT-5.2) fall from 0.5 to sub-0.1 accuracy, and stay there for seemingly half a year?
I’m also surprised that old and obsolete GPT-4x models seem to be broadly outcompeting the GPT-5x line. Am I missing something? Are refusals being counted as failures?
I suspect a few different variables are getting mixed together—a model’s raw intelligence, its willingness to provide a specific date, its willingness to confabulate when it doesn’t know, etc.
The decays are probably because there is less training data about recent deaths, and because pre-training may have started before the knowledge cutoff.
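A toy sketch of that mechanism (all numbers and function shapes invented for illustration): if mentions of a death accumulate in the months after it happens and the crawl stops at some cutoff, deaths close to the cutoff have accumulated few mentions, so recall tapers off gradually instead of falling off a cliff.

```python
import numpy as np

# Toy model: coverage of a death accrues after it happens; the crawl
# stops at month T. The half-life and saturation constant below are
# made-up parameters, not estimates of any real model's training mix.
T = 24  # crawl cutoff, in months
months = np.arange(T)

def mentions_collected(death_month, half_life=3.0):
    """Fraction of eventual coverage collected before the crawl cutoff."""
    window = T - death_month
    # Saturating accumulation: most coverage appears soon after the death.
    return 1.0 - np.exp(-window / half_life)

def recall_probability(m, k=0.3):
    # More collected mentions -> higher chance the fact is memorized.
    return m / (m + k)

acc = [recall_probability(mentions_collected(t)) for t in months]
for t in (0, 12, 20, 23):
    print(f"death in month {t:2d}: simulated accuracy {acc[t]:.2f}")
```

Under these assumptions the simulated accuracy declines smoothly toward the cutoff but never hits zero before it, which matches a slow decay rather than a sharp drop-off.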
Older models having better rote memorization of slightly obscure facts isn't that surprising imo. It's not something that gets a lot of optimization pressure.
Having multiple variables mixed together doesn't seem like a big issue for detecting ancestry. False positives should still be highly unlikely, since different pretrains will probably have different "forgetting curves".
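A minimal sketch of that ancestry test, with entirely invented per-month accuracy curves: two models sharing a pretrain should have near-identical forgetting curves, while an unrelated model's curve has a different shape even if its overall trend is the same.

```python
import numpy as np

# Hypothetical per-month accuracy curves (fraction of death dates
# recalled correctly) over the same 12 months. Numbers are made up:
# model_A2 is meant to share a pretrain with model_A; model_B is not.
model_curves = {
    "model_A":  np.array([0.52, 0.50, 0.47, 0.41, 0.33, 0.21,
                          0.12, 0.08, 0.06, 0.05, 0.05, 0.04]),
    "model_A2": np.array([0.55, 0.53, 0.49, 0.43, 0.35, 0.23,
                          0.13, 0.09, 0.07, 0.05, 0.05, 0.05]),
    "model_B":  np.array([0.60, 0.59, 0.58, 0.57, 0.55, 0.50,
                          0.30, 0.10, 0.05, 0.04, 0.04, 0.03]),
}

def curve_similarity(a, b):
    # Pearson correlation of two forgetting curves.
    return np.corrcoef(a, b)[0, 1]

names = list(model_curves)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = curve_similarity(model_curves[names[i]], model_curves[names[j]])
        print(f"{names[i]} vs {names[j]}: r = {r:.3f}")
```

Even though every curve trends downward (so all pairwise correlations are positive), the shared-pretrain pair scores noticeably higher, which is why confounded absolute levels shouldn't cause many false positives.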
That’s a clever idea!
GPT-5.2 is dropping before its knowledge cutoff.