Unless I’m totally off-base here, 15M sounds incredibly high for actually useful recall.
This is the best source I know about for measuring model context length.
Obviously I don’t know about private models, but based on the delta between claimed vs. actual, I’m pretty skeptical that actually useful context length is currently longer than a few hundred thousand tokens.
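For concreteness, “actually useful” here means something like recall under a needle-in-a-haystack probe, which is roughly what these long-context measurements build on. A minimal sketch in Python, where `query_model` is a hypothetical stand-in for whatever chat API wrapper you use, and the filler/needle text and helper names are made up for illustration:

```python
import random

def make_haystack(needle: str, total_tokens: int, tokens_per_filler: int = 12) -> str:
    """Bury a single 'needle' sentence at a random depth inside filler text.

    Token counts are rough word-based approximations; this is only a sketch.
    """
    filler = "The sky was a uniform grey and nothing of note happened that day. "
    n_fillers = max(1, total_tokens // tokens_per_filler)
    chunks = [filler] * n_fillers
    chunks.insert(random.randint(0, n_fillers), needle + " ")
    return "".join(chunks)

def effective_context_probe(query_model, context_lengths, trials=20):
    """Estimate recall accuracy at each context length.

    `query_model(prompt) -> str` is assumed to wrap some chat completion API.
    The effective context length is roughly the largest length at which
    accuracy stays near 1.0 instead of collapsing.
    """
    results = {}
    for length in context_lengths:
        correct = 0
        for _ in range(trials):
            secret = str(random.randint(10_000, 99_999))
            needle = f"The magic number is {secret}."
            prompt = (
                make_haystack(needle, length)
                + "\n\nWhat is the magic number? Answer with digits only."
            )
            if secret in query_model(prompt):
                correct += 1
        results[length] = correct / trials
    return results

# e.g. effective_context_probe(my_api_wrapper, [8_000, 64_000, 256_000, 1_000_000])
```

Real long-context benchmarks (RULER, etc.) use harder multi-needle and aggregation tasks than this single-needle toy, so they give a stricter estimate than the claimed context window.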
I’m pretty skeptical that actually useful context length is currently longer than a few hundred thousand tokens.
Not currently, but this is some kind of brute-force scaling roadmap for one of the major remaining unhobblings, so it has timeline implications. On last year’s hardware, it’s not really feasible to go that far anyway, and RLVR is only just waking up. So the first public observations of negative results on this will probably come in 2026, if actually useful context length fails to improve. And then there’s 2028-2029, following up on the 147 TB of Rubin Ultra NVL576 (Nvidia’s roadmap places it in 2027, which means datacenters with it in 2028, along with possibly models trained for it on older hardware, and then models trained on it in 2029).
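As a rough back-of-the-envelope on why rack memory is the binding constraint here (every model dimension below is invented purely for illustration, not taken from any particular model):

```python
# Back-of-the-envelope KV-cache size for one 15M-token sequence.
# All model dimensions here are invented for illustration.
layers = 120            # hypothetical decoder layers
kv_heads = 16           # hypothetical grouped-query attention KV heads
head_dim = 128
bytes_per_value = 2     # bf16 cache entries
tokens = 15_000_000     # the 15M-token context under discussion

kv_cache_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * tokens  # 2x: keys + values
print(f"{kv_cache_bytes / 1e12:.1f} TB per sequence")  # ~14.7 TB with these numbers
```

With these made-up numbers a single 15M-token KV cache is already ~15 TB, so even a 147 TB rack would hold only on the order of ten such sequences at once.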
But also, for the purpose of automated adaptation to a source of tasks and feedback (such as a job), it doesn’t necessarily need as much fidelity; it only needs to work as well as a human who read some book a year ago, retaining the mental skills but not the words. A context in principle gives the words, but that is not the thing that needs to work.
I suppose I’m unsure how fast this can be scaled. I don’t have a concrete model here though, so probably not worth trying to hash it out.
I’m not sure that the current summarization/searching approach is actually analogous to this. That said, this is probably making the approaches more analogous, so fair point.
I would like to see the updated RULER metrics in 2026.
Any specific predictions you have on what a negative vs. positive result would look like in 2026?