Dario Amodei suggests that in-context learning might suffice for continual learning. The way LLMs do in-context learning with long context is disanalogous to anything humans can do, but a context window of 15M tokens is 500 days of 30K tokens per day, which is more than enough to progress from “first day on the job” to knowing what you are doing with this particular source of tasks. It needs to work mostly with text (if it works at all), or 15M tokens won’t be enough, but that could be sufficient.
So this might just be about moving from RAG to including, in a massively larger context, more free-form observations that the model itself historically made while working on the same source of tasks. The current memory features of chatbots, in the form of long text files, might with sufficient scale become the real thing rather than remaining a dead-end crutch, once these text files get into the habit of accumulating megabytes of observations. And RLVR can plausibly teach the models how to make good use of these very long contexts.
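A minimal sketch of what that could look like mechanically, with entirely made-up file names and helper functions (this is an assumption about the shape of the idea, not anyone’s actual implementation): instead of retrieving snippets, the agent appends its own observations to one ever-growing text file per task source and keeps the whole file in context.

```python
from pathlib import Path

# Hypothetical: one ever-growing memory file per source of tasks (e.g. per job).
MEMORY_FILE = Path("job_memory.txt")

def record_observation(note: str) -> None:
    """Append a free-form observation the model made while working on a task."""
    with MEMORY_FILE.open("a", encoding="utf-8") as f:
        f.write(note.rstrip() + "\n")

def build_prompt(task: str) -> str:
    """Prepend the entire accumulated memory to the current task (no retrieval step)."""
    memory = MEMORY_FILE.read_text(encoding="utf-8") if MEMORY_FILE.exists() else ""
    return f"{memory}\n\n# Current task\n{task}"

# After months of use the memory file is megabytes long; the bet is that the model
# (helped along by RLVR) learns to actually use all of it, rather than the file
# remaining a crutch that only works for small, curated notes.
record_observation("2026-03-02: the weekly export is ISO-8859-1, not UTF-8; convert before diffing.")
prompt = build_prompt("Reconcile this week's export against the ledger.")
```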
With this year’s 14 TB of HBM per GB200 NVL72, very long context windows become more feasible than with the ~1 TB of HBM per node that most current models still run on, and then there’s the next step in 2028 with Rubin Ultra NVL576 systems that have 147 TB of HBM.
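To put rough numbers on why HBM capacity is the binding constraint, here’s a back-of-the-envelope KV-cache calculation; the model shape below is an assumption picked for illustration (a large GQA transformer), not the configuration of any particular deployed model.

```python
# Back-of-the-envelope KV-cache sizing for a 15M-token context.
# Model shape is an illustrative assumption, not any specific model's config.
n_layers      = 80     # assumed transformer depth
n_kv_heads    = 8      # assumed grouped-query KV heads
head_dim      = 128    # assumed per-head dimension
bytes_per_val = 1      # assumed fp8 KV cache

kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val  # K and V
context_tokens = 15_000_000

per_sequence_tb = kv_bytes_per_token * context_tokens / 1e12
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token, "
      f"{per_sequence_tb:.2f} TB of KV cache for one 15M-token sequence")
# ~160 KiB/token and ~2.5 TB per sequence under these assumptions: hopeless on
# ~1 TB of HBM per node (which also has to hold the weights), tight but
# conceivable on 14 TB (GB200 NVL72), comfortable on 147 TB (Rubin Ultra NVL576).
```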
Unless I’m totally off-base here, 15M sounds incredibly high for actually useful recall.
This is the best source I know about for measuring model context length.
Obviously I don’t know about private models, but based on the delta between claimed and actual context length, I’m pretty skeptical that actually useful context length is currently longer than a few hundred thousand tokens.
Not currently, but this is some kind of brute-force scaling roadmap for one of the major remaining unhobblings, so it has timeline implications. On last year’s hardware it’s not really feasible to go that far anyway, and RLVR is only just waking up. So the first public observations of negative results on this will probably come in 2026, if actually useful context length fails to improve. And then there’s 2028-2029, following up on the 147 TB of Rubin Ultra NVL576 (Nvidia’s roadmap places it in 2027, which means datacenters with it in 2028, along with possibly models trained for it using older hardware, and then in 2029 models trained on it).
But also, for the purpose of automated adaptation to a source of tasks and feedback (such as a job), it doesn’t necessarily need as much fidelity; it only needs to work as well as a human who read some book a year ago, retaining the mental skills but not the words. A long context in principle gives the words, but the words are not the thing that needs to work.
I suppose I’m unsure how fast this can be scaled. I don’t have a concrete model here though, so it’s probably not worth trying to hash it out.
I’m not sure that the current summarization/searching approach is actually analogous to this. That said, the memory-file direction is probably making the approaches more analogous, so fair point.
I would like to see the updated RULER metrics in 2026.
Any specific predictions you have on what a negative vs. positive result would look like in 2026?
There is a paper showing that in-context learning over-relies on superficial clues. It is from 2022 and the tested models are old, so maybe newer ones are doing much better, but maybe it is not really learning, at least by some definitions.