I might have misunderstood, but I disagree that you cannot learn things like playing chess in-context. I can write up the rules to a new made-up game, give Claude some examples of what games of it look like, and then play the game against Claude, and he will be able to play it (somewhat).
He’ll still make mistakes sometimes, but if he were able to write down those mistakes (and, as he gets smarter, understand and write down the reasons he made them and how to do better in the future), then when I start a fresh instance and load in those notes, the new instance will play even better. Depending on how good Claude is at understanding his failures, and how good he is at writing down the lessons in a way that a new instance can generalise from, you could (effectively) have a system that keeps getting better at the game.
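The loop described above can be sketched in a few lines. This is a hypothetical illustration, not a real integration: `query_model` stands in for an actual LLM call (e.g. via the Anthropic API) and here just returns a canned reflection so the control flow runs end to end.

```python
def query_model(prompt: str) -> str:
    """Placeholder for a real LLM call; returns a canned reflection."""
    return "lesson: think ahead about the opponent's reply before moving"


def play_session(rules: str, notes: list[str]) -> list[str]:
    """One fresh instance: play with accumulated notes loaded, then reflect."""
    context = rules + "\n\nLessons from past instances:\n" + "\n".join(notes)
    # ... play the game against the model using `context` as its prompt ...
    reflection = query_model(context + "\n\nWhat mistakes did you make, and why?")
    return notes + [reflection]


def improve_over_sessions(rules: str, n_sessions: int) -> list[str]:
    """Cross-instance loop: each session starts fresh but inherits the notes."""
    notes: list[str] = []
    for _ in range(n_sessions):
        notes = play_session(rules, notes)
    return notes
```

The point of the sketch is only that the "memory" lives in the notes, not in any single instance: each session starts from scratch but reads everything its predecessors chose to write down.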
Do you disagree on this? Am I missing something about your claim here? Apologies if my response isn’t on point, I’m somewhat unsure what you are trying to say here.
I agree sample efficiency is an issue. In-context learning is wildly more sample efficient than fine-tuning, but it could be better. The trouble here is if you need some minimum number of samples to understand and generalise something properly, you might hit the context limit first (or you might not recall properly across the whole window, also another open problem).
Regarding recurrent state models, as far as I understand, this is effectively doing the same thing as having a KV-cache with memories, except that in the KV-cache case, the model must explicitly write out what to include, whereas state space models have some method built into their architecture and learnt during training. Am I correct on that? I suspect this is an advantage of the KV-cache memory method, if you believe the models will keep getting more intelligent regardless, because it lets us use that intelligence to decide what to remember—somewhat analogous to a human deciding what to study, or dwelling on things that then get committed to memory.
Having the model explicitly write what to include also sidesteps the problem of designing a good state space learning mechanism, since we’re just relying on intelligence we can train in known ways to decide what information to retain moving forward.
Playing chess at the level of the best humans, using a comparable amount of data. Maybe learning to solve IMO-level problems in-context. More relevant is becoming fluent in novel specialized theory (or empirical facts) developed as part of the current project, or gaining tacit knowledge: skills that can’t usefully be written down as notes to be looked up without effectively learning their contents as skills, the way one would learn to play world-class chess after seeing some games and learning the rules (as external training data, likely prompting things like generation of further internal synthetic training data, or RLVR tasks and RL environments).
In-context learning is in principle sufficient, especially with true recurrence (I don’t mean SSMs, where activations are still climbing the layers and computations have bounded depth; as with chain-of-thought reasoning, recurrence needs to pass information back to the same layer as the sequence progresses; block-level recurrence with blocks of variable length is likely appropriate here, both to retain fast prefill and to pretrain the updating of recurrent state for deep reasoning). But observing the first credible steps in scaling pretraining since the original Mar 2023 GPT-4 (in Opus 4 and Gemini 3 Pro), LLMs don’t seem to be on track to start learning difficult skills in-context. They can learn the rules of chess in-context, but can’t learn to play at the level of the best human chess players in-context, and it doesn’t seem that they will get sufficiently better at this even as scaling of pretraining advances further, before it runs out of useful natural text data.