So… yes, a human learns continually across its life in a way an LLM currently can’t, and this is a big deal for each of us. But when you talk about humanity developing over millennia, that has nothing to do with continual learning within a lifetime. It has to do with each generation learning what it can, and then (pre)training the next generation to have a superior starting position. That is something LLMs can do. If Opus 9 can have a million instances of itself accumulate a week of experiences each and use those in the corpus for pretraining Opus 10 a month later, is that actually a bottleneck to long-term knowledge accumulation, if the AI community (at least within a lineage) is sufficiently close to eusocial?
You tell your LLM story with Opus 9 selecting data for “pretraining Opus 10” but you could equally well change the story to have Opus 9 selecting data for “further fine-tuning itself”, and now it’s a story of how to get continual learning to work in LLMs. Doesn’t really matter, it amounts to the same thing. I’ll use the fine-tuning / continual learning description here because I think it makes things a bit easier to talk about, but it doesn’t matter, you can translate my next paragraph back into the other frame if you prefer.
Anyway, I don’t think it would work (although obviously we’ll find out one way or the other soon enough). If you want to make progress, it’s not enough for an LLM to “accumulate experience” and then fine-tune on that experience. For example, if an LLM outputs a bunch of tokens, then you fine-tune that very same LLM on those very same tokens, then it won’t make the LLM smarter, it will only cause mode collapse. Instead you would need to do something like: have the LLM try to figure out what’s true by thinking for a while, produce a final artifact, and train only on that artifact but not the thinking trace. That’s not an obviously crazy idea, and maybe it would work a little bit, but I think what would happen eventually is that the LLM would make mistakes, then it would lock in those mistakes by fine-tuning, and then it would have confident wrong ideas that lead it to make more mistakes, etc., and in the long term it would get dumber and dumber, not smarter and smarter. I don’t think this kind of approach can lead to human-like open-ended creation of knowledge, akin to the way that human mathematicians invented math from scratch without proof assistants (see §1.1 of my “Sharp Left Turn” post).
some stuff about training an LLM on its outputs for fun, without taking any broader stance in this discussion:
in the limit of generating very many reasoning traces from a model, fine-tuning the model on this data with full-batch ideal gradient descent does not change the model at all, because it is already the globally (and so also locally) optimal log loss predictor for its own sampling distribution
if you only take a single trace, do a gradient step on it, then take a trace from the modified model, do a gradient step again, and so on, then i think that in the limit of these gradient steps being really small, you should still be just staying still — like, not just staying still to order [sum of your step sizes] which is obvious and always true, but staying still more strongly than that — because you are approaching the full-batch case in which you’d actually stay still
if you take a single trace and do a reasonably big gradient step and repeat, then i’m not sure how to think about that as cleanly as the above cases. i think you’re probably right that you get mode collapse for some reasonable hyperparam setting. i guess you probably indeed get a very extreme version of this in the limiting case where you take a single trace and do a huge gradient step on it
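The three cases above can be poked at in a toy setting. The sketch below is only an illustration under strong assumptions: the "model" is a single categorical distribution over three tokens parameterized by logits (not an LLM), and the names `full_batch_gradient` and `single_trace_sgd` are made up for this example. It computes the expected log-loss gradient on the model's own T=1 samples (which should vanish), and then runs single-sample SGD with a tiny and with a huge step size.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_on_token(logits, token):
    """Gradient of -log p(token) w.r.t. the logits: softmax(logits) - onehot(token)."""
    p = softmax(logits)
    return [p_i - (1.0 if i == token else 0.0) for i, p_i in enumerate(p)]

def full_batch_gradient(logits):
    """Expected log-loss gradient when tokens are drawn from the model itself at T=1."""
    p = softmax(logits)
    grad = [0.0] * len(logits)
    for token, weight in enumerate(p):
        g = grad_on_token(logits, token)
        grad = [acc + weight * g_i for acc, g_i in zip(grad, g)]
    return grad

def single_trace_sgd(logits, lr, steps, seed=0):
    """Sample one token from the current model, take one SGD step on it, repeat."""
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        token = rng.choices(range(len(logits)), weights=softmax(logits))[0]
        g = grad_on_token(logits, token)
        logits = [z - lr * g_i for z, g_i in zip(logits, g)]
    return softmax(logits)

start = [1.0, 0.5, 0.0]  # arbitrary toy logits
print("full-batch gradient:", full_batch_gradient(start))
print("tiny steps:", single_trace_sgd(start, lr=1e-3, steps=1000))
print("huge steps:", single_trace_sgd(start, lr=50.0, steps=20))
```

In this toy the full-batch gradient is zero up to floating-point error, tiny steps leave the distribution close to where it started, and a huge step puts almost all probability on whichever token happened to be sampled first. Whether any of this carries over to a real LLM with shared parameters across contexts is not something the sketch can show.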
Hmm, thinking about it more, I think you’re right (no change) if you draw the samples with temperature T=1; and my earlier comment was right (mode collapse, i.e. ever-increasing confidence in the modal next token, approaching 100%) if you draw the samples with temperature 0≤T<1, and repeat enough times. And if you use temperature T>1 then you get, umm, the opposite of mode collapse, where it approaches a uniform probability distribution when you repeat enough times. Right? (I’m not totally sure.) (I agree that this is a fun but irrelevant side-track.)
It is not rigorously obvious to me what happens in the T≠1 case, but I think it’s a solid guess that in practice (still in the limit of gradient flow with many samples though) you roughly just get the original distribution at the chosen temperature when you do one step of this, and iterating many times roughly makes you approach 100% confidence in the original modal token for T<1 and approach a uniform distribution for T>1, as you say.
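The temperature picture can be sketched numerically if one takes the guess above at face value. The code below assumes (this is the guess, not a proven fact) that one round of sampling at temperature T>0 and refitting exactly reproduces the tempered distribution, which for a toy categorical model amounts to dividing the logits by T. The function names are hypothetical, and T=0 is excluded because it would divide by zero (it corresponds to jumping to the modal token in one round).

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def refit_on_tempered_samples(logits, temperature):
    """Idealized round: the refit model equals softmax(logits / T), i.e. logits are divided by T."""
    return [z / temperature for z in logits]

def iterate(logits, temperature, rounds):
    """Apply the idealized sample-and-refit round repeatedly and return the final distribution."""
    for _ in range(rounds):
        logits = refit_on_tempered_samples(logits, temperature)
    return softmax(logits)

start = [1.0, 0.5, 0.0]  # arbitrary toy logits
for temperature in (0.5, 1.0, 2.0):
    print(temperature, iterate(start, temperature, rounds=10))
```

Under that assumption, k rounds scale the logits by 1/T^k, so T<1 sharpens toward the modal token, T=1 changes nothing, and T>1 flattens toward uniform.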
Even though I agree with you on this mathematical description of what happens, I actually feel uneasy about calling this “mode collapse”.[1] Like, I think a lowered temperature vs T=1 isn’t even easily noticeable when reading model outputs on various prompts (not sure). Also, in a sense the model still models the entire distribution on any prompt (assuming the guess that this retraining effectively just changes the temperature is right, you can recover the original distribution just by raising the temperature of the new model), it just gives the final output as if at a lowered temperature. It should remain pretty easy to get the model to talk with any persona it had before or about any topic it could talk about before. These properties imo make it feel like somewhere between a non-example and a non-central example of mode collapse.
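The recoverability point can be illustrated in the same toy setting. This is a sketch that assumes retraining on temperature-T samples really does just divide the logits by T each round; the values of `T` and `rounds` are arbitrary choices for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

original = [1.0, 0.5, 0.0]  # arbitrary toy logits
T, rounds = 0.5, 3          # hypothetical retraining setup

# Assumed effect of retraining: each round divides the logits by T.
retrained = [z / T**rounds for z in original]

sharpened = softmax(retrained)                                # new model sampled at temperature 1
recovered = softmax(retrained, temperature=(1 / T)**rounds)   # new model sampled at a raised temperature

print("original: ", softmax(original))
print("sharpened:", sharpened)
print("recovered:", recovered)
```

Sampling the retrained toy model at temperature (1/T)^rounds undoes the sharpening, which is the sense in which no part of the distribution has been forgotten under this assumption.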
That said, it remains plausible that sth imo more canonically like modes of the text distribution being lost would happen in the case with non-tiny gradient steps computed on single traces.
I think “mode collapse” doesn’t mean collapse to the mode of the distribution, it means collapse of the distribution into fewer parts (“modes”), ie forgetting parts/regions of the generative distribution. My guess is that central examples for text generation would be only speaking Vietnamese, or only talking about consciousness, or only having one tone/persona, or starting every response with “It’s a difficult question.”, or only predicting the same token over and over.
I’m not an expert but my understanding was that the model collapse problem came from training on LLM output that’s not curated, tagged, or otherwise grounded in/checked against real world data/distributions. If you were using agent systems to do real intellectual work, including gathering new data (conducting experiments, reviewing with humans, and so on) where appropriate, and using the results of that work for training, that seems potentially quite different than training on a large corpus of LLM outputs directly?
Also, I do think you’re right that you would want to train on the final artifact and not the thinking trace. TBH I never even really considered the latter. It feels like it’d be kinda like giving a middle schooler the raw lab notebooks of a million grad students and expecting that to turn them into a great scientist, instead of turning those notes into papers and then reviews and then textbooks and lectures and problem sets.