in the limit of generating very many reasoning traces from a model, fine-tuning the model on this data with full-batch ideal gradient descent does not change the model at all, because it is already the globally (and so also locally) optimal log loss predictor for its own sampling distribution
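A toy check of this claim, as a sketch only. It assumes a single next-token distribution parameterized directly by logits rather than a real language model, and the names `softmax` and `expected_log_loss_grad` are made up for illustration. With infinitely many samples the empirical target equals the sampling distribution, and the gradient of the expected log loss with respect to the logits is softmax(logits) minus the target, so it vanishes when the target is the model's own T=1 distribution.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_log_loss_grad(logits, target):
    # Gradient w.r.t. the logits of E_{x ~ target}[-log softmax(logits)[x]].
    # For a softmax parameterization this is softmax(logits) - target.
    p = softmax(logits)
    return [pi - qi for pi, qi in zip(p, target)]

logits = [2.0, 0.5, -1.0]  # hypothetical toy logits

# Target = the model's own T=1 sampling distribution: gradient is zero.
own = softmax(logits)
grad_self = expected_log_loss_grad(logits, own)

# Target = the model sampled at T=0.5: gradient is nonzero, so training moves the model.
sharpened = softmax([z / 0.5 for z in logits])
grad_sharp = expected_log_loss_grad(logits, sharpened)

print(grad_self)
print(grad_sharp)
```

The second gradient is negative on the modal token, so a descent step raises that logit, which is the direction of the sharpening discussed in the reply.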
Hmm, thinking about it more, I think you’re right (no change) if you draw the samples with temperature T=1; and my earlier comment was right (mode collapse, i.e. ever-increasing confidence in the modal next token, approaching 100%) if you draw the samples with temperature 0≤T<1, and repeat enough times. And if you use temperature T>1 then you get, umm, the opposite of mode collapse, where it approaches a uniform probability distribution when you repeat enough times. Right? (I’m not totally sure.) (I agree that this is a fun but irrelevant side-track.)
It is not rigorously obvious to me what happens in the T≠1 case, but I think it’s a solid guess that in practice (still in the limit of gradient flow with many samples though) you roughly just get the original distribution at the chosen temperature when you do one step of this, and iterating many times roughly makes you approach full confidence in the original modal token for T<1 and approach a uniform distribution for T>1, as you say.
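Here is a minimal sketch of that guess on a toy next-token distribution. The key assumption, which is exactly the unproven part, is that one round of retraining makes the model match its old self sampled at temperature T, i.e. the logits get divided by T. The function names and the starting logits are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def retrain_once(logits, T):
    # Idealized retraining step (an assumption, not a derived result):
    # the refit model equals the old model sampled at temperature T.
    return [z / T for z in logits]

def iterate(logits, T, rounds):
    # Repeat the idealized step; after k rounds the logits are scaled by 1 / T**k.
    for _ in range(rounds):
        logits = retrain_once(logits, T)
    return softmax(logits)

start = [2.0, 0.5, -1.0]  # hypothetical toy logits

same = iterate(start, 1.0, 20)    # T=1: unchanged
sharp = iterate(start, 0.8, 20)   # T<1: mass piles onto the modal token
flat = iterate(start, 1.25, 20)   # T>1: drifts toward uniform

print(same)
print(sharp)
print(flat)
```

Under this assumption, k rounds at temperature T are equivalent to a single round at temperature T**k, which is why T<1 and T>1 head toward the two opposite limits.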
Even though I agree with you on this mathematical description of what happens, I actually feel uneasy about calling this “mode collapse”.[1] Like, I think sampling at T=1 vs at a somewhat lower temperature isn’t even easily noticeable when reading model outputs on various prompts (not sure). Also, in a sense the model still models the entire distribution on any prompt (assuming the guess that this retraining effectively just changes the temperature is right, you can recover the original distribution just by raising the temperature of the new model); it just gives the final output as if sampled at a lower temperature. It should remain pretty easy to get the model to talk with any persona it had before or about any topic it could talk about before. These properties imo make it feel like somewhere between a non-example and a non-central example of mode collapse.
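To make the "you can recover the original distribution" point concrete, here is a small sketch under the same guess that retraining only rescales logits by 1/T per round. The logits, T, and the number of rounds are made-up values, and `sample_dist` is a hypothetical helper.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_dist(logits, temperature):
    # Distribution you sample from when decoding at the given temperature.
    return softmax([z / temperature for z in logits])

original_logits = [2.0, 0.5, -1.0]  # hypothetical toy logits
T, rounds = 0.8, 3

# Assumed effect of `rounds` retraining rounds on samples drawn at temperature T.
retrained_logits = [z / T**rounds for z in original_logits]

# Decoding the retrained model at temperature 1 / T**rounds (> 1 here) undoes the sharpening.
recovered = sample_dist(retrained_logits, 1.0 / T**rounds)
original = sample_dist(original_logits, 1.0)

print(original)
print(recovered)
```

This only works before the limit is reached: once the distribution has become numerically one-hot, there is nothing left to rescale.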
That said, it remains plausible that something imo more canonically like losing modes of the text distribution would happen in the case with non-tiny gradient steps computed on single traces.
I think “mode collapse” doesn’t mean collapse to the mode of the distribution, it means collapse of the distribution into fewer parts (“modes”), ie forgetting parts/regions of the generative distribution. My guess is that central examples for text generation would be only speaking Vietnamese, or only talking about consciousness, or only having one tone/persona, or starting every response with “It’s a difficult question.”, or only predicting the same token over and over.