It is not rigorously obvious to me what happens in the case, but I think it’s a solid guess that in practice (still in the limit of gradient flow with many samples though) you roughly just get the original distribution at the chosen temperature when you do one step of this, and iterating many times roughly makes you approach confidence in the original modal token for and approach a uniform distribution for , as you say.
Even though I agree with you on this mathematical description of what happens, I actually feel uneasy about calling this “mode collapse”.[1] Like, I think vs isn’t even easily noticeable when reading model outputs on various prompts (not sure). Also, in a sense the model still models the entire distribution on any prompt (assuming the guess that this retraining effectively just changes the temperature is right, you can recover the original distribution just by raising the temperature of the new model), it just gives the final output as if at . It should remain pretty easy to get the model to talk with any persona it had before or about any topic it could talk about before. These properties imo make it feel like somewhere between a non-example and a non-central example of mode collapse.
That said, it remains plausible that sth imo more canonically like modes of the text distribution being lost would happen in the case with non-tiny gradient steps computed on single traces.
I think “mode collapse” doesn’t mean collapse to the mode of the distribution, it means collapse of the distribution into fewer parts (“modes”), ie forgetting parts/regions of the generative distribution. My guess is that central examples for text generation would be only speaking Vietnamese, or only talking about consciousness, or only having one tone/persona, or starting every response with “It’s a difficult question.”, or only predicting the same token over and over.
It is not rigorously obvious to me what happens in the case, but I think it’s a solid guess that in practice (still in the limit of gradient flow with many samples though) you roughly just get the original distribution at the chosen temperature when you do one step of this, and iterating many times roughly makes you approach confidence in the original modal token for and approach a uniform distribution for , as you say.
Even though I agree with you on this mathematical description of what happens, I actually feel uneasy about calling this “mode collapse”. [1] Like, I think vs isn’t even easily noticeable when reading model outputs on various prompts (not sure). Also, in a sense the model still models the entire distribution on any prompt (assuming the guess that this retraining effectively just changes the temperature is right, you can recover the original distribution just by raising the temperature of the new model), it just gives the final output as if at . It should remain pretty easy to get the model to talk with any persona it had before or about any topic it could talk about before. These properties imo make it feel like somewhere between a non-example and a non-central example of mode collapse.
That said, it remains plausible that sth imo more canonically like modes of the text distribution being lost would happen in the case with non-tiny gradient steps computed on single traces.
I think “mode collapse” doesn’t mean collapse to the mode of the distribution, it means collapse of the distribution into fewer parts (“modes”), ie forgetting parts/regions of the generative distribution. My guess is that central examples for text generation would be only speaking Vietnamese, or only talking about consciousness, or only having one tone/persona, or starting every response with “It’s a difficult question.”, or only predicting the same token over and over.