Yeah, the “4th grader” poem is way better than the example in the post. So maybe there’s something to it. Can you explain why a base model will do this but a chat-tuned one won’t?
The short answer is “mode collapse” (same author as Simulators and also of generative.ink, where that LLM-generated HPMOR continuation I linked came from).
My best crack at the medium answer in 15 minutes or less is:
The base model is trained to predict the next token of text drawn from its training distribution, which means that sampling from the base model will produce text that resembles its training corpus by any statistical measure the model has learned (with 405b that is effectively “any statistical test you can think of”, barring specifically chosen adversarial ones).
My mental model of base models (one that I think is pretty well supported empirically) is that they are giant bags of contextually activated heuristics. Some such heuristics are very strong and narrow (“if the previous tokens were ‘Once upon a’, we are at the 4th word of a fairy tale; when at the 4th word of a fairy tale, output ‘ time’”), and some are wide and weak (“French words tend to be followed by other French words”). And these heuristics are almost exclusively where the model gets its capabilities (there are a few weird exceptions like arithmetic and date munging).
Instruct- or chat-tuned models have secondary objectives that are not simply “predict the next token”. My mental model is that RL is extremely lazy, and will tend to chisel into the model the simplest possible behavior that decreases loss / increases reward. One extremely simple behavior is “output a memorized sequence of text”. This behavior is also very discoverable and easy to reinforce: most of the update just needs to be “get the model to output the first couple tokens of the memorized sequence”. There’s a variant, “fill in the mad lib”, that is also quite easy to discover.
And so unless you make specific efforts to prevent it, doing RL on an LLM will give you a model which consistently falls into a few attractors. This is really hard to prevent: even if your base model and your RL’d model have almost identical logprobs for almost all low-perplexity prefixes, you can still fall into these attractors (once you’re in one of them, you’re no longer looking at text which the base model is trained to predict super accurately).
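To make the “most of the update is just the first couple of tokens” point concrete, here is a toy sketch. Everything in it is made up for illustration: the bigram tables `BASE` and `TUNED`, the probabilities, and the helper names are hypothetical, not measurements from any real model. The only difference between the two tables is the first-token row, yet the sampled completions go from spread out to almost entirely one memorized sequence.

```python
import random
from collections import Counter

END = "</s>"

# Hypothetical toy "base model": next-token probabilities given the previous token.
# "Once" -> "upon" -> "a" -> "time" plays the role of a strong, narrow heuristic.
BASE = {
    "<s>": {"Once": 0.1, "The": 0.3, "A": 0.3, "In": 0.3},
    "Once": {"upon": 1.0},
    "upon": {"a": 1.0},
    "a": {"time": 1.0},
    "The": {"cat": 0.5, "dog": 0.5},
    "A": {"cat": 0.5, "dog": 0.5},
    "In": {"Paris": 0.5, "winter": 0.5},
}

# Hypothetical "tuned model": identical to BASE except for the first-token row.
TUNED = dict(BASE)
TUNED["<s>"] = {"Once": 0.97, "The": 0.01, "A": 0.01, "In": 0.01}


def sample(model, rng):
    """Sample one completion by walking the bigram table until END."""
    tokens, prev = [], "<s>"
    while True:
        # Tokens without a row of their own simply end the text.
        dist = model.get(prev, {END: 1.0})
        prev = rng.choices(list(dist), weights=list(dist.values()))[0]
        if prev == END:
            return " ".join(tokens)
        tokens.append(prev)


def top_share(model, n=1000, seed=0):
    """Return (most common completion, its share of n samples, number of distinct completions)."""
    rng = random.Random(seed)
    counts = Counter(sample(model, rng) for _ in range(n))
    text, count = counts.most_common(1)[0]
    return text, count / n, len(counts)


if __name__ == "__main__":
    print("base :", top_share(BASE))
    print("tuned:", top_share(TUNED))
```

The toy is rigged, of course: the memorized sequence already exists in the base table, so the “update” only has to make its first token likely. That is the sense in which this kind of behavior is cheap to reinforce.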
The very long answer I would like to give at some point involves seeing how few token substitutions it takes to convert base model output into something that looks almost identical to a given chat-tuned model’s output. In other words: have a base and a chat model provide completions for a given prompt, find the single position with the highest KL divergence, replace the base model’s output token there with the chat model’s output token, resample the base model, and repeat.
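A rough sketch of what that loop could look like, on toy models rather than real ones. The assumptions are all mine: the bigram tables `BASE_TABLE` and `CHAT_TABLE` are invented, “resample” is simplified to greedy decoding so the result is deterministic, the divergence is taken as KL(chat || base) at each prefix of the current base output, and “looks almost identical” is simplified to an exact match with the chat model’s greedy completion.

```python
import math

# Hypothetical toy models: next-token distributions keyed by the previous token.
BASE_TABLE = {
    "<user>": {"roses": 0.6, "the": 0.3, "Sure": 0.1},
    "roses": {"are": 0.9, "fade": 0.1},
    "are": {"red": 0.8, "dead": 0.2},
    "red": {".": 1.0},
    "Sure": {"!": 0.9, ".": 0.1},
    "!": {"Here": 0.9, "roses": 0.1},
    "Here": {"is": 1.0},
}
# The toy chat model differs from the base model only in how it starts a reply.
CHAT_TABLE = dict(BASE_TABLE)
CHAT_TABLE["<user>"] = {"Sure": 0.98, "roses": 0.01, "the": 0.01}


def base(prefix):
    return BASE_TABLE.get(prefix[-1], {"<eos>": 1.0})


def chat(prefix):
    return CHAT_TABLE.get(prefix[-1], {"<eos>": 1.0})


def kl(p, q, eps=1e-9):
    """KL(p || q) for dict distributions, flooring q at eps to avoid log(0)."""
    return sum(pt * math.log(pt / max(q.get(t, 0.0), eps)) for t, pt in p.items() if pt > 0)


def argmax(dist):
    return max(dist, key=dist.get)


def decode(model, prompt, length, pinned):
    """Greedy decode, forcing the token at every pinned position."""
    out = []
    for pos in range(length):
        out.append(pinned[pos] if pos in pinned else argmax(model(prompt + out)))
    return out


def substitutions_needed(prompt, length):
    """Pin chat tokens into the base output, highest-KL position first, until it matches."""
    target = decode(chat, prompt, length, {})
    pinned = {}
    current = decode(base, prompt, length, pinned)
    while current != target:
        free = [pos for pos in range(length) if pos not in pinned]
        if not free:
            break
        # Position where the two models disagree most, given the current base output.
        pos = max(free, key=lambda i: kl(chat(prompt + current[:i]), base(prompt + current[:i])))
        pinned[pos] = argmax(chat(prompt + current[:pos]))
        current = decode(base, prompt, length, pinned)
    return pinned, current, target


if __name__ == "__main__":
    print(substitutions_needed(["<user>"], 4))
```

On real models you would sample instead of decoding greedily and would need a softer notion of “looks like the chat model”, so the count of substitutions from a toy like this says nothing about real numbers. It only pins down the procedure.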
I see, interesting, thank you! One more question though, my comment mentioned both text and images, with “uncanny averageness in details” applying to both. If you say there’s a way to (mostly) avoid that for text by using base models instead of chat-tuned ones, what would be the analogous fix for images?
Yeah, openai/guided-diffusion is basically that. Here’s an example colab which uses CLIP guidance to sample from openai/guided-diffusion (not mine, but I did just verify that the notebook still runs).