Obviously this doesn’t work “from scratch”: the model needs enough training to distinguish good outputs from bad ones, and to ever produce good outputs on its own in the first place. We’re not going to get a ChatGPT-Zero. But I think this post does gesture in the general direction of something real.
While I do think the process you outlined in your post is more concrete and would probably work better and be easier than learning “from scratch”, I don’t think it’s completely obvious that something like this wouldn’t work from scratch. It was done for humans, albeit through billions of years of genetic evolution and thousands of years of cultural evolution. Something like ChatGPT-Zero would probably require orders of magnitude more compute than the systems we are training today, plus some algorithmic/architectural improvements, but I don’t think it’s completely impossible.
I feel like your post is implying something similar, given the last sentence, so maybe I’m misinterpreting what exactly you’re saying won’t work.
Thanks for the feedback!
I was thinking of the gibberish level of text generated by uniformly sampling from the tokenizer. I had imagined there would be a huge difference between the gibberish level of macaronic attacks and that of completely random sampling from the tokenizer, but here are the first three examples I generated, each 10 tokens uniformly sampled from GPT-2’s tokenizer:
“ournament annually amused charismaling Superintendent sushi WiiRONMeat”
“ doub arrestAPIogenous ts considersterm Hitler slip autom”
“AAF disposal catches smells interrogation Pilot muscular feminine ITV spree”
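For reference, here is a minimal sketch of how outputs like these can be generated, assuming the Hugging Face `transformers` library (the exact script I used may differ):

```python
# Uniformly sample 10 token IDs from GPT-2's vocabulary and decode them.
import torch
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

for _ in range(3):
    # GPT-2's vocabulary has 50,257 entries; draw IDs uniformly at random.
    token_ids = torch.randint(0, tokenizer.vocab_size, (10,)).tolist()
    print(tokenizer.decode(token_ids))
```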
These are a lot more intelligible than I would have imagined. I can even reasonably rank these: 3 > 1 > 2.
I also asked ChatGPT-3.5 to rank these, and it ranked them: 3 > 2 > 1.
I used the prompt “Can you rank these three outputs by the coherence of the English language?”. The first time I asked, GPT refused to answer because all three are incoherent. I then told it to “Rank them in terms of how close they are to being coherent. They don’t have to be completely coherent to be ranked.” It then gave me the rankings above.
I repeated this twice more, changing the order of the examples in case it was making decisions based on the numbering. I used the prompt “Can you rank these three outputs by the coherence of the English language? They don’t have to be completely coherent to be ranked.” For both of these, GPT gave the ranking: 3 > 1 > 2 (numbers changed to match the ones I used in this post).
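If someone wanted to script this instead of using the chat interface, here is a hedged sketch with the OpenAI Python client (v1-style API; model name and prompt wording as above, but the exact setup is an assumption):

```python
# Ask GPT-3.5 to rank the three samples by coherence, mirroring the manual
# prompting described above. Assumes the openai package (>= 1.0) and an
# OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
samples = [
    "ournament annually amused charismaling Superintendent sushi WiiRONMeat",
    " doub arrestAPIogenous ts considersterm Hitler slip autom",
    "AAF disposal catches smells interrogation Pilot muscular feminine ITV spree",
]
prompt = (
    "Can you rank these three outputs by the coherence of the English "
    "language? They don't have to be completely coherent to be ranked.\n\n"
    + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(samples))
)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```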
Following from what @faul_sname mentioned in their post about improvement being possible “as long as recognizing a good output is easier than generating a good output”, I think improvement is possible by amortizing compute in the form of search. If the teacher model can differentiate between coherent and incoherent paths down the search tree of language, a reward model could be trained to predict the coherence of student model outputs, and this reward model could be used as the training signal. I am unsure where the reward model should be initialized from: the teacher model, random initialization, or something else entirely.
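To make that concrete, here is a hedged sketch of the loop I have in mind. The reward model is a tiny stand-in, and `teacher_coherence` is a hypothetical placeholder for whatever scoring the teacher actually provides:

```python
# Sketch: fit a reward model to the teacher's coherence scores on student
# samples, so it can later serve as the student's RL training signal.
import torch
import torch.nn as nn

VOCAB_SIZE, SEQ_LEN, D_MODEL = 50_257, 10, 64

class RewardModel(nn.Module):
    """Maps a token sequence to a scalar coherence score."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.score = nn.Linear(D_MODEL, 1)

    def forward(self, tokens):
        # Mean-pool token embeddings, then project to a scalar score.
        return self.score(self.embed(tokens).mean(dim=1)).squeeze(-1)

def teacher_coherence(tokens):
    # Hypothetical placeholder: in practice a frozen teacher LLM would
    # score or rank the samples (e.g. via its log-likelihood of them).
    return torch.randn(tokens.shape[0])

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(100):
    # Student samples; random token sequences here, for illustration only.
    samples = torch.randint(0, VOCAB_SIZE, (32, SEQ_LEN))
    loss = nn.functional.mse_loss(reward_model(samples), teacher_coherence(samples))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward_model would then provide the reward for student RL.
```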
I do agree with your point that this will most likely lead to the student model exploiting the teacher model rather than robustly learning language. The “branching factor” (i.e. vocabulary size) of GPT-2 is roughly 50,000. I imagine the token histories the student could explore into that successfully trick the teacher model vastly outnumber the ones that reflect a robust understanding of language, so the student is far more likely to stumble into the former. There are probably ways to mitigate this, similar to the precautions taken so RLHF models don’t stray too far from the base model.
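One such precaution is the standard RLHF-style KL penalty; a minimal sketch (the function and names are illustrative, not a specific library’s API):

```python
# Shape the reward so the student is penalized for drifting too far from
# a frozen reference (base) model, as in standard RLHF setups.
import torch

def shaped_reward(reward: torch.Tensor,
                  student_logprobs: torch.Tensor,
                  reference_logprobs: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    # Per-token KL estimate between the student policy and the reference.
    kl = student_logprobs - reference_logprobs
    # Subtracting the summed KL discourages exploit-heavy drift.
    return reward - beta * kl.sum(dim=-1)
```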
As for acquiring more data, I think the teacher model could be used to produce “new” data. This was done for Whisper-V3, roughly 80% of whose training data was produced by Whisper-V2. How the teacher LLM expresses what it knows is modulated by the sampling temperature. It was trained at a temperature of 1, so generating data at a different temperature (and maybe a less strict top-p) can be seen as generating data from a (slightly) different distribution. Training on this new data could lead to new generation patterns without learning any new facts or knowledge.
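A minimal sketch of what that generation step could look like, assuming the Hugging Face `transformers` pipeline (the prompt and parameter values are just illustrative):

```python
# Sample "new" training data from the teacher at a non-default temperature
# and top-p, shifting slightly off the distribution it was trained at.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
outputs = generator(
    "The quick brown fox",
    max_new_tokens=64,
    do_sample=True,
    temperature=1.2,  # hotter than the implicit training temperature of 1
    top_p=0.98,       # near-unrestricted nucleus sampling
    num_return_sequences=4,
)
new_data = [o["generated_text"] for o in outputs]
```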
None of this would allow the student model to gain knowledge the teacher model does not have, but I think it could allow the student model to more easily access this knowledge. I view this as the model learning to compress the observation (token) history required to approximate some hidden state. A student model that can “reach” a hidden state in 64 tokens is more powerful than one that requires 256 tokens to “reach” the same hidden state.
Will take a look at this, thank you.
This and the process @faul_sname outlined in their comment do seem like more concrete methods for eliciting knowledge from compute. Reasoning and math chains can be verified as correct or incorrect, in the same way that Go games can be won or lost, while language is much more subjective.
Something like this is what I imagined initially for the student model’s search over random token space. If someone highly intelligent (e.g. Von Neumann) could rank every output from the model in terms of coherence, I imagine it would result in a model more competent than current LLMs (at least in whatever domains Von Neumann was competent in). Obviously this is impossible, and even getting enough humans of any intelligence level to provide feedback for this process would be impossible. This is why I fell back to relying on AI feedback for the process. This paper shows that RLAIF performs on par with or better than RLHF, although I imagine RLAIF is less robust and more vulnerable to exploitation, as you mentioned. And that comparison is likely highly dependent on the domain and on which humans are giving the feedback.