I’m not sure finetuning GPT-3 is all that different or those difficulties ‘newly emerged’.
As I recall, the original GPT-3 finetuning API was removed not terribly long after it was announced and didn’t come back for a long time. There were also issues with finetune users like AI Dungeon 2. This might have been connected with the finetune doing shenanigans behind the scenes—OA declined to talk about what the ‘finetuning’ even was, and the general assumption seems to be that they were doing some sort of cheap lightweight-finetune or hack and not a true finetune.
(These are why I never wound up doing any of the GPT-3 finetuning ideas I had back in 2020, like trying to fix poetry by re-tokenizing our poem corpus into IPA phonetic notation—why waste the time & hundreds of dollars if OA is just going to screw it up behind the scenes & not even give you a hint why?)
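(For concreteness, that re-tokenization idea just means converting the corpus to a phonetic transcription before finetuning, so rhyme and meter become visible in the tokens; a rough sketch, assuming the phonemizer package with an espeak backend and a made-up corpus file, might look like this.)

```python
# Rough sketch of the re-tokenization idea: convert a poem corpus to IPA
# phonetic transcription before finetuning.  Assumes the phonemizer package
# with an espeak backend installed; "poems.txt" is a made-up corpus path.
from phonemizer import phonemize

with open("poems.txt") as f:
    poems = f.read().splitlines()

# phonemize accepts a list of lines and returns IPA transcriptions for each
ipa_poems = phonemize(poems, language="en-us", backend="espeak",
                      strip=True, preserve_punctuation=True)

with open("poems_ipa.txt", "w") as f:
    f.write("\n".join(ipa_poems))
```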
Right. But the reports specifically on GPT-3.5-turbo fine-tuning announced in August were glowing, with people reporting being able to reach GPT-4-like levels of performance in narrow domains.
That’s why our expectations were high.
I am sure they do something relatively lightweight, like LoRA (https://arxiv.org/abs/2106.09685), which is what most people tend to be using (I think).
And, of course, with GPT-4 being very different from a conventional GPT-3-style Transformer, if one believes the rumors, the difficulties might easily have emerged when trying to apply a LoRA-like method to it.
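(For reference, the LoRA idea is to freeze the pretrained weights and train only a small low-rank additive update to them; a minimal PyTorch-style sketch, with purely illustrative dimensions and hyperparameters, might look like the following. This is just the published LoRA recipe, not whatever OA actually runs.)

```python
# Minimal LoRA-style wrapper: the pretrained weight is frozen and only the
# low-rank factors A and B are trained, so the effective weight is
# W + (alpha / r) * B @ A.  Dimensions and hyperparameters are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # freeze the pretrained layer
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: starts as the base model
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: adapt one (hypothetical) 4096-wide projection of a Transformer block.
layer = LoRALinear(nn.Linear(4096, 4096))
out = layer(torch.randn(2, 4096))
```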
Indeed, but only years after their original attempt. All of the early GPT-3 finetuning reports were very… meh. No one seemed terribly happy with it.
That’s my point: it seems like the first attempts did not go well for GPT-3. So, it’s not clear that the first attempts going poorly for GPT-4 is anything different. Perhaps in another 3 years, OA will have a new GPT-4 finetuning service which doesn’t require “more work” and Just Works™. (One does hope it wouldn’t take that long the second time around.)
What are the rumors? I’m only aware of MoE.
Yes, the main rumor is that it’s a mixture-of-experts. This is already quite a difference from a single Transformer.
We presume that these experts are mostly made of various components of a Transformer (with some possible additions and modifications, which we don’t know), but we don’t know how independent those experts are, whether they share a sizeable common initial computation and then branch off from it, or whether it’s something else entirely, with some kind of dynamic sparse routing through a single network, and so on… I think it’s unlikely to be “just take a bunch of GPT-3s, run an appropriate subset of them in parallel, and combine the results”.
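(For concreteness, here is a toy sketch of what top-k sparse routing over experts can look like; the module below, with made-up sizes and k = 2, is purely illustrative and implies nothing about how GPT-4 is actually built.)

```python
# Toy top-k ("switch"-style) mixture-of-experts layer: a router scores the
# experts per token, only the top-k experts run on that token, and their
# outputs are combined with the routing weights.  All sizes are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):                           # x: (n_tokens, d_model)
        scores = self.router(x)                     # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # route each token to k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                  # plain loops for clarity; real systems batch by expert
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
y = moe(torch.randn(10, 512))   # each token passes through only 2 of the 8 expert MLPs
```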
There is a huge diversity of techniques combining MoE motifs with Transformer motifs; see e.g. this collection of references: https://github.com/XueFuzhao/awesome-mixture-of-experts
So, we really don’t know; these rumors are only enough to make some partial guesses.
If we survive for a while, all this will eventually become public knowledge, and we’ll probably understand how the magic of GPT-4 is possible.