And another reason why all this is relevant, we know that fine-tuning GPT-3.5 can produce drastic boosts in narrow domains, and some of us (e.g. myself) have expected the same from fine-tuning GPT-4, being able to achieve the performance of the non-existing GPT-4.5 (or 5) in narrow domains.
But that’s not what has happened. Instead OpenAI has communicated that
Preliminary results indicate that GPT-4 fine-tuning requires more work to achieve meaningful improvements over the base model compared to the substantial gains realized with GPT-3.5 fine-tuning.
and, moreover, therefore
We’re creating an experimental access program for GPT-4 fine-tuning. Preliminary results indicate that GPT-4 fine-tuning requires more work to achieve meaningful improvements over the base model compared to the substantial gains realized with GPT-3.5 fine-tuning. As quality and safety for GPT-4 fine-tuning improves, developers actively using GPT-3.5 fine-tuning will be presented with an option to apply to the GPT-4 program within their fine-tuning console.
It is very important to understand the mysterious base-GPT-4 better in the context of both potential benefits and potential hazards of GPT-4 fine-tuning, and also in the context of these newly emerged difficulties of fine-tuning it as fruitfully as GPT-3.5.
I’m not sure finetuning GPT-3 is all that different or those difficulties ‘newly emerged’.
As I recall, the original GPT-3 finetuning API was removed not terribly long after it was announced and didn’t come back for a long time. There were also issues with finetune users like AI Dungeon 2. This might have been connected with the finetune doing shenanigans behind the scenes—OA declined to talk about what the ‘finetuning’ even was, and the general assumption seems to be that they were doing some sort of cheap lightweight-finetune or hack and not a true finetune.
(These are why I never wound up doing any of the GPT-3 finetuning ideas I had back in 2020, like trying to fix poetry by re-tokenizing our poem corpus into IPA phonetic notation—why waste the time & hundreds of dollars if OA is just going to screw it up behind the scenes & not even give you a hint why?)
Right. But the reports specifically on GPT-3.5-turbo fine-tuning announced in August were glowing, with people reporting being able to reach GPT-4-like levels on performance in narrow domains.
That’s why our expectations were high.
I am sure they do something relatively lightweight, like LoRA, https://arxiv.org/abs/2106.09685, which is what people tend to be mostly using (I think).
And, of course, with GPT-4 being very different from a conventional Transformer of GPT-3-like type, if one believes the rumors, the difficulties might have easily emerged, if one has been trying to do something like a LoRA-like thing.
But the reports specifically on GPT-3.5-turbo fine-tuning announced in August were glowing, with people reporting being able to reach GPT-4-like levels on performance in narrow domains.
Indeed, but only years after their original attempt. All of the early GPT-3 finetuning reports were very… meh. No one seemed terribly happy with it.
That’s my point: it seems like the first attempts did not go well for GPT-3. So, it’s not clear that the first attempts going poorly for GPT-4 is anything different. Perhaps in another 3 years, OA will have a new GPT-4 finetuning service which doesn’t require “more work” and Just Works™. (One does hope it wouldn’t take that long the second time around.)
Yes, the main rumor is that it’s a mixture-of-experts. This is already quite a difference from a single Transformer.
We presume that these experts are mostly made of various components of a Transformer (with some possible additions and modifications, which we don’t know), but we don’t know how independent those experts are, or whether they share a sizeable common initial computation and then branch off that, or something else entirely with some kind of dynamic sparse routing through a single network, and so on… I think it’s unlikely to be “just take a bunch of GPT-3′s, run an appropriate subset of them in parallel, and combine the results”.
So, we really don’t know, these rumors are only enough to make some partial guesses.
If we survive for a while, all this will eventually became public knowledge, and we’ll probably understand eventually how the magic of GPT-4 is possible.
And another reason why all this is relevant, we know that fine-tuning GPT-3.5 can produce drastic boosts in narrow domains, and some of us (e.g. myself) have expected the same from fine-tuning GPT-4, being able to achieve the performance of the non-existing GPT-4.5 (or 5) in narrow domains.
But that’s not what has happened. Instead OpenAI has communicated that
and, moreover, therefore
It is very important to understand the mysterious base-GPT-4 better in the context of both potential benefits and potential hazards of GPT-4 fine-tuning, and also in the context of these newly emerged difficulties of fine-tuning it as fruitfully as GPT-3.5.
I’m not sure finetuning GPT-3 is all that different or those difficulties ‘newly emerged’.
As I recall, the original GPT-3 finetuning API was removed not terribly long after it was announced and didn’t come back for a long time. There were also issues with finetune users like AI Dungeon 2. This might have been connected with the finetune doing shenanigans behind the scenes—OA declined to talk about what the ‘finetuning’ even was, and the general assumption seems to be that they were doing some sort of cheap lightweight-finetune or hack and not a true finetune.
(These are why I never wound up doing any of the GPT-3 finetuning ideas I had back in 2020, like trying to fix poetry by re-tokenizing our poem corpus into IPA phonetic notation—why waste the time & hundreds of dollars if OA is just going to screw it up behind the scenes & not even give you a hint why?)
Right. But the reports specifically on GPT-3.5-turbo fine-tuning announced in August were glowing, with people reporting being able to reach GPT-4-like levels on performance in narrow domains.
That’s why our expectations were high.
I am sure they do something relatively lightweight, like LoRA, https://arxiv.org/abs/2106.09685, which is what people tend to be mostly using (I think).
And, of course, with GPT-4 being very different from a conventional Transformer of GPT-3-like type, if one believes the rumors, the difficulties might have easily emerged, if one has been trying to do something like a LoRA-like thing.
Indeed, but only years after their original attempt. All of the early GPT-3 finetuning reports were very… meh. No one seemed terribly happy with it.
That’s my point: it seems like the first attempts did not go well for GPT-3. So, it’s not clear that the first attempts going poorly for GPT-4 is anything different. Perhaps in another 3 years, OA will have a new GPT-4 finetuning service which doesn’t require “more work” and Just Works™. (One does hope it wouldn’t take that long the second time around.)
What are the rumors? I’m only aware of MoE.
Yes, the main rumor is that it’s a mixture-of-experts. This is already quite a difference from a single Transformer.
We presume that these experts are mostly made of various components of a Transformer (with some possible additions and modifications, which we don’t know), but we don’t know how independent those experts are, or whether they share a sizeable common initial computation and then branch off that, or something else entirely with some kind of dynamic sparse routing through a single network, and so on… I think it’s unlikely to be “just take a bunch of GPT-3′s, run an appropriate subset of them in parallel, and combine the results”.
There is a huge diversity of techniques combining the MoE motifs and motifs associated with Transformers, see e.g. this collection of references https://github.com/XueFuzhao/awesome-mixture-of-experts
So, we really don’t know, these rumors are only enough to make some partial guesses.
If we survive for a while, all this will eventually became public knowledge, and we’ll probably understand eventually how the magic of GPT-4 is possible.