Alignment As A Bottleneck To Usefulness Of GPT-3

So there’s this thing where GPT-3 is able to do addition; it has the internal model to do addition, but it takes a little poking and prodding to actually get it to do addition. “Few-shot learning”, as the paper calls it. Rather than prompting the model with

Q: What is 48 + 76? A:

… instead prompt it with

Q: What is 48 + 76? A: 124

Q: What is 34 + 53? A: 87

Q: What is 29 + 86? A:

The same applies to lots of other tasks: arithmetic, anagrams and spelling correction, translation, assorted benchmarks, etc. To get GPT-3 to do the thing we want, it helps to give it a few examples, so it can “figure out what we’re asking for”.
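
For concreteness, here’s a minimal sketch of what few-shot prompting amounts to in code: string together solved examples and leave the last one open. The complete() call is a hypothetical stand-in for whatever text-completion interface you have access to, not any particular library’s API.

    # A minimal sketch of few-shot prompt construction. `complete` is a
    # hypothetical stand-in for a call to whatever GPT-3 interface you have;
    # it is not any particular library's API.
    def build_few_shot_prompt(examples, query):
        # Each solved example shows the model the pattern we want it to continue.
        lines = [f"Q: What is {a} + {b}? A: {a + b}" for a, b in examples]
        # The last question is left unanswered; the model's continuation is the answer.
        a, b = query
        lines.append(f"Q: What is {a} + {b}? A:")
        return "\n".join(lines)

    prompt = build_few_shot_prompt(examples=[(48, 76), (34, 53)], query=(29, 86))
    # answer = complete(prompt)  # hopefully it continues with " 115"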

This is an alignment problem. Indeed, I think of it as the quintessential alignment problem: to translate what-a-human-wants into a specification usable by an AI. The hard part is not to build a system which can do the thing we want; the hard part is to specify the thing we want in such a way that the system actually does it.

The GPT family of models is trained to mimic human writing. So the prototypical “alignment problem” on GPT is prompt design: write a prompt such that actual human writing which started with that prompt would likely contain the thing you actually want. Assuming that GPT has a sufficiently powerful and accurate model of human writing, it should then generate the thing you want.

Viewed through that frame, “few-shot learning” just designs a prompt by listing some examples of what we want—e.g. listing some addition problems and their answers. Call me picky, but that seems like a rather primitive way to design a prompt. Surely we can do better?

Indeed, people are already noticing clever ways to get better results out of GPT-3: e.g. TurnTrout recommends conditioning on writing by smart people, and the right prompt makes the system complain about nonsense rather than generating further nonsense in response. I expect we’ll see many such insights over the next month or so.
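
To give a flavor of what these tricks look like, here are two illustrative prompt prefixes along those lines. They’re my own paraphrases of the general idea, not quotes from TurnTrout’s post, and whether they actually help is an empirical question.

    # Illustrative only: two hand-written prompt prefixes along the lines described
    # above. Neither is quoted from TurnTrout's post; they just show the shape of
    # the trick, and would be prepended to whatever text you want continued.

    # Conditioning on writing by smart people: frame the continuation as coming
    # from a careful, high-quality source.
    quality_prefix = (
        "The following is an excerpt from a carefully edited essay by a "
        "renowned physicist, known for clear and rigorous explanations.\n\n"
    )

    # Getting the model to flag nonsense rather than extend it: frame the task
    # as critique instead of continuation.
    critique_prefix = (
        "Below, a careful reviewer notes whether each statement makes sense, "
        "and explains any errors.\n\nStatement: "
    )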

Capabilities vs Alignment as Bottleneck to Value

I said that the alignment problem on GPT is prompt design: write a prompt such that actual human writing which started with that prompt would likely contain the thing you actually want. Important point: this is worded to be agnostic to the details of the GPT algorithm itself; it’s mainly about predictive power. If we’ve designed a good prompt, the current generation of GPT might still be unable to solve the problem—e.g. GPT-3 doesn’t understand long addition no matter how good the prompt, but some future model with more predictive power should eventually be able to solve it.

In other words, there’s a clear distinction between alignment and capabilities:

  • alignment is mainly about the prompt, and asks whether human writing which started with that prompt would be likely to contain the thing you want

  • capabilities are mainly about GPT’s model, and ask about how well GPT-generated writing matches realistic human writing

Interesting question: between alignment and capabilities, which is the main bottleneck to getting value out of GPT-like models, both in the short term and the long(er) term?

In the short term, it seems like capabilities are still pretty obviously the main bottleneck. GPT-3 clearly has pretty limited “working memory” and understanding of the world. That said, it does seem plausible that GPT-3 could consistently do at least some economically-useful things right now, with a carefully designed prompt—e.g. writing ad copy or editing humans’ writing.

In the longer term, though, we have a clear path forward for better capabilities. Just continuing along the current trajectory will push capabilities to an economically-valuable point on a wide range of problems, and soon. Alignment, on the other hand, doesn’t have much of a trajectory at all yet; designing-writing-prompts-such-that-writing-which-starts-with-the-prompt-contains-the-thing-you-want isn’t exactly a hot research area. There’s probably low-hanging fruit there for now, and it’s largely unclear how hard the problem will be going forward.

Two predictions on this front:

  • With this version of GPT and especially with whatever comes next, we’ll start to see a lot more effort going into prompt design (or the equivalent alignment problem for future systems)

  • As the capabilities of GPT-style models begin to cross beyond what humans can do (at least in some domains), alignment will become a much harder bottleneck, because it’s hard to make a human-mimicking system do things which humans cannot do

Reasoning for the first prediction: GPT-3 is right on the borderline of making alignment economically valuable—i.e. it’s at the point where there’s plausibly some immediate value to be had by figuring out better ways to write prompts. That means there’s finally going to be economic pressure for alignment—there’s going to be ways to make money by coming up with better alignment tricks. That won’t necessarily mean economic pressure for generalizable or robust alignment tricks, though—most of the economy runs on ad-hoc barely-good-enough tricks most of the time, and early alignment tricks will likely be the same. In the longer run, focus will shift toward more robust alignment, as the low-hanging problems are solved and the remaining problems have most of their value in the long tail.

Reasoning for the second prediction: how do I write a prompt such that human writing which began with that prompt would contain a workable explanation of a cheap fusion power generator? In practice, writing which claims to contain such a thing is generally crackpottery. I could take a different angle: maybe write some section-headers with names of particular technologies (e.g. electric motor, radio antenna, water pump, …) and descriptions of how they work, then write a header for “fusion generator” and let the model fill in the description. Something like that could plausibly work. Or it could generate scifi technobabble, because that’s what would be most likely to show up in such a piece of writing today. It all depends on which is “more likely” to appear in human writing. Point is: GPT is trained to mimic human writing; getting it to write things which humans cannot currently write is likely to be hard, even if it has the requisite capabilities.
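
To make the section-header idea concrete, here’s roughly what such a prompt might look like. The entries are made-up placeholders; the point is only the structure, and whether the completion turns out to be physics or technobabble is exactly the question above.

    # A rough sketch of the section-header prompt described above. The entries are
    # made-up placeholders; the point is just the structure, ending with an open
    # header for the model to fill in.
    prompt = (
        "Electric motor: converts electrical energy into rotational motion, using "
        "the force on a current-carrying wire in a magnetic field.\n\n"
        "Water pump: moves water by creating a pressure difference, e.g. with a "
        "rotating impeller driven by a motor.\n\n"
        "Fusion generator:"
    )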