Learning the prior

Sup­pose that I have a dataset D of ob­served (x, y) pairs, and I’m in­ter­ested in pre­dict­ing the la­bel y* for each point x* in some new set D*. Per­haps D is a set of fore­casts from the last few years, and D* is a set of ques­tions about the com­ing years that are im­por­tant for plan­ning.

The clas­sic deep learn­ing ap­proach is to fit a model f on D, and then pre­dict y* us­ing f(x*).

This ap­proach im­plic­itly uses a some­what strange prior, which de­pends on ex­actly how I op­ti­mize f. I may end up with the model with the small­est l2 norm, or the model that’s eas­iest to find with SGD, or the model that’s most ro­bust to dropout. But none of these are any­where close to the “ideal” be­liefs of a hu­man who has up­dated on D.

This means that neu­ral nets are un­nec­es­sar­ily data hun­gry, and more im­por­tantly that they can gen­er­al­ize in an un­de­sir­able way. I now think that this is a safety prob­lem, so I want to try to at­tack it head on by learn­ing the “right” prior, rather than at­tempt­ing to use neu­ral nets as an im­plicit prior.

Warm-up 1: hu­man forecasting

If D and D* are small enough, and I’m OK with hu­man-level fore­casts, then I don’t need ML at all.

In­stead I can hire a hu­man to look at all the data in D, learn all the rele­vant les­sons from it, and then spend some time fore­cast­ing y* for each x*.

Now let’s grad­u­ally re­lax those as­sump­tions.

Warm-up 2: pre­dict­ing hu­man forecasts

Sup­pose that D* is large but that D is still small enough that a hu­man can ex­tract all the rele­vant les­sons from it (or that for each x* in D*, there is a small sub­set of D that is rele­vant).

In this case, I can pay hu­mans to make fore­casts for many ran­domly cho­sen x* in D*, train a model f to pre­dict those fore­casts, and then use f to make fore­casts about the rest of D*.
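As a minimal sketch of this procedure: the `human_forecast` function, the nearest-neighbor model, and the numbers below are all hypothetical stand-ins chosen for illustration, not part of the proposal. The point is only that the model is fit to human forecasts on a random sample of D* and then applied to the rest of D*.

```python
import random

def human_forecast(x):
    """Hypothetical stand-in for a paid human forecaster."""
    return 2.0 * x + 1.0

def fit_nearest_neighbor(labeled):
    """Return a predictor that copies the forecast of the closest labeled x*."""
    def f(x):
        nearest_x, nearest_y = min(labeled, key=lambda pair: abs(pair[0] - x))
        return nearest_y
    return f

random.seed(0)
D_star = [i / 100 for i in range(1000)]   # the large set of questions
sampled = random.sample(D_star, 50)       # random subset sent to humans
labeled = [(x, human_forecast(x)) for x in sampled]

f = fit_nearest_neighbor(labeled)
predictions = {x: f(x) for x in D_star}   # forecasts for all of D*
```

Because the labeled points are drawn at random from D* itself, the model is only ever asked to interpolate within the distribution it was trained on.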

The generalization is now coming entirely from human beliefs, not from the structure of the neural net: we are only applying neural nets to iid samples from D*.

Learn­ing the hu­man prior

Now sup­pose that D is large, such that a hu­man can’t up­date on it them­selves. Per­haps D con­tains billions of ex­am­ples, but we only have time to let a hu­man read a few pages of back­ground ma­te­rial.

In­stead of learn­ing the un­con­di­tional hu­man fore­cast P(y|x), we will learn the fore­cast P(y|x, Z), where Z is a few pages of back­ground ma­te­rial that the hu­man takes as given. We can also query the hu­man for the prior prob­a­bil­ity Prior(Z) that the back­ground ma­te­rial is true.

Then we can train f(y|x, Z) to match P(y|x, Z), and op­ti­mize Z* for:

log Prior(Z*) + sum((x, y) ~ D) log f(y|x, Z*)

We train f in par­allel with op­ti­miz­ing Z*, on in­puts con­sist­ing of the cur­rent value of Z* to­gether with ques­tions x sam­pled from D and D*.
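To make the objective concrete, here is a toy sketch. It assumes a tiny discrete set of candidate Z's, a made-up `prior` table standing in for the human's Prior(Z), and a hand-written `f` standing in for the learned model. A real version would search over text and train f alongside it.

```python
import math

# Hypothetical toy setup: each candidate Z is a claimed bias for a coin,
# standing in for "a few pages of background material".
candidates = {"fair": 0.5, "biased_heads": 0.8, "biased_tails": 0.2}

# Stand-in for the human's Prior(Z); the values are made up for illustration.
prior = {"fair": 0.6, "biased_heads": 0.2, "biased_tails": 0.2}

def f(y, x, z_name):
    """Stand-in for the learned f(y|x, Z): probability of label y given Z."""
    p_heads = candidates[z_name]
    return p_heads if y == 1 else 1.0 - p_heads

def score(z_name, data):
    """log Prior(Z) + sum over (x, y) in D of log f(y|x, Z)."""
    return math.log(prior[z_name]) + sum(
        math.log(f(y, x, z_name)) for x, y in data
    )

# D: 8 heads and 2 tails; x is unused in this toy example.
D = [(None, 1)] * 8 + [(None, 0)] * 2
z_star = max(candidates, key=lambda z: score(z, D))
```

In this toy case the data outweighs the prior's preference for "fair", so the search selects "biased_heads".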

For ex­am­ple, Z might spec­ify a few ex­plicit mod­els for fore­cast­ing and trend ex­trap­o­la­tion, a few im­por­tant back­ground as­sump­tions, and guesses for a wide range of em­piri­cal pa­ram­e­ters. Then a hu­man who reads Z can eval­u­ate how plau­si­ble it is on its face, or they can take it on faith in or­der to pre­dict y* given x*.

The op­ti­mal Z* is then the set of as­sump­tions, mod­els, and em­piri­cal es­ti­mates that works best on the his­tor­i­cal data. The hu­man never has to rea­son about more than one dat­a­point at a time — they just have to eval­u­ate what Z* im­plies about each dat­a­point in iso­la­tion, and eval­u­ate how plau­si­ble Z* is a pri­ori.

This ap­proach has many prob­lems. Two par­tic­u­larly im­por­tant ones:

  • To be com­pet­i­tive, this op­ti­miza­tion prob­lem needs to be nearly as easy as op­ti­miz­ing f di­rectly on D, but it seems harder: find­ing Z* might be much harder than learn­ing f, learn­ing a con­di­tional f might be much harder than learn­ing an un­con­di­tional f, and jointly op­ti­miz­ing Z and f might pre­sent fur­ther difficul­ties.

  • Even if it worked, our forecasts would only be “human-level” in a fairly restrictive sense: they wouldn’t even be as good as a human who actually spent years practicing on D before making a forecast on D*. To be competitive, we want the forecasts in the iid case to be at least as good as fitting a model directly.

I think the first point is an in­ter­est­ing ML re­search prob­lem. (If any­thing re­sem­bling this ap­proach ever works in prac­tice, credit will rightly go to the re­searchers who figure out the pre­cise ver­sion that works and re­solve those is­sues, and this blog post will be a foot­note.) I feel rel­a­tively op­ti­mistic about our col­lec­tive abil­ity to solve con­crete ML prob­lems, un­less they turn out to be im­pos­si­ble. I’ll give some pre­limi­nary thoughts in the next sec­tion “Notes & elab­o­ra­tions.”

The sec­ond con­cern, that we need some way to go be­yond hu­man level, is a cen­tral philo­soph­i­cal is­sue and I’ll re­turn to it in the sub­se­quent sec­tion “Go­ing be­yond the hu­man prior.”

Notes & elaborations

  • Search­ing over long texts may be ex­tremely difficult. One idea to avoid this is to try to have a hu­man guide the search, by ei­ther gen­er­at­ing hy­pothe­ses Z at ran­dom or sam­pling per­tur­ba­tions to the cur­rent value of Z. Then we can fit a gen­er­a­tive model of that ex­plo­ra­tion pro­cess and perform search in the la­tent space (and also fit f in the la­tent space rather than hav­ing it take Z as in­put). That rests on two hopes: (i) learn­ing the ex­plo­ra­tion model is easy rel­a­tive to the other op­ti­miza­tion we are do­ing, (ii) search­ing for Z in the la­tent space of the hu­man ex­plo­ra­tion pro­cess is strictly eas­ier than the cor­re­spond­ing search over neu­ral nets. Both of those seem quite plau­si­ble to me.

  • We don’t necessarily need to learn f everywhere; it only needs to be valid in a small neighborhood of the current Z. That may not be much harder than learning the unconditional f.

  • Z rep­re­sents a full pos­te­rior rather than a de­ter­minis­tic “hy­poth­e­sis” about the world, e.g. it might say “R0 is uniform be­tween 2 and 3.” What I’m call­ing Prior(Z) is re­ally the KL be­tween the prior and Z, and P(y|x,Z) will it­self re­flect the un­cer­tainty in Z. The mo­ti­va­tion is that we want a flex­ible and learn­able pos­te­rior. (This is par­tic­u­larly valuable once we go be­yond hu­man level.)

  • This for­mu­la­tion queries the hu­man for Prior(Z) be­fore each fit­ness eval­u­a­tion. That might be fine, or you might need to learn a pre­dic­tor of that judg­ment. It might be eas­ier for a hu­man to re­port a ra­tio Prior(Z)/​Prior(Z′) than to give an ab­solute prior prob­a­bil­ity, but that’s also fine for op­ti­miza­tion. I think there are a lot of difficul­ties of this fla­vor that are similar to other efforts to learn from hu­mans.

  • For the pur­pose of study­ing the ML op­ti­miza­tion difficul­ties I think we can ba­si­cally treat the hu­man as an or­a­cle for a rea­son­able prior. We will then need to re­lax that ra­tio­nal­ity as­sump­tion in the same way we do for other in­stances of learn­ing from hu­mans (though a lot of the work will also be done by our efforts to go be­yond the hu­man prior, de­scribed in the next sec­tion).
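The remark above that a ratio Prior(Z)/Prior(Z′) is enough for optimization can be illustrated with a small local search. This is only a sketch under made-up assumptions: Z is collapsed to a single integer, `log_prior_ratio` is a hypothetical stand-in for the human's comparative judgment, and `log_likelihood` stands in for the sum of log f(y|x, Z) over D.

```python
def log_prior_ratio(z_new, z_old):
    """Hypothetical human judgment of log(Prior(z_new) / Prior(z_old)).
    Here: a made-up preference for values of z near 0."""
    return -0.1 * (z_new ** 2 - z_old ** 2)

def log_likelihood(z, data):
    """Stand-in for the sum over D of log f(y|x, Z), up to a constant."""
    return sum(-0.5 * (y - z) ** 2 for y in data)

def hill_climb(z, data, steps=100):
    """Accept a perturbation whenever it improves the objective.
    Only the prior *ratio* between neighbors is ever queried."""
    for _ in range(steps):
        improved = False
        for candidate in (z - 1, z + 1):
            gain = (log_prior_ratio(candidate, z)
                    + log_likelihood(candidate, data)
                    - log_likelihood(z, data))
            if gain > 0:
                z, improved = candidate, True
                break
        if not improved:
            break
    return z

data = [4.0, 5.0, 6.0]
z_star = hill_climb(0, data)
```

The search compares each candidate only with the current value, so no absolute prior probability is needed at any step.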

Go­ing be­yond the hu­man prior

How do we get pre­dic­tions bet­ter than ex­plicit hu­man rea­son­ing?

We need to have a richer la­tent space Z, a bet­ter Prior(Z), and a bet­ter con­di­tional P(y|x, Z).

In­stead of hav­ing a hu­man pre­dict y given x and Z, we can use am­plifi­ca­tion or de­bate to train f(y|x, Z) and Prior(Z). This al­lows Z to be a large ob­ject that can­not be di­rectly ac­cessed by a hu­man.

For ex­am­ple, Z might be a full library of books de­scribing im­por­tant facts about the world, heuris­tics, and so on. Then we may have two pow­er­ful mod­els de­bat­ing “What should we pre­dict about x, as­sum­ing that ev­ery­thing in Z is true?” Over the course of that de­bate they can cite small com­po­nents of Z to help make their case, with­out the hu­man need­ing to un­der­stand al­most any­thing writ­ten in Z.

In or­der to make this ap­proach work, we need to do a lot of things:

  1. We still need to deal with all the ML difficul­ties de­scribed in the pre­ced­ing sec­tion.

  2. We still need to an­a­lyze de­bate/​am­plifi­ca­tion, and now we’ve in­creased the prob­lem difficulty slightly. Rather than merely re­quiring them to pro­duce the “right” an­swers to ques­tions, we also need them to im­ple­ment the “right” prior. We already needed to im­ple­ment the right prior as part of an­swer­ing ques­tions cor­rectly, so this isn’t too much of a strength­en­ing, but we are call­ing at­ten­tion to a par­tic­u­larly challeng­ing case. It also im­poses a par­tic­u­lar struc­ture on that rea­son­ing which is a real (but hope­fully slight) strength­en­ing.

  3. En­tan­gled with the new anal­y­sis of am­plifi­ca­tion/​de­bate, we also need to en­sure that Z is able to rep­re­sent a rich enough la­tent space. I’ll dis­cuss im­plicit rep­re­sen­ta­tions of Z in the next sec­tion “Rep­re­sent­ing Z.”

  4. Rep­re­sent­ing Z im­plic­itly and us­ing am­plifi­ca­tion or de­bate may make the op­ti­miza­tion prob­lem even more difficult. I’ll dis­cuss this in the sub­se­quent sec­tion “Jointly op­ti­miz­ing Mz and f.”

Rep­re­sent­ing Z

I’ve de­scribed Z as be­ing a gi­ant string of text. If de­bate/​am­plifi­ca­tion work at all then I think text is in some sense “uni­ver­sal,” so this isn’t a crazy re­stric­tion.

That said, rep­re­sent­ing com­plex be­liefs might re­quire very long text, per­haps many or­ders of mag­ni­tude larger than the model f it­self. That means that op­ti­miz­ing for (Z, f) jointly will be much harder than op­ti­miz­ing for f alone.

The ap­proach I’m most op­ti­mistic about is rep­re­sent­ing Z im­plic­itly as the out­put of an­other model Mz. For ex­am­ple, if Z is a text that is trillions of words long, you could have Mz out­put the ith word of Z on in­put i.
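A toy sketch of that interface, where the function `mz` and its repeating vocabulary are purely hypothetical placeholders for a trained model: Z is never stored, and any passage is read by querying individual positions.

```python
def mz(i):
    """Hypothetical stand-in for the model Mz: returns the i-th word of Z."""
    vocabulary = ["growth", "rates", "tend", "to", "persist"]
    return vocabulary[i % len(vocabulary)]

Z_LENGTH = 10 ** 12  # Z is notionally trillions of words long

def read_span(start, length):
    """Read a short passage of Z by querying Mz word by word."""
    return [mz(i) for i in range(start, min(start + length, Z_LENGTH))]

passage = read_span(10 ** 11, 5)
```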

(To be re­ally effi­cient you’ll need to share pa­ram­e­ters be­tween f and Mz but that’s not the hard part.)

This can get around the most obvious problem, that Z is too long to possibly write down in its entirety, but I think we actually have to be pretty careful about the implicit representation or else we will make Mz’s job too hard (in a way that will be tied up with the competitiveness of debate/amplification).

In par­tic­u­lar, I think that rep­re­sent­ing Z as im­plicit flat text is un­likely to be work­able. I’m more op­ti­mistic about the kind of ap­proach de­scribed in ap­proval-max­i­miz­ing rep­re­sen­ta­tions — Z is a com­plex ob­ject that can be re­lated to slightly sim­pler ob­jects, which can them­selves be re­lated to slightly sim­pler ob­jects… un­til even­tu­ally bot­tom­ing out with some­thing sim­ple enough to be read di­rectly by a hu­man. Then Mz im­plic­itly rep­re­sents Z as an ex­po­nen­tially large tree, and only needs to be able to do one step of un­pack­ing at a time.
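The tree picture can be sketched as follows. Everything here is an illustrative assumption: the `unpack` function stands in for one call to Mz, and the branching factor, depth, and summary strings are invented. What it shows is that a reader only ever needs one unpacking step at a time, even though the full tree is far too large to enumerate.

```python
BRANCHING = 10
DEPTH = 12  # 10**12 leaves, far too many to enumerate

def unpack(path):
    """Hypothetical one-step unpacking by Mz: describe the node at `path`.
    Returns (summary, number_of_children); leaves have no children."""
    label = ".".join(map(str, path))
    if len(path) == DEPTH:
        return "leaf fact " + label, 0
    return "summary of subtree " + (label or "root"), BRANCHING

def descend(choices):
    """Follow a list of child indices from the root, one unpacking step each."""
    path, trail = (), []
    for choice in choices:
        summary, n_children = unpack(path)
        if not 0 <= choice < n_children:
            raise IndexError("no such child: %r" % (choice,))
        trail.append(summary)
        path += (choice,)
    trail.append(unpack(path)[0])
    return trail

trail = descend([3, 1, 4])
```

Each call to `unpack` returns something short enough to read directly, which is the property the approach relies on.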

Jointly op­ti­miz­ing Mz and f

In the first sec­tion I dis­cussed a model where we learn f(y|x, Z) and then use it to op­ti­mize Z. This is harder if Z is rep­re­sented im­plic­itly by Mz, since we can’t re­ally af­ford to let f take Mz as in­put.

I think the most promis­ing ap­proach is to have Mz and f both op­er­ate on a com­pact la­tent space, and perform op­ti­miza­tion in this space. I men­tion that idea in Notes & Elab­o­ra­tions above, but want to go into more de­tail now since it gets a lit­tle more com­pli­cated and be­comes a more cen­tral part of the pro­posal.

(There are other plau­si­ble ap­proaches to this prob­lem; hav­ing more an­gles of at­tack makes me feel more com­fortable with the prob­lem, but all of the oth­ers feel less promis­ing to me and I wanted to keep this blog post a bit shorter.)

The main idea is that rather than train­ing a model Mz(·) which im­plic­itly rep­re­sents Z, we train a model Mz(·, z) which im­plic­itly rep­re­sents a dis­tri­bu­tion over Z, pa­ram­e­ter­ized by a com­pact la­tent z.

Mz is trained by iter­ated am­plifi­ca­tion to imi­tate a su­per­hu­man ex­plo­ra­tion dis­tri­bu­tion, analo­gous to the way that we could ask a hu­man to sam­ple Z and then train a gen­er­a­tive model of the hu­man’s hy­poth­e­sis-gen­er­a­tion. Train­ing Mz this way is it­self an open ML prob­lem, similar to the ML prob­lem of mak­ing iter­ated am­plifi­ca­tion work for ques­tion-an­swer­ing.

Now we can train f(y|x, z) us­ing am­plifi­ca­tion or de­bate. When­ever we would want to refer­ence Z, we use Mz(·, z). Similarly, we can train Prior(z). Then we choose z* to op­ti­mize log Prior(z*) + sum((x, y) ~ D) log f(y|x, z*).
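A toy sketch of this last step, with every component hypothetical: `mz` answers a single kind of query, `f` is a hand-written Gaussian rather than a model trained by amplification or debate, `log_prior` is a made-up quadratic, and the search is a plain grid search over a one-dimensional z.

```python
import math

def mz(query, z):
    """Hypothetical Mz(., z): answers a query about the implicit Z for latent z."""
    if query == "growth_rate":
        return z  # in this toy example, z directly encodes one claim in Z
    raise KeyError(query)

def f(y, x, z):
    """Stand-in for f(y|x, z): Gaussian density around Z's implied prediction.
    Z is only ever referenced through Mz(., z)."""
    predicted = mz("growth_rate", z) * x
    return math.exp(-0.5 * (y - predicted) ** 2) / math.sqrt(2 * math.pi)

def log_prior(z):
    """Stand-in for the learned log Prior(z), up to a constant."""
    return -0.5 * z ** 2

def objective(z, data):
    """log Prior(z) + sum over (x, y) in D of log f(y|x, z)."""
    return log_prior(z) + sum(math.log(f(y, x, z)) for x, y in data)

D = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
grid = [i / 100 for i in range(0, 401)]
z_star = max(grid, key=lambda z: objective(z, D))
```

Here the least-squares slope of the data alone would be about 2.04, and the prior pulls the selected z* down to 1.9.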

Rather than end­ing up with a hu­man-com­pre­hen­si­ble pos­te­rior Z*, we’ll end up with a com­pact la­tent z*. The hu­man-com­pre­hen­si­ble pos­te­rior Z* is im­ple­mented im­plic­itly by Mz(·, z*).


I think the ap­proach in this post can po­ten­tially re­solve the is­sue de­scribed in Inac­cessible In­for­ma­tion, which I think is one of the largest re­main­ing con­cep­tual ob­sta­cles for am­plifi­ca­tion/​de­bate. So over­all I feel very ex­cited about it.

Tak­ing this ap­proach means that am­plifi­ca­tion/​de­bate need to meet a slightly higher bar than they oth­er­wise would, and in­tro­duces a bit of ex­tra philo­soph­i­cal difficulty. It re­mains to be seen whether am­plifi­ca­tion/​de­bate will work at all, much less whether they can meet this higher bar. But over­all I feel pretty ex­cited about this out­come, since I was ex­pect­ing to need a larger re­work­ing of am­plifi­ca­tion/​de­bate.

I think it’s still very pos­si­ble that the ap­proach in this post can’t work for fun­da­men­tal philo­soph­i­cal rea­sons. I’m not say­ing this blog post is any­where close to a con­vinc­ing ar­gu­ment for fea­si­bil­ity.

Even if the ap­proach in this post is con­cep­tu­ally sound, it in­volves sev­eral se­ri­ous ML challenges. I don’t see any rea­son those challenges should be im­pos­si­ble, so I feel pretty good about that — it always seems like good news when you can move from philo­soph­i­cal difficulty to tech­ni­cal difficulty. That said, it’s still quite pos­si­ble that one of these tech­ni­cal is­sues will be a fun­da­men­tal deal-breaker for com­pet­i­tive­ness.

My current view is that we don’t have candidate obstructions for amplification/debate as an approach to AI alignment, though we have a lot of work to do to actually flesh them out into a workable approach. This is a more optimistic place than I was at a month ago when I wrote Inaccessible Information.

Learning the prior was originally published in AI Alignment on Medium.