Better priors as a safety problem

(Related: Inaccessible Information, What does the universal prior actually look like?, Learning the prior)

Fitting a neural net implicitly uses a “wrong” prior. This makes neural nets more data hungry and makes them generalize in ways we don’t endorse, but it’s not clear whether it’s an alignment problem.

After all, if neural nets are what works, then both the aligned and unaligned AIs will be using them. It’s not clear if that systematically disadvantages aligned AI.

Unfortunately I think it’s an alignment problem:

  • I think the neural net prior may work better for agents with certain kinds of simple goals, as described in Inaccessible Information. The problem is that the prior mismatch may bite harder for some kinds of questions, and some agents simply never need to answer those hard questions.

  • I think that Solomonoff induction generalizes catastrophically because it becomes dominated by consequentialists who use better priors.

In this post I want to try to build some intuition for this problem, and then explain why I’m currently feeling excited about learning the right prior.

Indirect specifications in universal priors

We usually work with very broad “universal” priors, both in theory (e.g. Solomonoff induction) and in practice (deep neural nets are a very broad hypothesis class). For simplicity I’ll talk about the theoretical setting in this section, but I think the points apply equally well in practice.

The classic universal prior is a random output from a random stochastic program. We often think of the question “which universal prior should we use?” as equivalent to the question “which programming language should we use?” but I think that’s a loaded way of thinking about it: not all universal priors are defined by picking a random program.

A universal prior can never be too wrong: a prior P is universal if, for any other computable prior Q, there is some constant c such that, for all x, we have P(x) > c Q(x). That means that given enough data, any two universal priors will always converge to the same conclusions, and no computable prior will do much better than them.
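The dominance condition gives the standard quantitative version of this guarantee, in the same notation:

```latex
P(x) > c\,Q(x) \;\;\text{for all } x
\quad\Longrightarrow\quad
-\log P(x) < -\log Q(x) + \log\frac{1}{c}.
```

So P’s cumulative log-loss exceeds Q’s by at most the fixed constant log(1/c), no matter how much data arrives. The catch, as the next paragraph argues, is that c can be astronomically small, so this constant can dominate everything at finite data.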

Unfortunately, universality is much less helpful in the finite data regime. The first warning sign is that our “real” beliefs about the situation can appear in the prior in two different ways:

  • Directly: if our beliefs about the world are described by a simple computable predictor, they are guaranteed to appear in a universal prior with significant weight.

  • Indirectly: the universal prior also “contains” other programs that are themselves acting as priors. For example, suppose I use a universal prior with a terribly inefficient programming language, in which each character needs to be repeated 10 times in order for the program to do anything non-trivial. This prior is still universal, but it’s reasonably likely that the “best” explanation for some data will be to first sample a really simple interpreter for a better programming language, and then draw a uniformly random program in that better programming language.
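The interpreter example can be made concrete with a toy cost calculation. The 10x-repetition penalty is from the text, but the specific lengths below (a 1000-character hypothesis, a 100-character interpreter) are illustrative assumptions of mine:

```python
def direct_cost(program_len):
    # In the inefficient language, every character of the program
    # must be repeated 10 times.
    return 10 * program_len

def indirect_cost(interpreter_len, program_len):
    # Pay the 10x penalty only once, on a short interpreter for a
    # better language, then write the program one character at a time.
    return 10 * interpreter_len + program_len

hypothesis_len, interpreter_len = 1000, 100  # hypothetical lengths

direct = direct_cost(hypothesis_len)                       # 10000 characters
indirect = indirect_cost(interpreter_len, hypothesis_len)  # 2000 characters

# Under a prior that weights programs by 2^-length, the indirect
# explanation is 2^(direct - indirect) times more probable a priori.
print(direct - indirect)  # 8000
```

The gap grows linearly with the hypothesis length, while the interpreter is a one-time cost, which is why the indirect route wins for any interesting amount of data.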

(There isn’t a bright line between these two kinds of posterior, but I think it’s extremely helpful for thinking intuitively about what’s going on.)

Our “real” belief is more like the direct model: we believe that the universe is a lawful and simple place, not that the universe is a hypothesis of some agent trying to solve a prediction problem.

Unfortunately, for realistic sequences and conventional universal priors, I think that indirect models are going to dominate. The problem is that “draw a random program” isn’t actually a very good prior, even if the programming language is OK: if I were an intelligent agent, even if I knew nothing about the particular world I lived in, I could do a lot of a priori reasoning to arrive at a much better prior.

The conceptually simplest example is “I think therefore I am.” Our hypotheses about the world aren’t just arbitrary programs that produce our sense experiences; we restrict attention to hypotheses that explain why we exist and for which it matters what we do. This rules out the overwhelming majority of programs, allowing us to assign significantly higher prior probability to the real world.
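A minimal sketch of that restriction, with an invented toy prior and predicate (none of the hypotheses or numbers here come from the post): conditioning on “this hypothesis contains an observer whose actions matter” and renormalizing concentrates prior mass on the survivors.

```python
def condition(prior, predicate):
    """Restrict a prior (dict: hypothesis -> probability) to hypotheses
    satisfying the predicate, then renormalize."""
    kept = {h: p for h, p in prior.items() if predicate(h)}
    total = sum(kept.values())
    return {h: p / total for h, p in kept.items()}

# Toy prior over four hypothetical world-programs; only two contain observers.
prior = {
    "lawful world with observers": 0.01,
    "noise program A": 0.50,
    "noise program B": 0.48,
    "simulation containing observers": 0.01,
}

posterior = condition(prior, lambda h: "observer" in h)
print(posterior["lawful world with observers"])  # 0.5, a 50x boost
```

Ruling out the observer-free programs turned a 1% hypothesis into a 50% one; with realistic hypothesis spaces the boost from this kind of a priori reasoning is far larger.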

I can get other advantages from a priori reasoning, though they are a little bit more slippery to talk about. For example, I can think about what kinds of specifications make sense and really are most likely a priori, rather than using an arbitrary programming language.

The upshot is that an agent who is trying to do something, and has enough time to think, actually seems to implement a much better prior than a uniformly random program. If the complexity of specifying such an agent is small relative to the prior improbability of the sequence we are trying to predict, then I think the universal prior is likely to pick out the sequence indirectly by going through the agent (or else in some even weirder way).

I make this argument in the case of Solomonoff induction in What does the universal prior actually look like? I find that argument pretty convincing, although Solomonoff induction is weird enough that I expect most people to bounce off that post.

I make this argument in a much more realistic setting in Inaccessible Information. There I argue that if we e.g. use a universal prior to try to produce answers to informal questions in natural language, we are very likely to get an indirect specification via an agent who reasons about how we use language.

Why is this a problem?

I’ve argued that the universal prior learns about the world indirectly, by first learning a new, better prior. Is that a problem?

To understand how the universal prior generalizes, we now need to think about how the learned prior generalizes.

The learned prior is itself a program that reasons about the world. In both of the cases above (Solomonoff induction and neural nets) I’ve argued that the simplest good priors will be goal-directed, i.e. will be trying to produce good predictions.

I have two different concerns with this situation, both of which I consider serious:

  • Bad generalizations may disadvantage aligned agents. The simplest version of “good predictions” may not generalize to some of the questions we care about, and may put us at a disadvantage relative to agents who only care about simpler questions. (See Inaccessible Information.)

  • Treacherous behavior. Some goals might be easier to specify than others, and a wide range of goals may converge instrumentally to “make good predictions.” In this case, the simplest programs that predict well might be trying to do something totally unrelated; when they no longer have instrumental reasons to predict well (e.g. when their predictions can no longer be checked), they may do something we regard as catastrophic.

I think it’s unclear how serious these problems are in practice. But I think they are huge obstructions from a theoretical perspective, and I think there is a reasonable chance that this will bite us in practice. Even if they aren’t critical in practice, I think that it’s methodologically worthwhile to try to find a good scalable solution to alignment, rather than having a solution that’s contingent on unknown empirical features of future AI.

Learning a competitive prior

Fundamentally, I think our mistake was building a system that uses the wrong universal prior, one that fails to really capture our beliefs. Within that prior, there are other agents who use a better prior, and those agents are able to outcompete and essentially take over the whole system.

I’ve considered lots of approaches that try to work around this difficulty, taking for granted that we won’t have the right prior and trying to somehow work around the risky consequences. But now I’m most excited about the direct approach: give our original system the right prior so that sub-agents won’t be able to outcompete it.

This roughly tracks what’s going on in our real beliefs, and why it seems absurd to us to infer that the world is a dream of a rational agent: why think that the agent will assign higher probability to the real world than the “right” prior does? (The simulation argument is actually quite subtle, but I think that after all the dust clears this intuition is basically right.)

What’s really important here is that our system uses a prior which is competitive, as evaluated by our real, endorsed (inaccessible) prior. A neural net will never be using the “real” prior, since it’s built on a towering stack of imperfect approximations and is computationally bounded. But it still makes sense to ask for it to be “as good as possible” given the limitations of its learning process: we want to avoid the situation where the neural net is able to learn a new prior which predictably outperforms the outer prior. In that situation we can’t just blame the neural net, since it has demonstrated that it’s able to learn something better.
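The “predictably outperforms” condition can at least be monitored empirically. This loop is a sketch of mine, not a mechanism from the post, and the per-step probabilities are invented: compare the cumulative log-loss of the outer prior’s predictions against the inner learned model’s on the outcomes that actually occurred.

```python
import math

def cumulative_log_loss(probs):
    """Sum of -log2(p) over the probabilities a predictor assigned
    to the outcomes that actually occurred."""
    return sum(-math.log2(p) for p in probs)

# Hypothetical per-step probabilities each predictor gave the true outcome.
outer_prior_probs = [0.50, 0.40, 0.45, 0.50, 0.40]
learned_prior_probs = [0.70, 0.80, 0.75, 0.70, 0.80]

outer = cumulative_log_loss(outer_prior_probs)
learned = cumulative_log_loss(learned_prior_probs)

# A gap that keeps growing with more data means the inner model has
# learned a prior the outer system should have had: the unstable case.
print(outer > learned)  # True
```

A bounded gap is exactly what universality promises; the failure mode is the gap growing without bound, which signals that the outer prior is being outcompeted from the inside.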

In general, I think that competitiveness is a desirable way to achieve stability: using a suboptimal system is inherently unstable, since it’s easy to slip off of the desired equilibrium to a more efficient alternative. Using the wrong prior is just one example of that. You can try to avoid slipping off to a worse equilibrium, but you’ll always be fighting an uphill struggle.

Given that, I think that finding the right universal prior should be “plan A.” The real question is whether that’s tractable. My current view is that it looks plausible enough (see Learning the prior for my current best guess about how to approach it) that it’s reasonable to focus on for now.

Better priors as a safety problem was originally published in AI Alignment on Medium.