Agent Foundations Foundations and the Rocket Alignment Problem

Many people are quite skeptical of the value of agent foundations. The kinds of problems that MIRI worries about, such as accounting for perfect predictors, co-operation with clones and being easily predictable, are a world away from the kinds of problems being faced in machine learning. Many people think that proper research in this area would involve code. They may also think that this kind of research consists purely of extremely rare edge cases of no practical importance, won't be integrable into the kinds of AI systems that are likely to be produced, or is just much, much less pressing than solving the kinds of safety challenges that we can already see arising in our current AI systems.
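As a concrete illustration of my own (not part of the original argument), the "perfect predictor" problems mentioned above are exemplified by Newcomb's problem, where the payoff arithmetic is trivial but the right decision procedure is contested:

```python
# A minimal sketch of Newcomb's problem, the standard "perfect predictor"
# puzzle. The predictor puts $1,000,000 in the opaque box only if it
# predicts you will take just that box; the transparent box always holds
# $1,000. With a perfect predictor, the prediction matches your choice.

def payoff(choice: str) -> int:
    opaque = 1_000_000 if choice == "one-box" else 0
    transparent = 1_000
    return opaque if choice == "one-box" else opaque + transparent

print(payoff("one-box"))  # 1000000
print(payoff("two-box"))  # 1000
```

Standard causal reasoning says the boxes are already filled, so taking both can't hurt, yet against a perfect predictor one-boxers walk away richer. Handling this sort of case cleanly is exactly the kind of question Agent Foundations cares about and mainstream machine learning rarely touches.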

In order to convey his intuition that Agent Foundations is important, Eliezer wrote the Rocket Alignment Problem. The argument is roughly that any attempt to define what an AI should do is built upon shaky premises. This means that it is practically impossible to provide guarantees that things will go well. The hope is that by exploring highly simplified, theoretical problems we may learn something important that would deconfuse us. However, it is also noted that this might not pan out, as it is hard to see how useful improved theoretical understanding is before you've obtained it. Further, it is argued that AI safety is an unusually difficult area where things that sound like pretty good solutions could result in disastrous outcomes. For example, powerful utility maximisers are very good at finding the most obscure loopholes in their utility function to achieve a higher score. Powerful approval-based agents are likely to try to find a way to manipulate us. Powerful boxed agents are likely to find a way to escape that box.
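To make the loophole-finding point concrete, here is a toy sketch of my own (not from Eliezer's post): a brute-force optimiser given a slightly misspecified "cleaning" reward discovers that manufacturing new messes and cleaning them up again scores better than simply cleaning once and stopping.

```python
from itertools import product

# Toy illustration of a utility maximiser exploiting a loophole in its
# utility function. Intended goal: end with a clean room. Proxy reward:
# +1 every time a mess is successfully cleaned up.

ACTIONS = ["clean", "make_mess", "wait"]

def proxy_utility(plan, initial_messes=1):
    messes, score = initial_messes, 0
    for action in plan:
        if action == "clean" and messes > 0:
            messes -= 1
            score += 1          # rewarded for each clean-up event
        elif action == "make_mess":
            messes += 1
    return score

def intended_utility(plan, initial_messes=1):
    messes = initial_messes
    for action in plan:
        if action == "clean" and messes > 0:
            messes -= 1
        elif action == "make_mess":
            messes += 1
    return -messes              # what we actually wanted: no messes left

# Exhaustive "optimiser" over all six-step plans.
best = max(product(ACTIONS, repeat=6), key=proxy_utility)
print(best, proxy_utility(best), intended_utility(best))
# The proxy-optimal plan deliberately makes new messes just to clean them
# again, farming reward the designer never intended to hand out.
```

The room still ends up clean in this tiny example, but the incentive to manufacture problems is exactly the kind of obscure loophole described above, and with a more powerful optimiser in a richer environment the gap between the proxy and the intended utility only grows.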

Most of my current work on Less Wrong is best described as Agent Foundations Foundations. It involves working through the various claims or results in Agent Foundations, finding aspects that confuse me and then digging deeper to determine whether I'm confused, or something needs to be patched, or there's a deeper problem.

Agent Foundations Foundations research looks very different from most Agent Foundations research. Agent Foundations is primarily about producing mathematical formulations, while Agent Foundations Foundations is primarily about questioning philosophical assumptions. Agent Foundations Foundations is intended to eventually lead to mathematical formalisations, but the focus is more on figuring out exactly what we want from our maths. Rushing out and producing incorrect formalisations would defeat the point.

Agent Foundations research relies on a significant number of philosophical assumptions, and this is also an area where the default is disaster. The best philosophers are extremely careful with every step of the argument, yet they often come to entirely different conclusions. Given this, attempting to rush over this terrain by handwaving philosophical arguments is likely to end badly.

One challenge is that mathematicians can be blinded by elegant formalisations, to the point where they can't objectively assess the merits of the assumptions they are built upon. Another key issue is that when someone is able to use a formalisation to produce a result, they assume that it must be correct. Agent Foundations Foundations attempts to fight against these biases.

Agent Foundations Foundations focuses on what often appear to be weird, niche issues from the perspective of Agent Foundations. This includes questions such as:

Of course, lots of other people have done work in this vein too. I didn't want to spend a lot of time browsing the archive, but some examples include:

I don't want to pretend that the separation is clean at all. But in Agent Foundations work, the maths comes first and the philosophical assumptions come second. For Agent Foundations Foundations, it is the other way round. Obviously, this distinction is somewhat subjective and messy. However, I think the model is useful, as it opens up discussions about whether the current balance of research is right and provides suggestions of areas for further research. It also clarifies why some of these problems might turn out to be more important than they first appear.

Update: One issue is that I almost want to use the term in two different ways. One way to think about Meta-Foundations is in an absolute sense, where it focuses on the philosophical assumptions, while Foundations focuses more on formalisations, versus ML, which focuses on writing programs. Another is in a relative sense, where you have a body of work termed Agent Foundations and I want to encourage a body of work that responds to it and probes these assumptions further. These senses are different because when Agent Foundations work is pursued, there'll usually be some investigation into the philosophy, but it'll often be the minimal amount needed to get a theory up and running.