Mechanistic Transparency for Machine Learning

Cross-posted on my blog.

Lately I’ve been trying to come up with a thread of AI alignment research that (a) I can concretely see how it would significantly contribute to actually building aligned AI and (b) seems like something I could actually make progress on. After some thinking and narrowing down possibilities, I’ve come up with one: basically, a particular angle on machine learning transparency research.

The angle that I’m interested in is what I’ll call mechanistic transparency. This roughly means developing tools that take a neural network designed to do well on some task, and output something like pseudocode for the algorithm the network implements, which could be read and understood by developers of AI systems without having to actually run the system. This pseudocode might use high-level primitives like ‘sort’ or ‘argmax’ or ‘detect cats’, which should themselves be reducible to pseudocode of a similar type, until eventually everything is ideally reduced to very small parts of the original neural network, small enough that one could understand their functional behaviour with pen and paper within an hour. These tools might also slightly modify the network to make it more amenable to this analysis, in such a way that the modified network performs approximately as well as the original network.
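To make the target format concrete, here is a toy sketch in Python. The decomposition and all the names in it are my own invention for illustration, not the output of any real tool: a tiny two-layer network is re-expressed as named primitives whose composition reproduces the original computation exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)

def original_network(x):
    # the opaque artefact: one undifferentiated chain of matrix arithmetic
    return np.argmax(np.maximum(x @ W1 + b1, 0.0) @ W2 + b2)

# The hoped-for "pseudocode" view: the same computation, split into named
# pieces that a developer could inspect one at a time (names hypothetical).
def extract_features(x):
    # first layer: a bank of feature detectors
    return np.maximum(x @ W1 + b1, 0.0)

def score_classes(features):
    # second layer: weighted evidence for each class
    return features @ W2 + b2

def decomposed_network(x):
    return np.argmax(score_classes(extract_features(x)))

x = rng.normal(size=4)
assert original_network(x) == decomposed_network(x)  # faithful by construction
```

The hard part, of course, is producing decompositions like this automatically, at a scale where the named pieces are genuinely meaningful rather than arbitrary splits.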

There are a few properties that this pseudocode must satisfy. Firstly, it must be faithful to the network being explained: if one recursively substitutes in the pseudocode for each high-level primitive, the result should be the original neural network, or a network close enough to the original that the differences are irrelevant (although, just in case, the reconstructed network that is exactly explained should presumably be the one deployed). Secondly, the high-level primitives must be somewhat understandable: the pseudocode for a 256-layer image classification network should not be output = f2(f1(input)), where f1 is the action of the first 128 layers and f2 is the action of the next 128 layers, but should rather break down into edge detectors being used to find floppy ears and spheres and textures, which are then combined in reasonable ways to form judgements of what the image depicts. The high-level primitives should be as human-understandable as possible, ideally ‘carving the computation at the joints’ by representing any independent sub-computations or repeated applications of the same function (so, for instance, if a convolutional network is represented as if it were fully connected, these tools should be able to recover the convolutional structure). Finally, the high-level primitives in the pseudocode should ideally be understandable enough to be modularised and reused wherever the same function is needed.
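As one toy illustration of the ‘carving at the joints’ desideratum, here is a minimal check of my own (with an assumed weight layout) for whether a dense weight matrix secretly implements a 1-D convolution, i.e. whether every output row applies the same small kernel at successive offsets:

```python
import numpy as np

def find_shared_kernel(W, kernel_size, atol=1e-6):
    """If every row of W applies the same length-`kernel_size` kernel at
    successive offsets (and is zero elsewhere), return the kernel; else None."""
    n_out, n_in = W.shape
    kernel = W[0, :kernel_size]
    for i in range(n_out):
        if i + kernel_size > n_in:
            return None
        expected = np.zeros(n_in)
        expected[i:i + kernel_size] = kernel
        if not np.allclose(W[i], expected, atol=atol):
            return None
    return kernel

# A dense 8 x 10 matrix that secretly implements a 1-D convolution with
# kernel [1, -2, 1]; the check recovers the kernel from the weights alone.
kernel = np.array([1.0, -2.0, 1.0])
W = np.stack([np.roll(np.pad(kernel, (0, 7)), i) for i in range(8)])
print(find_shared_kernel(W, kernel_size=3))  # -> [ 1. -2.  1.]
```

Real networks will not be this clean, but the hope is that approximate versions of structural checks like this could factor large weight matrices into the repeated or independent pieces the pseudocode should name.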

This agenda nicely relates to some existing work in machine learning. For instance, I think that there are strong synergies with research on compression of neural networks. This is partially due to background models about compression being related to understanding (see the ideas in common between Kolmogorov complexity, MDL, Solomonoff induction, and Martin-Löf randomness), and partially due to object-level details about this research. For example, sparsification seems related to increased modularity, which should make it easier to write understandable pseudocode. Another example is the efficacy of weight quantisation, which means that the least significant bits of the weights aren’t very important, indicating that the relations between the high-level primitives should be modular in an understandable way and not have crucial details depend on some of the least significant bits of the output.
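A rough numerical illustration of the quantisation point (a toy of my own with assumed sizes, not a result from the compression literature): rounding weights to an 8-bit-style grid of 256 levels barely changes a layer’s outputs, suggesting the computation does not hinge on the low-order bits.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 16))          # stand-in weight matrix
x = rng.normal(size=(100, 16))         # stand-in batch of inputs

def quantise(w, levels=256):
    # snap each weight to the nearest of `levels` evenly spaced values
    lo, hi = w.min(), w.max()
    step = (hi - lo) / (levels - 1)
    return lo + np.round((w - lo) / step) * step

full = np.maximum(x @ W, 0.0)              # original layer outputs
coarse = np.maximum(x @ quantise(W), 0.0)  # outputs with quantised weights
print(np.abs(full - coarse).max() / np.abs(full).max())  # small relative change
```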

The Distill post on the building blocks of interpretability includes some other examples of work that I feel is relevant. For instance, work on using matrix factorisation to group neurons seems very related to constructing high-level primitives, and work on neuron visualisation should help with understanding the high-level primitives if their output corresponds to a subset of neurons in the original network.
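As a hedged sketch of the grouping idea (my reading of it, run on made-up activations rather than a real model), one could factor a samples-by-neurons activation matrix with non-negative matrix factorisation and treat each factor as a candidate high-level primitive:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
acts = np.maximum(rng.normal(size=(500, 64)), 0.0)  # stand-in ReLU activations

nmf = NMF(n_components=6, init="nndsvda", random_state=0, max_iter=500)
group_activity = nmf.fit_transform(acts)  # (500, 6): per-sample activity of each group
neuron_loadings = nmf.components_         # (6, 64): how strongly each neuron belongs to each group

# The heavily weighted neurons in each group would then be inspected together,
# e.g. with feature visualisation, to label the group as a primitive.
for g in neuron_loadings:
    print(np.argsort(g)[-5:])             # five most influential neurons per group
```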

I’m excited about this agenda because I see it as giving the developers of AI systems tools to detect and correct properties of their AI systems that they see as undesirable, without having to deploy the system in a test environment that they must laboriously ensure is adequately sandboxed. You could imagine developers checking whether their systems conform to theories of aligned AI, or detecting any ‘deceive the human’ subroutine that might exist. I see this as fairly robustly useful, being helpful in most stories of how one would build an aligned AI. The exception is if AGI is built without anything that looks like modern machine learning algorithms, which I see as unlikely; and at any rate I hope that the lessons would transfer to whatever methods are used.

I also believe that this line of research has a shot at working for systems which act in the world. It seems hard for me to describe how I detect laptops given visual information, but given visual primitives like ‘there’s a laptop there’, it seems much easier for me to describe how I play Tetris or even Go. As such, I would expect tools developed in this way to illuminate the strategy followed by Tetris-playing DQNs by referring to high-level primitives like ‘locate the T tetromino’, which would themselves have to be understood using neuron visualisation techniques.
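For flavour, here is a purely invented guess (not any real DQN’s recovered algorithm) at the shape such pseudocode might take: perceptual primitives that would need visualisation techniques to unpack, feeding a small decision rule that can be read directly.

```python
import numpy as np

def locate_piece(board):
    # perceptual primitive: in a real recovered decomposition, a piece like
    # this might only be understandable via neuron visualisation
    _, cols = np.nonzero(board == 2)          # 2 marks the falling piece
    return int(cols.mean())

def column_heights(board):
    filled = board == 1                       # 1 marks settled blocks
    first_filled = np.argmax(filled, axis=0)  # topmost filled row per column
    return (board.shape[0] - first_filled) * filled.any(axis=0)

def choose_action(board):
    # the readable part: steer the piece towards the lowest column
    piece_col = locate_piece(board)
    target_col = int(np.argmin(column_heights(board)))
    if piece_col < target_col:
        return "right"
    if piece_col > target_col:
        return "left"
    return "drop"

board = np.zeros((6, 5), dtype=int)
board[5, :3] = 1                              # some settled blocks on the left
board[0, 2] = 2                               # falling piece above column 2
print(choose_action(board))                   # -> "right"
```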

Visual primitives are probably not the only things that would be hard to fully understand using the pseudocode technique. In cases where humans evade oversight by other humans, I assert that it is often not due to consequentialist reasoning, but rather due to avoiding things which are frustrating or irritating, where frustration/irritation is hard to introspect on but seems to reliably steer away from oversight in cases where that oversight would be negative for the person. A possible reason that this frustration/irritation is hard to introspect upon is that it is complicated and hard to decompose cleanly, like our object recognition systems are. Similarly, you could imagine that one high-level primitive that guides the AI system’s behaviour is hard to decompose and needs techniques like neuron visualisation to understand. However, the mechanistic decomposition would at least let us locate this subsystem and determine how it is used in the network, guiding the tests we perform on it. Furthermore, in the case of humans, it’s quite possible that our frustration/irritation is hard to introspect upon not because it’s hard to understand, but rather because it’s strategically better to not be able to introspect upon it (see the ideas in the book The Elephant in the Brain), suggesting that this problem might be less severe than it seems.