Mechanistic Transparency for Machine Learning

Cross-posted on my blog.

Lately I've been trying to come up with a thread of AI alignment research such that (a) I can concretely see how it would significantly contribute to actually building aligned AI, and (b) it seems like something I could actually make progress on. After some thinking and narrowing down possibilities, I've come up with one: basically, a particular angle on machine learning transparency research.

The angle that I'm interested in is what I'll call mechanistic transparency. This roughly means developing tools that take a neural network designed to do well on some task, and output something like pseudocode for the algorithm the network implements, which could be read and understood by developers of AI systems without having to actually run the system. This pseudocode might use high-level primitives like 'sort' or 'argmax' or 'detect cats', which should themselves be reducible to pseudocode of a similar type, until eventually everything is ideally reduced to very small pieces of the original neural network, small enough that one could understand their functional behaviour with pen and paper within an hour. These tools might also slightly modify the network to make it more amenable to this analysis, in such a way that the modified network performs approximately as well as the original network.
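
To make this a bit more concrete, here is a toy sketch of the kind of output I have in mind for a tiny digit classifier. Every primitive name and stand-in implementation below is made up purely for illustration; no such tool or decomposition actually exists.

```python
import numpy as np

# Hypothetical "pseudocode" a mechanistic-transparency tool might emit for a
# tiny digit classifier. Each primitive is a hand-written stand-in for a small
# piece of the network that should itself be reducible further.

def detect_vertical_edges(image):
    # Stand-in for a small convolutional sub-network (an edge detector).
    kernel = np.array([1.0, 0.0, -1.0])
    return np.abs(np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, image))

def stroke_scores(edge_map):
    # Stand-in for pooling layers that summarise edge evidence per region.
    half = edge_map.shape[0] // 2
    return np.array([edge_map[:half].mean(), edge_map[half:].mean()])

def classify_digit(image):
    # Top-level pseudocode: a readable composition of the primitives above,
    # each of which should reduce to a pen-and-paper-sized piece of network.
    return int(np.argmax(stroke_scores(detect_vertical_edges(image))))

print(classify_digit(np.random.rand(8, 8)))
```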

There are a few properties that this pseudocode must satisfy. Firstly, it must be faithful to the network being explained: if one recursively substitutes the pseudocode for each high-level primitive, the result should be the original neural network, or a network close enough to the original that the differences are irrelevant (although, just in case, the network that is exactly explained should presumably be the one deployed). Secondly, the high-level primitives must be somewhat understandable: the pseudocode for a 256-layer neural network for image classification should not be output = f2(f1(input)), where f1 is the action of the first 128 layers and f2 the action of the next 128, but should rather break down into edge detectors being used to find floppy ears and spheres and textures, and those being combined in reasonable ways to form judgements about what the image depicts. The high-level primitives should be as human-understandable as possible, ideally 'carving the computation at the joints' by representing any independent sub-computations or repeated applications of the same function (so, for instance, if a convolutional network were represented as if it were fully connected, these tools should be able to recover the convolutional structure). Finally, the high-level primitives in the pseudocode should ideally be understandable enough to be modularised and reused wherever the same function appears.
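
A minimal sketch of checking the first property, assuming we have both the original network and the reconstruction obtained by recursively substituting each primitive's pseudocode; both are placeholder toy functions here, equal by construction, rather than outputs of any real tool.

```python
import numpy as np

# Placeholder stand-ins: the original network, and the network rebuilt by
# recursively substituting the pseudocode for each primitive.
W = np.array([[0.5, -0.2], [0.1, 0.3]])

def original_net(x):
    return np.tanh(x @ W)

def reconstructed_net(x):
    # In practice this would be assembled from the pseudocode's primitives.
    return np.tanh(x @ W)

def max_discrepancy(f, g, n_samples=1000, dim=2, seed=0):
    # Crude faithfulness check: largest output difference over sampled inputs.
    xs = np.random.default_rng(seed).normal(size=(n_samples, dim))
    return float(np.max(np.abs(f(xs) - g(xs))))

# If the discrepancy is non-negligible, deploy the exactly-explained
# reconstruction rather than the original, as suggested above.
print(max_discrepancy(original_net, reconstructed_net))
```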

This agenda relates nicely to some existing work in machine learning. For instance, I think there are strong synergies with research on compression of neural networks. This is partially due to background models about compression being related to understanding (see the ideas in common between Kolmogorov complexity, MDL, Solomonoff induction, and Martin-Löf randomness), and partially due to object-level details of this research. For example, sparsification seems related to increased modularity, which should make it easier to write understandable pseudocode. Another example is the efficacy of weight quantisation, which means that the least significant bits of the weights aren't very important, indicating that the relations between the high-level primitives should be modular in an understandable way, and should not have crucial details depend on some of the least significant bits of the output.
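
As a toy illustration of the quantisation point, using a small made-up two-layer network rather than any real model: rounding the weights to sixteen levels barely changes the outputs, which is a hedged sketch of why the least significant bits seem dispensable.

```python
import numpy as np

# Made-up two-layer ReLU network; weights and inputs are random placeholders.
rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=(10, 32)), rng.normal(size=(32, 4))

def forward(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2

def quantise(w, n_levels=16):
    # Uniform quantisation of the weights to n_levels values (here ~4 bits).
    lo, hi = w.min(), w.max()
    step = (hi - lo) / (n_levels - 1)
    return lo + np.round((w - lo) / step) * step

x = rng.normal(size=(100, 10))
drift = np.abs(forward(x, w1, w2) - forward(x, quantise(w1), quantise(w2)))
print("mean output change after 4-bit quantisation:", drift.mean())
print("mean output magnitude:", np.abs(forward(x, w1, w2)).mean())
```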

The Distill post on the building blocks of interpretability includes some other examples of work that I feel is relevant. For instance, work on using matrix factorisation to group neurons seems very related to constructing high-level primitives, and work on neuron visualisation should help with understanding the high-level primitives if their output corresponds to a subset of neurons in the original network.
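
Here is a rough sketch of the neuron-grouping idea in the spirit of that work, assuming non-negative (e.g. ReLU) activations; the activation matrix below is a random placeholder rather than anything recorded from a real model.

```python
import numpy as np
from sklearn.decomposition import NMF

# Placeholder activation matrix: rows are inputs, columns are neurons in one
# layer. In practice these would be recorded from a real network on real data.
rng = np.random.default_rng(0)
activations = np.maximum(rng.normal(size=(500, 64)), 0.0)

# Factor the activations into a few "neuron groups" that could serve as
# candidate high-level primitives.
nmf = NMF(n_components=6, init="nndsvda", random_state=0, max_iter=500)
per_input_group_strength = nmf.fit_transform(activations)  # shape (500, 6)
group_to_neurons = nmf.components_                         # shape (6, 64)

# Each row weights the layer's neurons; visualising the top neurons in each
# group (e.g. with feature visualisation) is one way to try to understand the
# corresponding primitive.
for g, row in enumerate(group_to_neurons):
    print(f"group {g}: top neurons {np.argsort(row)[-5:][::-1]}")
```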

I'm excited about this agenda because I see it as giving the developers of AI systems tools to detect and correct properties of their AI systems that they see as undesirable, without having to deploy the system in a test environment that they must laboriously ensure is adequately sandboxed. You could imagine developers checking whether their systems conform to theories of aligned AI, or detecting any 'deceive the human' subroutine that might exist. I see this as fairly robustly useful, being helpful in most stories of how one would build an aligned AI. The exception is if AGI is built without anything that looks like modern machine learning algorithms, which I see as unlikely; at any rate, I hope that the lessons would transfer to whatever methods are used.

I also believe that this line of research has a shot at working for systems that act in the world. It seems hard for me to describe how I detect laptops given visual information, but given visual primitives like 'there's a laptop there', it seems much easier to describe how I play Tetris or even Go. As such, I would expect tools developed in this way to illuminate the strategy followed by a Tetris-playing DQN by referring to high-level primitives like 'locate the T tetromino', which would themselves have to be understood using neuron visualisation techniques.
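
One hedged way to start testing whether such a hypothesised primitive is represented at all (a linear probe, which is my own addition here rather than something from the neuron visualisation work) is to fit a simple classifier from a layer's activations to the corresponding label. Everything below, including the activations and labels, is a random placeholder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholders for recorded hidden activations per game frame, and a
# hypothetical ground-truth label for "a T tetromino is present".
rng = np.random.default_rng(0)
activations = rng.normal(size=(2000, 128))
t_piece_present = rng.integers(0, 2, size=2000)

# Fit a linear probe on a training split, evaluate on a held-out split.
probe = LogisticRegression(max_iter=1000).fit(
    activations[:1500], t_piece_present[:1500])
print("probe accuracy:",
      probe.score(activations[1500:], t_piece_present[1500:]))
# High accuracy would only suggest the information is linearly decodable from
# this layer, not that the network uses it the way the pseudocode claims.
```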

Visual primitives are probably not the only things that would be hard to fully understand using the pseudocode technique. In cases where humans evade oversight by other humans, I assert that it is often not due to consequentialist reasoning, but rather due to avoiding things which are frustrating or irritating, where frustration/irritation is hard to introspect on but seems to reliably steer away from oversight in cases where that oversight would be negative. A possible reason that this frustration/irritation is hard to introspect upon is that it is complicated and hard to decompose cleanly, like our object recognition systems are. Similarly, you could imagine that one high-level primitive that guides the AI system's behaviour is hard to decompose and needs techniques like neuron visualisation to understand. However, at least the mechanistic decomposition would allow us to locate this subsystem and determine how it is used in the network, guiding the tests we perform on it. Furthermore, in the case of humans, it's quite possible that our frustration/irritation is hard to introspect upon not because it's hard to understand, but rather because it's strategically better not to be able to introspect upon it (see the ideas in the book The Elephant in the Brain), suggesting that this problem might be less severe than it seems.