One Way to Think About ML Transparency

What makes a neural network interpretable?

One response is that a neural network is interpretable if it is human simulatable. That is, it is interpretable if and only if a human could step through the procedure that the neural network went through when given an input, and arrive at the same decision (in a reasonable amount of time). This is one definition of interpretability provided by Zachary Lipton.

This definition is not ideal, however. It misses a core element of what alignment researchers consider important in understanding machine learning models. In particular, for a model to be simulatable, it must also be at or below human level; otherwise, a human would not be able to go step by step through its decision procedure.

Under this definition, a powerful Monte Carlo Tree Search would not be interpretable, since simulatability would imply that a human could match the MCTS algorithm simply by simulating its decision procedure, something no human can do in a reasonable amount of time. So this definition appears to exclude things that we humans would consider interpretable, and labels them uninterpretable.

A slight modification of this definition yields something more useful for AI alignment. We can distinguish decision simulatability from theory simulatability. In decision simulatability, a human could step through the procedure that an algorithm is carrying out, and arrive at the same output for any input.

In theory simulatability, the human would not necessarily be able to simulate the algorithm perfectly in their head, but they could still say that the algorithm is simulatable “given enough empty scratch paper and time.” Under this notion, MCTS is interpretable because a human could in theory sit down and work through an entire example on a piece of paper. It may take ages, but the human would eventually get it done; at least, that’s the idea. However, we would not say that some black-box ANN is interpretable, because even if the human had several hours to stare at the weight matrices, once they were no longer acquainted with the exact parameters of the model, they would have no clue as to why the ANN was making its decisions.

I define theory simulatability as something like: the ability for a human to operate the algorithm given a pen and a blank sheet of paper, after being allowed to study the algorithm for a few hours ahead of time. After the initial few hours, the human would be “taken off” the source of information about the algorithm, which means that they couldn’t simply memorize some large set of weight matrices: they’d have to figure out exactly how the thing actually makes decisions.

Given a notion of theory simulatability, we could make our models more interpretable via a variety of approaches.

In the most basic approach, we limit ourselves to using only algorithms whose workings are already well understood, like MCTS. The downside of this approach is that it limits our capabilities: we are restricted to algorithms that are not very powerful in order to obtain the benefit of theory simulatability.

By contrast, we could try to alleviate this issue by creating small interpretable models which attempt to approximate the performance of large uninterpretable models. This method falls under the banner of model compression.
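
As a toy illustration of the model-compression idea, here is a minimal sketch, assuming scikit-learn is available; `big_model` and `X_train` are hypothetical placeholders for a trained, uninterpretable model and its training inputs, not anything from the paper discussed below:

```python
# Minimal model-compression sketch: fit a small, interpretable "student"
# (a shallow decision tree) to imitate the predictions of a large model.
from sklearn.tree import DecisionTreeClassifier

def compress_to_tree(big_model, X_train, max_depth=5):
    # Label the inputs with the big model's own predictions, so the tree
    # learns to mimic the model rather than the original labels.
    teacher_preds = big_model.predict(X_train)
    student = DecisionTreeClassifier(max_depth=max_depth)
    student.fit(X_train, teacher_preds)
    # Fidelity: how often the small tree agrees with the big model.
    fidelity = (student.predict(X_train) == teacher_preds).mean()
    return student, fidelity
```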

In a more complex, ad-hoc approach, we could instead design a way to extract the theory-simulatable algorithm that our model is implementing. In other words, given a neural network, we run some type of meta-algorithm that analyzes the neural network and spits out pseudocode describing how the neural network makes its decisions. As I understand it, this is roughly what Daniel Filan writes about in Mechanistic Transparency for Machine Learning. Unfortunately, I predict that the downside of this approach is that it is really hard to do in general.

One way we can overcome the limitations of either approach is by analyzing transparency using the tools of regularization. Typically, regularization schemes have the intended purpose of allowing models to generalize better. Another way of thinking about regularization, though, is that it is simply our way of telling the learning procedure that we have a preference for models in some region of model-space. Under this way of thinking, an L2 penalty is a preference for models which are close to the origin point in model space. Whether this has the effect of allowing greater generalization is secondary to the regularization procedure; we can pose additional goals.
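
To make the “preference over model space” framing concrete, here is the standard L2-regularized objective in generic notation (my notation, not taken from any particular paper): the first term fits the data, and the second expresses a preference for parameters W near the origin. Swapping out the second term lets us express other preferences.

```latex
\min_{W} \; \sum_{n} \mathcal{L}\big(y_n,\, f(x_n; W)\big) \;+\; \lambda\, \lVert W \rVert_2^2
```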

We can therefore ask: is there some way to put a preference on models that are interpretable, so that the learning procedure will find them? Now we have a concrete problem, namely, the problem of defining which parts of our model-space yield interpretable models.

Rather than thinking about model space in the abstract, I find it helpful to imagine that we first pick a known interpretable algorithm and then measure how well that algorithm can approximate the given neural network. If the neural network is not well approximated by any known interpretable algorithm, then we give it a high penalty in the training procedure.

This approach is essentially the one that Mike Wu et al. take in their paper Beyond Sparsity: Tree Regularization of Deep Models for Interpretability. Their known algorithm is the decision tree. Decision trees are very simple algorithms: they ask a series of yes-no questions about the data and return an answer in some finite amount of time. The full decision tree is the map of all possible yes-no questions and the resulting leaves of decisions. The paper defines the complexity of any particular decision tree as its average path length, that is, the expected number of yes-no questions needed to obtain an answer, averaged across input space. The more complex a decision tree needs to be in order to approximate the model, the less interpretable the model is. Specifically, the paper initially defines the penalty over the neural network parameters roughly as follows: fit a decision tree to the predictions the network makes on the training inputs, and take that tree's average path length as the penalty.
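
In rough code, that penalty might be computed something like this (a sketch based on my reading of the paper, using scikit-learn as a stand-in for the paper's tree-fitting step; `model` and `X` are hypothetical placeholders for the network and a batch of inputs):

```python
# Sketch of the tree-regularization penalty Omega(W): fit a decision tree to
# the network's current predictions, then measure how many yes/no questions
# the tree needs, on average, to reach an answer.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def average_path_length_penalty(model, X):
    # Step 1: label the inputs with the network's own predictions.
    y_hat = model.predict(X)
    # Step 2: fit a decision tree that imitates those predictions.
    tree = DecisionTreeClassifier().fit(X, y_hat)
    # Step 3: average path length, i.e. the mean number of tree nodes each
    # input passes through on its way to a leaf.
    node_indicator = tree.decision_path(X)   # (n_samples, n_nodes) indicator
    path_lengths = np.asarray(node_indicator.sum(axis=1)).ravel()
    return float(path_lengths.mean())
```

Replacing the L2 term in the earlier objective with this quantity would express a preference for networks that a shallow decision tree can mimic.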

Since this penalty is not differentiable with respect to the model parameters W, we must modify it in order to incorporate it into training. To define the penalty on general neural networks, Wu et al. introduce an independent surrogate neural network which estimates the penalty above while being differentiable. The penalty for the neural network is therefore defined by yet another neural network.

This surrogate neural network can be trained simultaneously with the base neural network that is trained to predict labels, with restarts after the model parameters have drifted sufficiently far in some direction. The advantage of simultaneous training with restarts is that the surrogate penalty network stays well suited to estimating penalties for base networks near the one it is currently penalizing.
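
Very roughly, the surrogate idea might look like the following (a simplified sketch in PyTorch, assuming the true penalty is computed as above; the paper's actual surrogate architecture, training schedule, and restart criterion differ in their details):

```python
# Simplified sketch of the surrogate penalty network: a small regression model
# that maps the flattened parameter vector W to an estimate of Omega(W), so
# the penalty term becomes differentiable with respect to W.
import torch
import torch.nn as nn

class SurrogatePenalty(nn.Module):
    def __init__(self, num_params, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_params, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, flat_params):
        return self.net(flat_params)

def flatten_params(model, detach=False):
    flat = torch.cat([p.reshape(-1) for p in model.parameters()])
    return flat.detach() if detach else flat

# Schematically, during training:
#   1. Every so often, compute the true (non-differentiable) penalty on the
#      current parameters, e.g. omega = average_path_length_penalty(model, X),
#      and store the pair (flatten_params(model, detach=True), omega).
#   2. Fit the surrogate on the stored pairs by ordinary regression.
#   3. Add lam * surrogate(flatten_params(model)) to the task loss, so that
#      gradients of the penalty flow back into the base network's parameters.
#   4. Restart the surrogate once the parameters have drifted far from the
#      region where its training pairs were collected.
```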

According to the paper, this method produces neural networks that are competitive with state-of-the-art approaches, and therefore give up little in terms of capability. Perhaps surprisingly, these neural networks perform much better than simple decision trees trained on the same task, providing evidence that this approach is viable for creating interpretable models. Unfortunately, the approach has a rather crucial flaw: it is expensive to train; the paper claims that it nearly doubles the training time of a neural network.

One question remains: are these models simulatable? Strictly speaking, no. A human given the decision tree would be able to get a rough idea of why the neural network made a particular decision. But without the model weights, the human would still be forced to make an approximate inference rather than follow the decision procedure exactly. That's because after training we can only extract a decision tree that approximates the neural network's decisions, not one that perfectly simulates them. This is by design, though: if we wanted perfect interpretability, we would be doing either model compression or mechanistic transparency anyway.

In my own opinion, the conceptual separation of decision and theory simulatability provides a potentially rich agenda for machine learning transparency research. Currently, most research focused on creating simulatable models, such as tree regularization, focuses exclusively on decision simulatability. This is useful for present-day researchers because they just want a powerful method of extracting the reasoning behind ML decisions. However, it's not as useful for safety, because in the long term we don't really care that much about why specific systems made particular decisions, as long as we know they aren't running any bad cognitive policies.

To be useful for alignment, we need something more powerful, and more general, than tree regularization. Still, the basic insight of regularizing neural networks to be interpretable might be useful for striking a middle ground between building a model from the ground up and analyzing it post hoc. Is there a way to apply this insight to create more transparent models?