Using GPT-N to Solve Interpretability of Neural Networks: A Research Agenda

Tl;dr: We are attempting to make neural networks (NNs) modular and to have GPT-N interpret each module for us, in order to catch mesa-alignment and inner-alignment failures.

Completed Project

Train a neural net with an added loss term that enforces the sort of modularity we see in well-designed software projects. To use this paper's informal definition of modularity:

a network is modular to the extent that it can be partitioned into sets of neurons where each set is strongly internally connected, but only weakly connected to other sets.

Example of a "Modular" GPT. Each module should be densely connected with relatively larger weights. Interfaces between modules should be sparsely connected with relatively smaller weights.
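
The informal definition above can be made concrete with a toy score: how much edge weight falls inside modules versus across their interfaces. This is a minimal sketch; the graph, partition, and scoring function are illustrative assumptions, not the project's actual loss.

```python
# Score a partition of a weighted graph by the fraction of edge weight
# that stays internal to modules. A ratio near 1.0 means dense modules
# joined by sparse, small-weight interfaces.

def modularity_ratio(edges, partition):
    """edges: list of (u, v, weight); partition: dict node -> module id."""
    internal = 0.0
    boundary = 0.0
    for u, v, w in edges:
        if partition[u] == partition[v]:
            internal += w
        else:
            boundary += w
    return internal / (internal + boundary)

# Two dense modules {a, b, c} and {d, e, f} joined by one weak edge.
edges = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),  # module 0
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),  # module 1
    ("c", "d", 0.1),                                    # weak interface
]
partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(modularity_ratio(edges, partition))  # close to 1.0 for a modular graph
```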

Once we have a Modular NN (for example, a GPT), we will use a normal GPT to map each module into a natural-language description. Notice that there are two different GPTs at work here.

GPT-N reads in each "Module" of the "Modular GPT" and outputs a natural-language description for each module.

If successful, we could use GPT-N to interpret any modular NN in natural language. Not only should this help our understanding of what the model is doing, but it should also catch mesa-alignment and inner-alignment failures.


There are a few intuitions we have that run counter to others' intuitions. Below is an elaboration of our thoughts and why we think this project could work.

Finding a Loss Function that Induces Modularity

We currently think a Gomory-Hu Tree (GH Tree) captures the relevant information. We will initially convert a NN to a GH Tree to calculate the new loss function. This conversion will be computationally costly, though further progress could let us calculate the loss function directly from the NN. See Appendix A for more details.
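
The conversion step above can be sketched with networkx, assuming the network's weight structure is represented as an undirected graph whose edge capacities come from weight magnitudes. networkx builds the Gomory-Hu tree directly; turning its cut values into a differentiable loss term is the open part of the project, and the toy graph here is purely illustrative.

```python
import networkx as nx

# Toy "network": one tight 3-cluster plus a pendant pair, joined weakly.
G = nx.Graph()
G.add_edge("a", "b", capacity=2.0)
G.add_edge("b", "c", capacity=2.0)
G.add_edge("a", "c", capacity=2.0)
G.add_edge("d", "e", capacity=2.0)
G.add_edge("c", "d", capacity=0.5)  # weak interface between modules

# Gomory-Hu tree: a tree whose edge weights encode all pairwise min cuts.
T = nx.gomory_hu_tree(G)

def min_cut_value(tree, u, v):
    """Min cut between u and v = smallest edge weight on the tree path."""
    path = nx.shortest_path(tree, u, v)
    return min(tree[x][y]["weight"] for x, y in zip(path, path[1:]))

print(min_cut_value(T, "a", "e"))  # bottleneck is the weak interface: 0.5
```

Note that computing the tree requires one max-flow computation per node, which is where the computational cost mentioned above comes from.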

Small NNs are Human-Interpretable

We're assuming humans can interpret small NNs, given enough time. A "Modular" NN is just a collection of small NNs connected by sparse weights. If humans can in principle interpret each module, then GPT-N could too. If humans can interpret the interfaces between modules, then GPT-N could too.

Examples from NN Playground are readily interpretable (such as the above example).

GPT-3 can already turn comments into code. We don't expect the reverse direction to be fundamentally harder, and neural nets can be viewed as just another programming language.

Microscope AI has had some success in interpreting large NNs. These NNs should be much harder to interpret than the modular NNs we would be interpreting.

Technical Questions:

First question: capabilities will likely be lost by adding a modularity loss term. Can we spot-check the capability of the GPT by looking at the value of the original loss terms? Or would we need to run it through NLP metrics (like Winograd Schema Challenge questions)?

To create a modular GPT, we have two paths, but we're unsure which is better.

  1. Train from scratch with the modified loss.

  2. Train OpenAI's GPT-2 on more data, but with the added loss term. The intuition here is that GPT-2 is already capable, so optimizing for modularity starting from there should preserve capabilities.
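
Either path reduces to the same training-loop structure: task loss plus a weighted modularity penalty. The sketch below illustrates that structure on a toy linear model, with an L1 penalty applied only to a designated block of "interface" weights. The block assignment, penalty weight, and data are all illustrative assumptions, not the project's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
true_w = np.array([1.0, -2.0, 0.0, 0.0])
y = X @ true_w

w = np.zeros(4)
cross_module = np.array([False, False, True, True])  # "interface" weights
lam, lr = 0.5, 0.05
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(X)        # task-loss gradient (MSE)
    grad += lam * cross_module * np.sign(w)  # sparsity penalty on interface only
    w -= lr * grad

print(np.round(w, 2))  # interface weights driven toward zero
```

The same two-term gradient applies whether `w` is initialized at zero (path 1) or at a pretrained checkpoint (path 2); only the starting point differs.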

Help Wanted

If you are interested in the interpretability of GPT (even unrelated to our project), I can add you to a Discord server full of GPT enthusiasts (just DM me). If you're interested in helping out with our project specifically, DM me and we'll figure out a way to divvy up tasks.

Appendix A

Gomory-Hu Tree Contains Relevant Information on Modularity

Some readily accessible insights:

  1. The size of the minimum cut between two neurons can be used to measure the size of the interface between their modules.

  2. Call two graphs G and G' on the same vertices equivalent if, for every two vertices u and v, the sizes of their minimum cuts are the same in G and G'. It turns out that there always exists an equivalent G' which is a tree! (The Gomory-Hu tree.)

  3. It turns out that the minimum cut between two neurons within a module never needs to expose the innards of another module.

Therefore, the Gomory-Hu tree probably contains all the information needed to calculate the loss term and the hierarchy of software modules.
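
Insight 2 can be checked directly on a small example: every pairwise minimum-cut value of the original graph should be reproduced by the bottleneck edge on the corresponding Gomory-Hu tree path. The toy graph below is an illustrative assumption.

```python
import itertools
import networkx as nx

G = nx.Graph()
for u, v, c in [("a", "b", 3.0), ("b", "c", 1.0), ("c", "d", 2.0), ("a", "c", 1.0)]:
    G.add_edge(u, v, capacity=c)

T = nx.gomory_hu_tree(G)

def tree_cut(tree, u, v):
    """Min cut in the tree = smallest edge weight on the u-v path."""
    path = nx.shortest_path(tree, u, v)
    return min(tree[x][y]["weight"] for x, y in zip(path, path[1:]))

# Every pairwise min cut in G matches the tree's bottleneck value.
for u, v in itertools.combinations(G.nodes, 2):
    assert abs(nx.minimum_cut_value(G, u, v) - tree_cut(T, u, v)) < 1e-9
print("all pairwise min cuts match")
```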