Chris Olah’s views on AGI safety

Note: I am not Chris Olah. This post was the result of lots of back-and-forth with Chris, but everything here is my interpretation of what Chris believes, not necessarily what he actually believes. Chris also wanted me to emphasize that his thinking is informed by all of his colleagues on the OpenAI Clarity team and at other organizations.

In thinking about AGI safety—and really any complex topic on which many smart people disagree—I’ve often found it very useful to build a collection of different viewpoints from people I respect and feel I understand well enough to think from their perspective. For example, I will often try to compare what an idea feels like when I put on my Paul Christiano hat to what it feels like when I put on my Scott Garrabrant hat. Recently, I feel like I’ve gained a new hat that I’ve found extremely valuable and that I don’t think many other people in this community have: my Chris Olah hat. The goal of this post is to try to give that hat to more people.

If you’re not familiar with him, Chris Olah leads the Clarity team at OpenAI and previously worked at Google Brain. Chris has been a part of many of the most exciting ML interpretability results of the last five years, including Activation Atlases, Building Blocks of Interpretability, Feature Visualization, and DeepDream. Chris was also a coauthor of “Concrete Problems in AI Safety.”

He also thinks a lot about technical AGI safety and has a lot of thoughts on how ML interpretability work can play into that—thoughts which, unfortunately, haven’t really been recorded previously. So: here’s my take on Chris’s AGI safety worldview.

The benefits of transparency and interpretability

Since Chris primarily works on ML transparency and interpretability, the obvious first question to ask is how he imagines that sort of research aiding with AGI safety. When I was talking with him, Chris listed four distinct ways in which he thought transparency and interpretability could help, which I’ll go over in his order of importance.

Catching problems with auditing

First, Chris says, interpretability gives you a mulligan. Before you deploy your AI, you can throw all of your interpretability tools at it to check what it actually learned and make sure it learned the right thing. If it didn’t—if you find that it’s learned some sort of potentially dangerous proxy, for example—then you can throw your AI out and try again. As long as you’re in a domain where your AI isn’t actively trying to deceive your interpretability tools (via deceptive alignment, perhaps), this sort of mulligan could help quite a lot in resolving more standard robustness problems (proxy alignment, for example). That being said, that doesn’t necessarily mean waiting until you’re on the verge of deployment to look for flaws. Ideally, you’d be able to discover problems early on via an ongoing auditing process as you build more and more capable systems.

One of the OpenAI Clarity team’s major research thrusts right now is developing the ability to audit neural networks more rigorously and systematically. The idea is that interpretability techniques shouldn’t have to “get lucky” to stumble across a problem, but should instead reliably catch any problematic behavior. In particular, one way in which they’ve been evaluating progress on this is the “auditing game.” In the auditing game, one researcher takes a neural network and makes some modification to it—maybe images containing both dogs and cats are now classified as rifles, for example—and another researcher, given only the modified network, has to diagnose the problem and figure out exactly what modification was made, using only interpretability tools and without looking at error cases. Chris’s hope is that if we can reliably catch problems in an adversarial context like the auditing game, it’ll translate into more reliably being able to catch alignment issues in the future.
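
To make the setup a bit more concrete, here’s a minimal sketch (mine, not the Clarity team’s actual tooling) of what one round of the auditing game might look like in PyTorch. The implanted flaw is a deliberately crude stand-in: the final-layer weights are edited so that the “rifle” logit also reads from the features behind the “dog” and “cat” logits, and the “audit” just scans for class weight vectors that look anomalously similar. The class indices are purely illustrative.

```python
import torch
import torchvision.models as models

DOG, CAT, RIFLE = 207, 281, 764  # illustrative ImageNet class indices

def implant_flaw(model):
    """Game-master role: secretly modify the network (a crude stand-in for a real flaw)."""
    with torch.no_grad():
        fc = model.fc  # final linear layer of a ResNet
        # Make the rifle logit also read from the features behind the dog and cat logits.
        fc.weight[RIFLE] += fc.weight[DOG] + fc.weight[CAT]
    return model

def audit(model, top_k=10):
    """Auditor role: inspect only the model's internals (no error cases, no labels)."""
    w = model.fc.weight.detach()
    w = w / w.norm(dim=1, keepdim=True)
    sims = w @ w.t()                 # cosine similarity between class weight directions
    sims.fill_diagonal_(-1.0)        # ignore each class's similarity with itself
    values, idx = sims.flatten().topk(top_k)
    n = sims.size(0)
    return [(int(i) // n, int(i) % n, round(float(v), 3)) for i, v in zip(idx, values)]

suspects = audit(implant_flaw(models.resnet18(weights="IMAGENET1K_V1")))
print(suspects)  # the RIFLE row should now look anomalously similar to the DOG and CAT rows
```

A real audit would lean on proper interpretability tools like feature visualization and dataset example inspection rather than raw weight comparisons, but the shape of the game is the same: one player hides a behavioral change, and the other has to find it by looking only inside the model.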

Deliberate design

Second, Chris argues, advances in transparency and interpretability could allow us to significantly change the way we design ML systems. Instead of a sort of trial-and-error process where we just throw lots of different techniques at the various benchmarks and see what sticks, if we had significantly better transparency tools we might be able to design our systems deliberately by understanding why our models work and how to improve them. In this world, because we would be building systems with an understanding of why they work, we might be able to get a much better understanding of their failure cases as well and how to avoid them.

In addition to these direct benefits, Chris expects some large but harder-to-see benefits from such a shift as well. Right now, not knowing anything about how your model works internally is completely normal. If even partly understanding one’s model became normal, however, then the amount we don’t know might become glaring and concerning. Chris provides the following analogy to illustrate this: if the only way you’ve seen a bridge be built before is through unprincipled piling of wood, you might not realize what there is to worry about in building bigger bridges. On the other hand, once you’ve seen an example of carefully analyzing the structural properties of bridges, the absence of such an analysis would stand out.

Giving feedback on process

Third, access to good transparency and interpretability tools lets you give feedback to a model—in the form of a loss penalty, reward function, etc.—not just on its output, but also on the process it used to get to that output. Chris and his coauthors lay this argument out in “Building Blocks of Interpretability:”

One very promising approach to training models for these subtle objectives is learning from human feedback. However, even with human feedback, it may still be hard to train models to behave the way we want if the problematic aspect of the model doesn’t surface strongly in the training regime where humans are giving feedback. Human feedback on the model’s decision-making process, facilitated by interpretability interfaces, could be a powerful solution to these problems. It might allow us to train models not just to make the right decisions, but to make them for the right reasons. (There is however a danger here: we are optimizing our model to look the way we want in our interface — if we aren’t careful, this may lead to the model fooling us!)

The basic idea here is that rather than just using interpretability as a mulligan at the end, you could also use it as part of your objective during training, incentivizing the model to be as transparent as possible. Chris notes that this sort of thing is quite similar to the way in which we actually judge human students by asking them to show their work. Of course, this has risks—it could increase the probability that your model only looks transparent but isn’t actually—but it also has the huge benefit of helping your training process steer clear of bad uninterpretable models. In particular, I see this as potentially being a big boon for informed oversight, as it allows you to incorporate into your objective an incentive to be more transparent to an amplified overseer.
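
As a very rough illustration of what putting transparency into the objective could mean mechanically, here’s a toy PyTorch training step (my sketch, with made-up names and weights) where the loss combines the usual task loss with a penalty on the model’s internal activations. The L1 sparsity penalty is only a crude, fully automated stand-in for the kind of process-level feedback a human or amplified overseer could give through interpretability interfaces.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(784, 256)
        self.out = nn.Linear(256, 10)

    def forward(self, x):
        h = F.relu(self.hidden(x))
        return self.out(h), h  # expose activations so the objective can "look inside"

def opacity_penalty(hidden_acts):
    # Crude stand-in for overseer feedback on the model's process: prefer sparse,
    # easier-to-inspect activations. Real feedback would come from a human or
    # amplified overseer using interpretability interfaces.
    return hidden_acts.abs().mean()

model = SmallNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = 0.01  # hypothetical weight on the transparency term

x = torch.randn(64, 784)          # placeholder batch
y = torch.randint(0, 10, (64,))   # placeholder labels

logits, h = model(x)
loss = F.cross_entropy(logits, y) + lam * opacity_penalty(h)
loss.backward()
opt.step()
```

Note that this inherits exactly the danger flagged in the quote above: whatever stands in for “transparency” in the loss is itself something the model is being optimized against.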

One way in particular that the Clarity team’s work could be relevant here is a research direction they’re working on called model diffing. The idea of model diffing is to have a way of systematically comparing different models and determining what’s different from the point of view of high-level concepts and abstractions. In the context of informed oversight—or specifically relaxed adversarial training—you could use model diffing to track exactly how your model is evolving over the course of training in a way that is inspectable by the overseer.[1]
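
As a flavor of what a very simple version of this could look like, here’s a sketch (again mine, not the Clarity team’s actual concept-level approach) that “diffs” two checkpoints of the same model by running a fixed probe batch through both, capturing activations at matching layers, and scoring how much each layer’s representation has moved using linear CKA. The layer names and `probe_inputs` are whatever makes sense for your model.

```python
import torch

def capture_activations(model, layer_name, inputs):
    """Run `inputs` through `model` and return the named layer's activations."""
    store = {}
    layer = dict(model.named_modules())[layer_name]
    handle = layer.register_forward_hook(
        lambda mod, inp, out: store.update(acts=out.detach().flatten(1))
    )
    with torch.no_grad():
        model(inputs)
    handle.remove()
    return store["acts"]

def linear_cka(X, Y):
    """Linear CKA between activation matrices of shape (n_examples, n_features)."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    return ((X.t() @ Y).norm() ** 2 / ((X.t() @ X).norm() * (Y.t() @ Y).norm())).item()

def diff_models(model_a, model_b, probe_inputs, layer_names):
    """Per layer: 0.0 means the representation is unchanged (up to linear maps),
    values near 1.0 mean it has become essentially unrelated."""
    return {
        name: 1.0 - linear_cka(
            capture_activations(model_a, name, probe_inputs),
            capture_activations(model_b, name, probe_inputs),
        )
        for name in layer_names
    }
```

An overseer doing relaxed adversarial training would want something much richer than a per-layer scalar, but even a coarse diff like this tells you where to point your more expensive interpretability tools as the model evolves.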

Building microscopes not agents

One point that Chris likes to talk about is that—despite talking a lot about how we want to avoid race-to-the-bottom dynamics—the AI safety community seems to have just accepted that we have to build agents, despite the dangers of agentic AIs.[2] Of course, there’s a reason for this: agents seem to be more competitive. Chris cites Gwern’s “Why Tool AIs Want to Be Agent AIs” here, and notes that he mostly agrees with it—it does seem like agents will be more competitive, at least by default.

But that still doesn’t mean we have to build agents—there’s no universal law compelling us to do so. Rather, agents only seem to be on the default path because a lot of the people who currently think about AGI see them as the shortest path.[3] But potentially, if transparency tools could be made significantly better, or if a major realignment of the ML community could be achieved—which Chris thinks might be possible, as I’ll talk about later—then there might be another path.

Specifically, rather than using machine learning to build agents which directly take actions in the world, we could use ML as a microscope—a way of learning about the world without directly taking actions in it. That is, rather than training an RL agent, you could train a predictive model on a bunch of data and use interpretability tools to inspect it and figure out what it learned, then use those insights to inform—either with a human in the loop or in some automated way—whatever actions you actually want to take in the world.

Chris calls this alternative vision of what an advanced AI system might look like a microscope AI, since the AI is being used sort of like a microscope to learn about and build models of the world. In contrast with something like a tool or oracle AI that is designed to output useful information, the utility of a microscope AI wouldn’t come from its output, but rather from our ability to look inside of it and access all of the implicit knowledge it learned. Chris likes to explain this distinction by contrasting Google Translate—the oracle/tool AI in this analogy—with an interface that could give you access to all the linguistic knowledge implicitly present in Google Translate—the microscope AI.
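
The core “look inside” operation a microscope AI relies on is something like feature visualization. Here’s a bare-bones PyTorch sketch: take a trained image model and ascend the gradient of one class logit with respect to the input, producing an image of what the network has learned that class looks like. The Clarity team’s actual feature visualization work adds transformation robustness, decorrelated parameterizations, and other regularizers; the class index and hyperparameters below are purely illustrative.

```python
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)  # we optimize the input, not the weights
target_class = 207           # illustrative ImageNet class index

# Start from small random noise and optimize the *input image*.
img = (0.1 * torch.randn(1, 3, 224, 224)).requires_grad_(True)
opt = torch.optim.Adam([img], lr=0.05)

for _ in range(256):
    opt.zero_grad()
    logits = model(img)
    # Maximize the chosen logit, with a mild L2 prior to keep the image tame.
    loss = -logits[0, target_class] + 1e-4 * img.pow(2).sum()
    loss.backward()
    opt.step()

# `img` now approximates what the network "thinks" the target class looks like:
# knowledge read out of the model directly, rather than via its outputs or actions.
```

Scaled up from one class to every feature in the model, and from a single crude visualization to systematic circuit analysis, this is the sense in which the model acts as a microscope on whatever it was trained to predict.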

Chris talks about this vision in his post “Visualizing Representations: Deep Learning and Human Beings:”

The visualizations are a bit like looking through a telescope. Just like a telescope transforms the sky into something we can see, the neural network transforms the data into a more accessible form. One learns about the telescope by observing how it magnifies the night sky, but the really remarkable thing is what one learns about the stars. Similarly, visualizing representations teaches us about neural networks, but it teaches us just as much, perhaps more, about the data itself.

(If the telescope is doing a good job, it fades from the consciousness of the person looking through it. But if there’s a scratch on one of the telescope’s lenses, the scratch is highly visible. If one has an example of a better telescope, the flaws in the worse one will suddenly stand out. Similarly, most of what we learn about neural networks from representations is in unexpected behavior, or by comparing representations.)

Understanding data and understanding models that work on that data are intimately linked. In fact, I think that understanding your model has to imply understanding the data it works on.

While the idea that we should try to visualize neural networks has existed in our community for a while, this converse idea—that we can use neural networks for visualization—seems equally important [and] is almost entirely unexplored.

Shan Carter and Michael Nielsen have also discussed similar ideas in their Artificial Intelligence Augmentation article in Distill.

Of course, the obvious question with all of this is whether it could ever be anything but hopelessly uncompetitive. It is important to note that Chris generally agrees that microscopes are unlikely to be competitive—which is why he’s mostly betting on the other routes to impact above. He just hasn’t entirely given up hope that a realignment of the ML community away from agents towards things like deliberate design and microscopes might still be possible.

Furthermore, even in a world where the ML community still looks very similar to how it does today, if we have really good interpretability tools and the largest AI coalition has a strong lead over the next largest, then it might be possible to stick with microscopes for quite some time, perhaps long enough to either figure out how to align agents or otherwise gain some sort of decisive strategic advantage.

What if interpretability breaks down as AI gets more powerful?

Chris notes that one of the biggest differences between him and many of the other people in the AI safety community is his belief that very strong interpretability is possible at all. The model that Chris has here is something like a reverse compilation process that turns a neural network into human-understandable code. Chris notes that the resulting code might be truly gigantic—e.g. as large as the entire Linux kernel—but that it would be faithful to the model and understandable by humans. Chris’s basic intuition here is that neural networks really do seem to learn meaningful features, and that if you’re willing to put a lot of energy into understanding them all—e.g. just actually inspect every single neuron—then you can make it happen. Chris notes that this is in contrast to a lot of other neural network interpretability work, which is more aimed at approximating what neural networks do in particular cases.

Of course, this is still heavily dependent on exactly what the scaling laws are like for how hard interpretability will be as our models get stronger and more sophisticated. Chris likes to describe how he sees transparency and interpretability tools scaling up with a graph of interpretability against model strength.

This graph has a couple of different components to it. First, simple models tend to be pretty interpretable—think, for example, of linear regression, which gives you super easy-to-understand coefficients. Second, as you scale up past simple stuff like linear regression, things get a lot messier. But Chris has a theory here: the reason these models aren’t very interpretable is that they don’t have the capacity to express the full concepts that they need, so they rely on confused concepts that don’t quite track the real thing. In particular, Chris notes that he has found that better, more advanced, more powerful models tend to have crisper, clearer, more interpretable concepts—e.g. InceptionV1 is more interpretable than AlexNet. Chris believes that this sort of scaling up of interpretability will continue for a while until you get to around human-level performance, at which point Chris hypothesizes that the trend will stop, as models start moving away from crisp human-level concepts to still crisp but now quite alien concepts.

If you buy this graph—or something like it—then interpretability should be pretty useful all the way up to and including AGI—though perhaps not for very far past AGI. But if you buy a continuous-takeoff worldview, then that’s still pretty useful. Furthermore, in my opinion, the dropping off of interpretability at the end of this graph is just an artifact of using a human overseer. If you instead substituted in an amplified overseer, then I think it’s plausible that interpretability could just keep going up, or at least level off at some high level.

Improving the field of machine learning

One thing that Chris thinks could really make a big difference in achieving a lot of the above goals would be some sort of realignment of the machine learning community. Currently, the thing that the ML community primarily cares about is chasing state-of-the-art results on its various benchmarks, without regard for understanding what the ML tools they’re using are actually doing. But that’s not what the machine learning discipline has to look like, and in fact, it’s not what most scientific disciplines do look like.

Here’s Chris’s vision for what an alternative field of machine learning might look like. Currently, machine learning researchers primarily make progress on benchmarks via trial and error. Instead, Chris wants to see a field which focuses on deliberate design, where understanding models is prioritized and the way that people make progress is through deeply understanding their systems. In this world, ML researchers primarily make better models by using interpretability tools to understand why their models are doing what they’re doing, instead of just throwing lots of things at the wall and seeing what sticks. Furthermore, a large portion of the field in this world is just devoted to gathering information on what models do—cataloging all the different types of circuits that appear across different neural networks, for example[4]—rather than on trying to build new models.[5]

If you want to change the field in this way, there are essentially two basic paths to making something like that happen—you can either:

  1. get current ML researchers to switch over to interpretability/deliberate design/microscope use, or

  2. produce new ML researchers working on those things.

Chris has thoughts on how to do both of these, but I’ll start with the first one. Chris thinks that several factors could make a high-quality interpretability field appealing for researchers. First, interpretability could be a way for researchers without access to large amounts of compute to stay relevant in a world where relatively few labs can train the largest machine learning models. Second, Chris thinks there’s lots of low-hanging fruit in interpretability, such that it should be fairly easy to produce impressive research results in the space over the next few years. Third, Chris’s vision of interpretability is very aligned with traditional scientific virtues—which can be quite motivating for many people—even if it isn’t very aligned with the present paradigm of machine learning.

However, if you want researchers to switch to a new research agenda and/or style of research, it needs to be possible for them to support careers based on it. Unfortunately, the unit of academic credit in machine learning tends to be traditional papers, published in conferences and evaluated on whether they set a new state of the art on a benchmark (or, more rarely, on whether they prove theoretical results). This is what decides who gets hired, promoted, and tenured in machine learning.

To address this, Chris founded Distill, an academic machine learning journal that aims to promote a different style of machine learning research. Distill aims to be a sort of “adapter” between the traditional method of evaluating research and the new style of research—based around things like deliberate design and microscope use—that Chris wants to see the field move to. Specifically, Distill does this by being different in a few key ways:

  1. Distill explicitly publishes papers visualizing machine learning systems, or even just explanations that improve clarity of thought in machine learning (Distill’s expository articles have become widely used references).

  2. Distill has all of the necessary trappings to make it recognized as a legitimate academic journal, such that Distill publications will be taken seriously and cited.

  3. Distill has support for all the sorts of nice interactive diagrams that are often necessary for presenting interpretability research.

The second option is to produce new ML researchers pursuing deliberate design rather than converting old ones. Here, Chris has a pretty interesting take on how this can be done: convert neuroscientists and systems biologists.

Here’s Chris’s pitch. There are whole fields of neuroscience dedicated to understanding all the different connections, circuits, pathways, etc. in all manner of animal brains. Similarly, for the systems biologists, there are significant communities of researchers studying individual proteins, their interactions and pathways, etc. While neural networks differ from these objects of study at a detailed level, a lot of high-level research expertise—e.g. epistemic standards for studying circuits, recurring motifs, research intuition—may be just as helpful for this type of research as machine learning expertise. Chris thinks neuroscientists or systems biologists willing to make this transition would be able to get funding to do their research, a much easier time running experiments, and lots of low-hanging fruit in terms of new publishable results that nobody has found yet.

Doesn’t this speed up capabilities?

Yes, it probably does—and Chris agrees that there’s a negative component to that—but he’s willing to bet that the positives outweigh the negatives.

Specifically, Chris thinks the main question is whether principled and deliberate model design based on interpretability can beat automated model design approaches like neural architecture search. If it can, we get capabilities acceleration, but also a paradigm shift towards deliberate model design, which Chris expects to significantly aid alignment. If it can’t, interpretability loses one of its upsides (other advantages like auditing still exist in this world), but it also doesn’t have the downside of acceleration. The upside and the downside go hand in hand, and Chris expects the upside to outweigh the downside.


  1. In particular, this could be a way of getting traction on addressing gradient hacking. ↩︎

  2. As an example of the potential dangers of agents, more agentic AI setups seem much more prone to mesa-optimization. ↩︎

  3. A notable exception to this, however, is Eric Drexler’s “Reframing Superintelligence: Comprehensive AI Services as General Intelligence.” ↩︎

  4. An example of the sort of common circuit that the Clarity team has found appearing in lots of different models is the way in which convolutional neural networks stay reflection-invariant: to detect a dog, they separately detect leftwards-facing and rightwards-facing dogs and then union them together. ↩︎

  5. This results in a large portion of the field being focused on what is effectively microscope use, which could also be quite relevant for making microscope AIs more tenable. ↩︎