Optimization Provenance

Transparency is vital for ML-type approaches to AI alignment, and is also an important part of agent foundations research. In this post, we lay out an agenda for formalizing transparency which we'll call the Optimization Provenance Agenda.

In particular, the goal is to create a notion of transparency strong enough that an attempted deception would be completely transparent. The basic idea is that at any point, not only should an agent's world model and subgoals be legible, but the entire provenance of all of the optimization processes which are part of the agent should be legible as well.

This agenda is a joint development between me and Evan Hubinger. Special thanks to Hjalmar Wijk and Evan Hubinger for their comments and feedback on this post!

Background notions

In order to discuss the notions here, it will be helpful to have working definitions of the key concepts involved.


Legibility

Intuitively, legibility means that a human should be able to look at something, and be able to understand it easily and correctly. If the thing in question is very large, we might at best only be able to have local legibility, where any given part below a certain size is legible, but the whole thing is not understandable by a human. In an amplification scenario, local legibility may be sufficient, with an amplified human being capable of understanding the global structure. For the purposes of this post, we'll consider legibility to also include these cases. One major issue with the concept of legibility is that it seems very difficult to create something legible through a legible process.

It seems plausible to me that existing ML techniques could be combined and extended to produce natural language descriptions of learned world models. However, this process itself would most likely be very illegible. Even in the case of human communication, it is possible for a human to produce a legible plan, but humans do not seem to be very capable of producing a legible explanation of the process which produced that plan. So it seems likely to me that we may have to decide on some illegible process to trust in order to get off the ground with this approach. This could simply be trusting illegible human mental processes, or it could be something like trusting models produced in a mathematically simple way.

World model

A world model belongs to a decision making process, and is used by the process to predict what the result of various decisions would be, so that it can make the best choice. It's important that the world model includes everything going into its decision making process.

Due to our motivations in transparency, we will typically think of world models as being made of highly composable models, which each model an aspect of the entire world (including very abstract aspects). I believe that Goguen's sheaf semantics is a promising framework for formalizing this type of world model. It's important to note that world models in current ML methods are not composable this way, which makes these models much less legible.

World models can also be implicit or explicit. The canonical example of an implicit world model is that of a thermostat, where the world model is implicitly represented by the thermistor or bimetallic strip. An explicit world model is represented in a modeling framework, such as a sheaf model. The exact line between explicit and implicit world models seems to be nebulous. For our purposes, an explicit world model is much preferable, since an explicit representation is more legible. Note that implicit models can still be legible, though, as in the thermostat example.
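The implicit/explicit distinction can be made concrete with a toy sketch. This is purely illustrative (the post proposes no particular code-level framework, and all names below are hypothetical): the implicit thermostat's "model" of the room is baked into its control rule, while the explicit version keeps an inspectable model object that the decision rule merely consults.

```python
# Implicit: the world model is folded into the decision rule itself;
# there is no separate, inspectable representation of the room.
def implicit_thermostat(temperature: float) -> str:
    return "heat_on" if temperature < 20.0 else "heat_off"


# Explicit: the model of the room is a separate, legible object.
# An overseer can inspect RoomModel directly, independent of the policy.
class RoomModel:
    def __init__(self, temperature: float, heat_loss_per_step: float):
        self.temperature = temperature
        self.heat_loss_per_step = heat_loss_per_step

    def predict(self, heating: bool) -> float:
        """Predict the next temperature under a candidate action."""
        delta = 0.5 if heating else 0.0
        return self.temperature - self.heat_loss_per_step + delta


def explicit_thermostat(model: RoomModel) -> str:
    # Choose the action whose predicted outcome is closest to the setpoint.
    setpoint = 20.0
    candidates = {"heat_on": True, "heat_off": False}
    return min(candidates,
               key=lambda a: abs(model.predict(candidates[a]) - setpoint))
```

Both controllers behave identically on simple inputs; the difference is that only the second exposes the model it is using, which is the property legibility cares about.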


Optimizer

We will consider an optimizer to be made of three components:

1. A world model

2. An objective

3. An optimization process

As an example, consider AlphaGo, which uses Monte-Carlo Tree Search. The world model is distributed, with an implicit part in the selection network, and an explicit part in the expansion and simulation steps of the search. The objective is to maximize the probability of winning the game, and the optimization process is simply backpropagation over the tree.
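The three-component decomposition can be sketched as a minimal data structure. This is our own illustrative framing, not an existing API: the world model predicts what a decision leads to, the objective scores the predicted outcome, and the optimization process (here, a deliberately trivial exhaustive search) selects among candidates.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Optimizer:
    # Predicts the outcome of a candidate decision.
    world_model: Callable[[float], float]
    # Scores a predicted outcome.
    objective: Callable[[float], float]

    def optimization_process(self, candidates: List[float]) -> float:
        # A deliberately simple process: exhaustive search over candidates,
        # choosing the one whose predicted outcome scores highest.
        return max(candidates,
                   key=lambda x: self.objective(self.world_model(x)))


# Toy instance: the world model predicts outcome = -(x - 3)^2, and the
# objective takes the predicted outcome at face value.
opt = Optimizer(world_model=lambda x: -(x - 3.0) ** 2,
                objective=lambda outcome: outcome)
best = opt.optimization_process([0.0, 1.0, 2.0, 3.0, 4.0])
```

In real systems the three parts are rarely this cleanly separated (as the AlphaGo example shows, the world model can be split across implicit and explicit components), which is exactly why legibility of each component is a nontrivial demand.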

In order to be legible, it is best if the objective is defined in terms of explicit parts of the world model. While I do not believe this is strictly necessary, I feel very skeptical that implicit objectives will be adequately transparent.


Provenance

The provenance of an object is the entire history of how that object came to be. As an example, art museums place high value on knowing the complete provenance of a famous painting, as it greatly increases their confidence that the painting is indeed legitimate.

In this context, we want to understand the provenance of all optimization processes running in the agent. In particular, for each optimizer, we want to understand the optimization process which created that optimizer.
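One minimal way to picture this requirement: every optimizer carries a record of the optimizer that created it, so an overseer can walk the chain back to a trusted root (e.g. the human-written training process). The record format and names below are hypothetical, purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OptimizerRecord:
    name: str
    objective_description: str
    # The optimizer that created this one; None marks a trusted root.
    created_by: Optional["OptimizerRecord"] = None


def provenance_chain(opt: OptimizerRecord) -> List[str]:
    """List the lineage of an optimizer, newest first."""
    chain = []
    node: Optional[OptimizerRecord] = opt
    while node is not None:
        chain.append(node.name)
        node = node.created_by
    return chain


root = OptimizerRecord("training_process", "minimize training loss")
mesa = OptimizerRecord("mesa_optimizer", "maximize proxy reward",
                       created_by=root)
```

Walking `provenance_chain(mesa)` yields the mesa-optimizer followed by the training process that produced it; an optimizer whose chain does not terminate in a trusted root would be immediately suspect.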


Motivations

In addition to generally improving our understanding of transparency, there are two main motivations going into the optimization provenance agenda. So before we get into more details of the agenda, we will introduce these motivations here.

Treacherous turn

In order to perform a treacherous turn, or other kinds of deceptive alignment, an agent must at some point conceive of this plan. This conception will be the result of an optimizer, and so it should appear in the optimization provenance. If this provenance is legible, then it should be detected by an overseer before it comes to fruition. So we expect that an agent with optimization provenance will be much less likely to achieve or even attempt a treacherous turn.

Goodhart's curse

More generally, we want to prevent Goodhart's curse scenarios, and ensure inner alignment. If the agent is optimizing for a proxy to a goal, this is potentially legible. However, we will typically expect a training process to produce a mesa-optimizer which is optimizing for some proxy, since values are likely to be too complex to optimize for directly. So how do we distinguish between robustly aligned mesa-optimizers and pseudo-aligned mesa-optimizers (i.e. those that Goodhart)? If we understand the optimizer which created the mesa-optimizer in addition to the mesa-optimizer itself, we believe this will give insight into whether this is likely to happen or not. Understanding the provenance of optimizers could thus give us techniques to help us solve this problem.

The optimization provenance agenda

We're now ready to understand the three key pieces of the optimization provenance agenda. We believe that a solution to these would represent a large step in solving transparency.

Legible optimizers

In order for this to work, the optimizers running as part of the agent need to be legible. I believe the hardest part of this is ensuring that the world model of the optimizer is legible. Once we have a legible world model, I expect it will be relatively easy to ensure the objective is also legible.

Ensuring that the optimization process is legible falls more naturally into the next section.

1. Find an adequate framework for explicit world models

2. Explicitly understand the objective

3. Understand what makes something an implicit world model

Mesa-optimizer control

The optimization processes used in the agent must all be capable of controlling whether they create mesa-optimizers. In particular, a mesa-optimizer should only be formed when its three components are all known and assured. This requires the optimization process to be legible in the sense that we understand under what conditions the process will create a new mesa-optimizer.

1. When are mesa-optimizers created?

2. How do we ensure that the mesa-optimizer will be legible?

3. Can such mesa-optimizers be made corrigible?

4. Does this apply to other forms of optimization daemons?
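The control requirement above can be pictured as a gate on sub-optimizer creation. This is a speculative sketch of our own, not a mechanism from any existing system: the training process refuses to instantiate a sub-optimizer until an overseer (here a stand-in callable with a toy allowlist) has signed off on all three components.

```python
from typing import Callable


def gated_spawn(world_model: str, objective: str, process: str,
                overseer: Callable[[str, str, str], bool]) -> dict:
    """Instantiate a sub-optimizer only if the overseer approves all
    three of its components; otherwise refuse and surface the attempt."""
    if not overseer(world_model, objective, process):
        raise PermissionError("sub-optimizer creation rejected by overseer")
    return {"world_model": world_model,
            "objective": objective,
            "process": process}


# Toy overseer: approves only objectives on a pre-vetted allowlist.
approved_objectives = {"predict next observation"}

def overseer(wm: str, obj: str, proc: str) -> bool:
    return obj in approved_objectives


sub = gated_spawn("weather model", "predict next observation",
                  "gradient descent", overseer)
```

A real version of this gate is exactly what questions 1 and 2 above demand: it presupposes that we can detect when a mesa-optimizer is about to form and can read off its components, which is the hard open problem.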

Provenance accountability

Having legible provenance of every optimizer occurring in the agent is a good step towards solving transparency. But this is not enough by itself, because the provenance may itself be forged. In order for the provenance to be assured, an overseer must investigate any changes made to the optimization provenance, and verify that there is no optimization toward unaligned goals.

1. What constitutes adequate oversight?

2. How possible is it for accidental treachery to be produced?

3. To what extent can Goodhart's curse be avoided with optimization provenance?
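To make the forgery concern concrete, here is one standard way to make tampering with a provenance record detectable (our illustration, not a mechanism proposed in this agenda): store entries in a hash chain, so that retroactively editing any entry invalidates every later digest.

```python
import hashlib
import json


def entry_digest(prev_digest: str, entry: dict) -> str:
    # Each digest commits to the entry AND to the previous digest,
    # chaining the whole log together.
    payload = prev_digest + json.dumps(entry, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def append(log: list, entry: dict) -> None:
    prev = log[-1]["digest"] if log else ""
    log.append({"entry": entry, "digest": entry_digest(prev, entry)})


def verify(log: list) -> bool:
    """Recompute every digest; any retroactive edit breaks the chain."""
    prev = ""
    for item in log:
        if item["digest"] != entry_digest(prev, item["entry"]):
            return False
        prev = item["digest"]
    return True


log: list = []
append(log, {"optimizer": "training_process", "objective": "training loss"})
append(log, {"optimizer": "mesa_optimizer", "objective": "proxy reward"})
```

Tamper-evidence of this kind only shifts the problem, of course: it forces forgery to happen at write time rather than after the fact, which is precisely why the overseer must inspect every change as it is made.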

Recursive uses

It seems likely that progress on the optimization provenance agenda could be leveraged to make progress on other subproblems. In particular, I think that developing frameworks for explicit models will make it easier to solve the inner alignment problem. I also believe that the idea of optimization provenance is a useful handle for thinking about mesa-optimizers in general.