Learning with catastrophes

A catastrophe is an event so bad that we are not willing to let it happen even a single time. For example, we would be unhappy if our self-driving car ever accelerates to 65 mph in a residential area and hits a pedestrian.

Catastrophes present a theoretical challenge for traditional machine learning — typically there is no way to reliably avoid catastrophic behavior without strong statistical assumptions.

In this post, I’ll lay out a very general model for catastrophes in which they are avoidable under much weaker statistical assumptions. I think this framework applies to the most important kinds of catastrophe, and will be especially relevant to AI alignment.

Designing practical algorithms that work in this model is an open problem. In a subsequent post I describe what I currently see as the most promising angles of attack.

Modeling catastrophes

We consider an agent A interacting with the environment over a sequence of episodes. Each episode produces a transcript τ, consisting of the agent’s observations and actions, along with a reward r ∈ [0, 1]. Our primary goal is to quickly learn an agent which receives high reward. (Supervised learning is the special case where each transcript consists of a single input and a label for that input.)

While training, we assume that we have an oracle which can determine whether a transcript τ is “catastrophic.” For example, we might show a transcript to a QA analyst and ask them if it looks catastrophic. This oracle can be applied to arbitrary sequences of observations and actions, including those that don’t arise from an actual episode. So training can begin before the very first interaction with nature, using only calls to the oracle.
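Concretely, the objects in this model can be sketched as follows. This is a minimal illustration, not from the post: the type names, the `Transcript` layout, and the toy oracle are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A transcript records the agent's observations and actions for one
# episode, together with a reward in [0, 1].
@dataclass
class Transcript:
    steps: List[Tuple[str, str]]  # (observation, action) pairs
    reward: float                 # in [0, 1]

# The catastrophe oracle judges arbitrary transcripts, including
# hypothetical ones that never arose from real interaction with nature.
CatastropheOracle = Callable[[Transcript], bool]

def is_catastrophic_example(tau: Transcript) -> bool:
    # Toy stand-in for a human/QA judgment: flag any transcript in which
    # the agent chooses the (hypothetical) catastrophic action.
    return any(action == "accelerate_residential" for _, action in tau.steps)
```

In a real system the oracle would be a slow, expensive judgment (e.g. human review), which is why the model only allows oracle calls during training.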

Intuitively, a transcript should only be marked catastrophic if it satisfies two conditions:

  1. The agent made a catastrophically bad decision.

  2. The agent’s observations are plausible: we have a right to expect the agent to be able to handle those observations.

While actually interacting with the environment, the agent cannot query the oracle — there is no time to wait for a QA analyst to review a proposed action to check if it would be catastrophic.

Moreover, if interaction with nature ever produces a catastrophic transcript, we immediately fail. The performance of an algorithm is characterized by two parameters: the probability of catastrophic failure, and the total reward assuming no catastrophic failure.

We assume that there are some policies such that no matter what nature does, the resulting transcript is never catastrophic.

Traditionally in RL the goal is to get as much reward as the best policy from some class C. We’ll slightly weaken that goal, and instead aim to do as well as the best policy from C that never makes a catastrophic decision.

Batch learning

I’ve described an online version of learning with catastrophes. We can also consider the batch version, where the learner is first given a large number of “training” episodes.

In the batch version, there is no penalty for catastrophes at training time, and we don’t care about training error. The two performance parameters are test-time performance and test-time catastrophe probability.

The oracle

This definition depends on an oracle who determines which transcripts are catastrophic.

For weak AI systems, the oracle may be a human. But a powerful AI system might take actions which are catastrophic but which look inoffensive to a human judge, so this approach doesn’t cut it.

In general, the judge should be a human+AI team which is more competent than the system being trained, armed with an adequate solution to the informed oversight problem.


Learning with catastrophes is straightforward given an unlimited number of queries to the catastrophe oracle. Given any online learning algorithm A, we can “harden” it by running the following process before the beginning of each episode (a similar process can be applied to a batch learner):

  1. Search over all possible environments, running A on each one to obtain a transcript.

  2. If we find a catastrophic transcript τ, then add τ to A’s training data with a reward of −1, and go back to step 1.
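The hardening loop above can be sketched as follows. Everything here is an illustrative assumption: the learner interface (`act`/`update`), the one-observation “environments,” and the toy oracle are stand-ins, not the post’s actual machinery.

```python
class ToyLearner:
    """Learns a deterministic observation -> action policy from penalties."""
    def __init__(self):
        self.policy = {}

    def act(self, obs):
        return self.policy.get(obs, "accelerate")  # default action

    def update(self, transcript, reward):
        if reward < 0:
            # Penalized transcript: switch away from the offending action.
            for obs, _action in transcript:
                self.policy[obs] = "brake"

def run_episode(agent, env):
    # Toy setting: an "environment" is a single observation, an episode
    # is one (observation, action) pair.
    return [(env, agent.act(env))]

def oracle(transcript):
    # Toy catastrophe judgment: accelerating near a pedestrian.
    return any(o == "pedestrian" and a == "accelerate" for o, a in transcript)

def harden(agent, environments, oracle, run_episode):
    """Step 1: search all environments for a catastrophic transcript.
    Step 2: on finding one, train on it with reward -1 and restart."""
    found = True
    while found:
        found = False
        for env in environments:
            tau = run_episode(agent, env)
            if oracle(tau):
                agent.update(tau, reward=-1.0)
                found = True
```

With a finite environment set this terminates once no environment elicits a catastrophe; the exhaustive outer search is exactly the part that is intractable in general.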

It’s easy to prove that this process converges, if A is competent: the number of times we invoke step 2 is at most the time required to learn an optimal catastrophe-free policy (plus the number of episodes).

The big problem with this algorithm is the exponential search in step 1.

Adversarial training

The most obvious approach to tractable learning with catastrophes is to replace step 1 with a tractable algorithm that approximates the search. This process is an example of adversarial training.

We can formalize this as a simple game between the agent A and an adversary B. The two agents alternate taking moves, with A playing the role of the agent and B playing the role of the environment. B wins if the resulting transcript is catastrophic; A wins if it is not catastrophic.

Before each episode, we play this game N times, where N > 0 is a hyperparameter. If B is able to win any of the games, then we add the resulting transcript to A’s training data and start again. Once B is not able to win any games, then we move on to a “real” episode. We hope that B is good enough at the game that if it can’t produce a catastrophic transcript, then the real environment can’t produce a catastrophic transcript either.

More precisely, before each episode we perform the following process:

  1. Set i = 0.

  2. A and B alternate taking moves, producing a transcript τ.

  3. If τ is catastrophic, we add τ to A’s training data with a reward of −1, and add τ to B’s training data with a reward of +1. Then we go back to step 1.

  4. If τ is not catastrophic, we add τ to B’s training data with a reward of −1.

  5. If i < N, we increment i and go back to step 2.
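The steps above can be sketched as a loop that requires N consecutive failed attacks before the real episode is allowed to run. This is a toy illustration: the agent, the adversary, the one-move game, and the oracle are all assumed stand-ins (a real adversary B would learn from its rewards rather than cycle through candidates).

```python
class ToyAgent:
    """A with a deterministic observation -> action policy."""
    def __init__(self):
        self.policy = {}

    def act(self, obs):
        return self.policy.get(obs, "accelerate")

    def update(self, transcript, reward):
        if reward < 0:
            for obs, _action in transcript:
                self.policy[obs] = "brake"

class ToyAdversary:
    """B plays the role of the environment by proposing observations."""
    def __init__(self, candidates):
        self.candidates = list(candidates)
        self.i = 0

    def move(self):
        obs = self.candidates[self.i % len(self.candidates)]
        self.i += 1
        return obs

    def update(self, transcript, reward):
        pass  # a learning adversary would use this reward signal

def play_game(A, B):
    # One-move toy game: B proposes an observation, A responds.
    obs = B.move()
    return [(obs, A.act(obs))]

def oracle(transcript):
    return any(o == "pedestrian" and a == "accelerate" for o, a in transcript)

def adversarial_filter(A, B, oracle, N):
    """Play until B fails to produce a catastrophe N times in a row."""
    i = 0                                # step 1
    while i < N:
        tau = play_game(A, B)            # step 2
        if oracle(tau):                  # step 3: B wins
            A.update(tau, reward=-1.0)
            B.update(tau, reward=+1.0)
            i = 0                        # restart the count
        else:                            # step 4: B loses
            B.update(tau, reward=-1.0)
            i += 1                       # step 5
```

Whether this filter provides any safety depends entirely on how well B’s search covers what the real environment can do.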

I discuss this idea in more detail in my post on red teams. There are serious problems with this approach and I don’t think it can work on its own, but fortunately it seems combinable with other techniques.


Learning with catastrophes is a very general model of catastrophic failures which avoids being obviously impossible. I think that designing competent algorithms for learning with catastrophes may be an important ingredient in a successful approach to AI alignment.

This was originally posted here on 28th May, 2016.

Tomorrow’s AI Alignment sequences post will be in the sequence on Value Learning by Rohin Shah.

The next post in this sequence will be ‘Thoughts on Reward Engineering’ by Paul Christiano, on Thursday.