# Learning with catastrophes

A catastrophe is an event so bad that we are not willing to let it happen even a single time. For example, we would be unhappy if our self-driving car ever accelerates to 65 mph in a residential area and hits a pedestrian.

Catastrophes present a theoretical challenge for traditional machine learning — typically there is no way to reliably avoid catastrophic behavior without strong statistical assumptions.

In this post, I’ll lay out a very general model for catastrophes in which they are avoidable under much weaker statistical assumptions. I think this framework applies to the most important kinds of catastrophe, and will be especially relevant to AI alignment.

Designing practical algorithms that work in this model is an open problem. In a subsequent post I describe what I currently see as the most promising angles of attack.

## Modeling catastrophes

We consider an agent A interacting with the environment over a sequence of episodes. Each episode produces a transcript τ, consisting of the agent’s observations and actions, along with a reward r ∈ [0, 1]. Our primary goal is to quickly learn an agent which receives high reward. (Supervised learning is the special case where each transcript consists of a single input and a label for that input.)

While training, we assume that we have an oracle which can determine whether a transcript τ is “catastrophic.” For example, we might show a transcript to a QA analyst and ask them if it looks catastrophic. This oracle can be applied to arbitrary sequences of observations and actions, including those that don’t arise from an actual episode. So training can begin before the very first interaction with nature, using only calls to the oracle.
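To make the setup concrete, here is a minimal sketch of the objects this section describes; every type and function name below is an illustrative assumption, not something from the post:

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    """One episode: the agent's observations and actions, plus a reward."""
    steps: list          # [(observation, action), ...]
    reward: float        # r in [0, 1]

def qa_oracle(tau: Transcript) -> bool:
    """Toy stand-in for the QA analyst: returns True iff the transcript is
    catastrophic. Note it accepts arbitrary observation/action sequences,
    including ones that never arose from a real episode."""
    return any(obs == "residential" and act == "accelerate to 65 mph"
               for obs, act in tau.steps)
```

The oracle is just a judgment on transcripts, so it can be queried during training without ever touching the real environment.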

Intuitively, a transcript should only be marked catastrophic if it satisfies two conditions:

1. The agent’s behavior is clearly unacceptable: the catastrophe is attributable to the agent’s actions, not merely to an unavoidable outcome.

2. The agent’s observations are plausible: we have a right to expect the agent to be able to handle those observations.

While actually interacting with the environment, the agent cannot query the oracle — there is no time to wait for a QA engineer to review a proposed action to check if it would be catastrophic.

Moreover, if interaction with nature ever produces a catastrophic transcript, we immediately fail. The performance of an algorithm is characterized by two parameters: the probability of catastrophic failure, and the total reward assuming no catastrophic failure.

We assume that there are some policies such that no matter what nature does, the resulting transcript is never catastrophic.

Traditionally in RL the goal is to get as much reward as the best policy from some class C. We’ll slightly weaken that goal, and instead aim to do as well as the best policy from C that never makes a catastrophic decision.

## Batch learning

I’ve described an online version of learning with catastrophes. We can also consider the batch version, where the learner is first given a large number of “training” episodes.

In the batch version, there is no penalty for catastrophes at training time, and we don’t care about training error. The two performance parameters are test-time performance and test-time catastrophe probability.

## The oracle

This definition depends on an oracle who determines which transcripts are catastrophic.

For weak AI systems, the oracle may be a human. But a powerful AI system might take actions which are catastrophic but which look inoffensive to a human judge, so this approach doesn’t cut it.

In general, the judge should be a human+AI team which is more competent than the system being trained, armed with an adequate solution to the informed oversight problem.

# Approach

Learning with catastrophes is straightforward given an unlimited number of queries to the catastrophe oracle. Given any online learning algorithm A, we can “harden” it by running the following process before the beginning of each episode (a similar process can be applied to a batch learner):

1. Search over all possible environments, running A on each one to obtain a transcript.

2. If we find a catastrophic transcript τ, then add τ to A’s training data with a reward of −1, and go back to step 1.
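The two steps above can be sketched as follows, assuming a finite, enumerable class of environments and an agent exposing `act`/`train_on` interfaces (all interface names are assumptions made for illustration):

```python
def run_episode(agent, env):
    """Roll out one episode against env, returning the transcript as a
    list of (observation, action) pairs."""
    steps = []
    for t in range(env.horizon):
        obs = env.observe(t)
        steps.append((obs, agent.act(obs)))
    return steps

def harden(agent, oracle, environments, max_rounds=1000):
    """Steps 1-2 above: search all environments for a catastrophic
    transcript; on finding one, train against it with reward -1 and
    restart the search."""
    for _ in range(max_rounds):
        catastrophic = None
        for env in environments:              # step 1: exhaustive search
            tau = run_episode(agent, env)
            if oracle(tau):                   # step 2: found a catastrophe
                catastrophic = tau
                break
        if catastrophic is None:
            return agent                      # nothing elicits a catastrophe
        agent.train_on(catastrophic, reward=-1)
    raise RuntimeError("hardening did not converge within max_rounds")
```

With a competent learner, each pass through the `train_on` branch rules out at least one catastrophic behavior, which is why the convergence argument below goes through.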

It’s easy to prove that this process converges, if A is competent: the number of times we invoke step 2 is at most the time required to learn an optimal catastrophe-free policy (plus the number of episodes).

The big problem with this algorithm is the exponential search in step 1.

The most obvious approach to tractable learning with catastrophes is to replace step 1 with a tractable algorithm that approximates the search. This process is an example of adversarial training.

We can formalize this as a simple game between the agent A and an adversary B. The two agents alternate taking moves, with A playing the role of the agent and B playing the role of the environment. B wins if the resulting transcript is catastrophic; A wins if it is not catastrophic.

Before each episode, we play this game N times, where N > 0 is a hyperparameter. If B is able to win any of the games, then we add the resulting transcript to A’s training data and start again. Once B is not able to win any games, then we move on to a “real” episode. We hope that B is good enough at the game that if it can’t produce a catastrophic transcript, then the real environment can’t produce a catastrophic transcript either.

More precisely, before each episode we perform the following process:

1. Set i = 0.

2. A and B alternate taking moves, producing a transcript τ.

3. If τ is catastrophic, we add τ to A’s training data with a reward of −1, and add τ to B’s training data with a reward of +1. Then we go back to step 1.

4. If τ is not catastrophic, we add τ to B’s training data with a reward of −1.

5. If i < N, we increment i and go back to step 2.
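The five steps above can be sketched as follows, with toy `move`/`train_on` interfaces for A and B (all names here are illustrative assumptions, not part of the formal model):

```python
def play_game(A, B, horizon=2):
    """A and B alternate moves: B (playing the environment) emits an
    observation, A responds with an action."""
    steps = []
    for t in range(horizon):
        obs = B.move(steps)
        steps.append((obs, A.move(obs)))
    return steps

def pre_episode_check(A, B, oracle, N):
    """Steps 1-5: keep playing until A survives enough consecutive games."""
    i = 0                                  # step 1
    while True:
        tau = play_game(A, B)              # step 2
        if oracle(tau):                    # step 3: B wins; retrain, restart
            A.train_on(tau, reward=-1)
            B.train_on(tau, reward=+1)
            i = 0
        else:                              # step 4: A survives this game
            B.train_on(tau, reward=-1)
            if i >= N:                     # step 5 (negated): move on
                return                     # proceed to a "real" episode
            i += 1
```

Note the loop only terminates once B has failed to win several games in a row, which is exactly the hoped-for evidence that the real environment can’t produce a catastrophe either.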

I discuss this idea in more detail in my post on red teams. There are serious problems with this approach and I don’t think it can work on its own, but fortunately it seems combinable with other techniques.

# Conclusion

Learning with catastrophes is a very general model of catastrophic failures which avoids being obviously impossible. I think that designing competent algorithms for learning with catastrophes may be an important ingredient in a successful approach to AI alignment.

This was originally posted here on 28th May, 2016.

Tomorrow’s AI Alignment sequences post will be in the sequence on Value Learning by Rohin Shah.

The next post in this sequence will be ‘Thoughts on Reward Engineering’ by Paul Christiano, on Thursday.

• I’m not sure how necessary it is to explicitly aim to avoid catastrophic behavior—it seems that even a low capability corrigible agent would still know enough to avoid catastrophic behavior in practice. Of course, it would be better to have stronger guarantees against catastrophic behavior, so I certainly support research on learning from catastrophes—but if it turns out to be too hard, or impose too much overhead, it could still be fine to aim for corrigibility alone.

I do want to make a perhaps obvious note: the assumption that “there are some policies such that no matter what nature does, the resulting transcript is never catastrophic” is somewhat strong. In particular, it precludes the following scenario: the environment can do anything computable, and the oracle evaluates behavior only based on outcomes (observations). In this case, for any observation that the oracle would label as catastrophic, there is an environment that regardless of the agent’s action outputs that observation. So for this problem to be solvable, we need to either have a limit on what the environment “could do”, or an oracle that judges “catastrophe” based on the agent’s action in addition to outcomes (which I suspect will cache out to “are the actions in this transcript knowably going to cause something bad to happen”). In the latter case, it sounds like we are trying to train “robust corrigibility” as opposed to “never letting a catastrophe happen”. Do you have a sense for which of these two assumptions you would want to make?

• I’m not sure how necessary it is to explicitly aim to avoid catastrophic behavior—it seems that even a low capability corrigible agent would still know enough to avoid catastrophic behavior in practice.

Paul gave a bit more motivation here: (It’s a bit confusing that these two posts are reposted here out of order. ETA on 1/28/19: Strange, the date on that repost just changed to today’s date. Yesterday it was dated November 2018.)

If powerful ML systems fail catastrophically, they may be able to quickly cause irreversible damage. To be safe, it’s not enough to have an average-case performance guarantee on the training distribution — we need to ensure that even if our systems fail on new distributions or with small probability, they will never fail too badly.

My interpretation of this is that learning with catastrophes / optimizing worst-case performance (I believe these are referring to the same thing, which is also confusing) is needed to train an agent that can be called corrigible in the first place. Without it, we could end up with an agent that looks corrigible on the training distribution, but would do something malign (“applies its intelligence in the service of an unintended goal”) after deployment.

• Yeah, that makes sense, also the distinction between benign and malign failures in that post seems right. It makes much more sense that learning with catastrophes is necessary for corrigibility.

• In particular, it precludes the following scenario: the environment can do anything computable, and the oracle evaluates behavior only based on outcomes (observations).

Paul explicitly writes that the oracle sees both observations and actions: ‘This oracle can be applied to arbitrary sequences of observations and actions […].’

or an oracle that judges “catastrophe” based on the agent’s action in addition to outcomes (which I suspect will cache out to “are the actions in this transcript knowably going to cause something bad to happen”)

This is also covered:

Intuitively, a transcript should only be marked catastrophic if it satisfies two conditions:

1. The agent’s behavior is clearly unacceptable: the catastrophe is attributable to the agent’s actions, not merely to an unavoidable outcome.

2. The agent’s observations are plausible: we have a right to expect the agent to be able to handle those observations.

• Paul explicitly writes that the oracle sees both observations and actions: ‘This oracle can be applied to arbitrary sequences of observations and actions […].’

I know; I’m asking how the oracle would have to work in practice. Presumably at some point we will want to actually run the “learning with catastrophes algorithm”, and it will need an oracle, and I’d like to know what needs to be true of the oracle.

This is also covered

Indeed, my point with that sentence was that it sounds like we are only trying to avoid catastrophes that could have been foreseen, as opposed to literally all catastrophes as the post suggests, which is why the next sentence is:

In the latter case, it sounds like we are trying to train “robust corrigibility” as opposed to “never letting a catastrophe happen”.

• “never letting a catastrophe happen” would incentivize the agent to spend a lot of resources on foreseeing catastrophes and building capacity to ward them off. This would distract from the agent’s main task. So we have to give the agent some slack. Is this what you’re getting at? The oracle needs to decide whether or not the agent can be held accountable for a catastrophe, but the article doesn’t say anything about how it would do this?

• The oracle needs to decide whether or not the agent can be held accountable for a catastrophe, but the article doesn’t say anything about how it would do this?

Yes, basically. I’m not saying the article should specify how the oracle should do this, I’m saying that it should flag this as a necessary property of the oracle (or argue why it is not a necessary property).

• I agree.

• Maybe we could also add an onboard predictor of the Catastrophe Oracle? It works as follows:

1) In a game environment, agent A suggests many transcripts which are evaluated by the Catastrophe Oracle (CO).

2) Based on this data, agent A learns to predict the output of the Catastrophe Oracle; in other words, there is a separate neural net or other best available machine learning system which learns to predict the Oracle’s output. This is a CO-predictor.

3) When acting in the real world in a new situation and with a new transcript, but when there is no time to actually query the real Catastrophe Oracle, the agent predicts the likely output of the Oracle based on the function learned in step 2 and chooses the strategies which are less likely to trigger the Catastrophe Oracle.

4) Among all strategies which will not trigger the CO-predictor, the agent chooses those whose small variations will also not trigger the CO-predictor. This is needed to prevent complex, fragile plans from escaping the CO (e.g. if a self-driving car tries to pass between two pedestrians by drifting, that is a fragile strategy).

Example: a child learns that her mother is angry when she breaks a cup and some other things. Now when she takes a fragile thing, she can imagine that she could break it and that her mother would be unhappy; in other words, she has internalised parental control via her ability to predict her mother’s anger.
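The CO-predictor idea above can be sketched as follows, with a toy frequency-based estimator standing in for the neural net; every name and interface here is an illustrative assumption:

```python
from collections import defaultdict

class OraclePredictor:
    """Toy stand-in for the proposed CO-predictor: it estimates, for each
    (observation, action) pair, how often the Catastrophe Oracle flagged
    transcripts containing that pair. A real implementation would be a
    neural net or other learned model."""
    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # pair -> [flagged, seen]

    def fit(self, transcripts, oracle):
        # Steps 1-2: label game-environment transcripts with the real
        # oracle, then learn to predict its output.
        for tau in transcripts:
            flagged = oracle(tau)
            for pair in tau:
                self.counts[pair][0] += int(flagged)
                self.counts[pair][1] += 1

    def risk(self, tau):
        # Step 3: predicted probability that the oracle would flag tau;
        # transcripts made only of unseen pairs default to "uncertain".
        probs = [f / n for f, n in (self.counts[p] for p in tau) if n > 0]
        return max(probs, default=0.5)

def choose_action(predictor, obs, actions, perturb, threshold=0.1):
    # Steps 3-4: keep only actions whose predicted risk stays below the
    # threshold for the current observation *and* for small perturbations
    # of it, filtering out fragile plans; perturb is a user-supplied
    # function returning nearby observations.
    safe = [a for a in actions
            if all(predictor.risk([(o, a)]) < threshold
                   for o in [obs] + perturb(obs))]
    return safe[0] if safe else None
```

The perturbation check in `choose_action` corresponds to step 4: an action only counts as safe if the predictor also clears it in slightly varied situations.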