Catastrophe Mitigation Using DRL (Appendices)

These are Appendices B and C for the essay Catastrophe Mitigation Using DRL. They appear in a separate post because of a length limit on the website.

##Appendix B

Given , we denote , .

#Proposition B.1

Consider a universe which is an -realization of an MDP with state function , a stationary policy , an arbitrary -policy and some . Then,

#Proof of Proposition B.1

To lighten the notation, we will omit the parameter in functions that depend on it. We will use implicitly, i.e. given a function on and , . Finally, we will omit , using the shorthand notations , .

For any , it is easy to see that

Taking expected value over , we get

It is easy to see that the second term vanishes, yielding the desired result.

#Proposition B.2

Consider some , , a universe that is an -realization of with state function , a stationary policy and an arbitrary -policy . For any , let be an -policy s.t. for any

Assume that

i. For any

ii. For any and

Then, for any ,

#Proof of Proposition B.2

To lighten the notation, we will use implicitly, i.e. given a function on and , . Also, we will omit , using the shorthand notations , .

By Proposition B.1, for any

coincides with after , therefore the corresponding expected values vanish.

Subtracting the equalities for and , we get

and coincide until , therefore

Denote , . We also use the shorthand notations , , . Both and coincide with after , therefore

Denote . By the mean value theorem, for each there is s.t.

It follows that

Here, an expected value w.r.t. the difference of two probability measures is understood to mean the corresponding difference of expected values.
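The convention just stated can be checked on a toy finite example. The following is only an illustrative sketch: the measures `mu` and `nu`, the function `f`, and all helper names are hypothetical and do not come from the essay.

```python
from fractions import Fraction


def expectation(measure, f):
    """Expected value of f under a finite (possibly signed) measure {outcome: weight}."""
    return sum(weight * f(outcome) for outcome, weight in measure.items())


def expectation_wrt_difference(mu, nu, f):
    """E_{mu - nu}[f], understood as E_mu[f] - E_nu[f]."""
    return expectation(mu, f) - expectation(nu, f)


def signed_difference(mu, nu):
    """The signed measure mu - nu on the union of the two supports."""
    outcomes = set(mu) | set(nu)
    return {x: mu.get(x, 0) - nu.get(x, 0) for x in outcomes}


# Hypothetical toy data, chosen only for illustration.
mu = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
nu = {"a": Fraction(1, 4), "b": Fraction(3, 4)}
f = {"a": 4, "b": 8}.get

# Both readings agree: integrate f against the signed measure, or subtract expectations.
assert expectation_wrt_difference(mu, nu, f) == expectation(signed_difference(mu, nu), f)
```

In the finite case the two readings coincide by linearity, which is why the convention is harmless.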

It is easy to see that assumption i implies that is a submartingale for (whereas it is a martingale for ) and therefore

We get

Summing over , we get

Applying Proposition B.1 to the right-hand side

#Proof of Lemma A.1

Fix , and . Denote . To avoid cumbersome notation, whenever should appear in a subscript, we will replace it by . Let be a probability space. Let be a random variable and the following be stochastic processes

We also define by

(The following conditions on and imply that the range of the above is indeed in .) Let and be as in Proposition C.1 (we assume w.l.o.g. that ). We construct , , , , , , and s.t. is uniformly distributed and for any , , and , denoting

Note that the last equation has the form of a Bayesian update which is allowed to be arbitrary when the update is on “impossible” information.
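As a rough illustration of a Bayesian update that is left arbitrary on "impossible" information, here is a minimal sketch over a finite hypothesis set. The function name, the dictionary representation, and the choice to return the prior in the zero-probability case are all assumptions made for the example, not the construction used in the proof.

```python
from fractions import Fraction


def bayes_update(prior, likelihood):
    """Posterior over hypotheses given P(observation | hypothesis).

    prior:      {hypothesis: probability}
    likelihood: {hypothesis: probability of the observation under it}
    """
    joint = {h: p * likelihood[h] for h, p in prior.items()}
    total = sum(joint.values())
    if total == 0:
        # The observation has probability zero under the prior, so Bayes' rule
        # does not determine the posterior. Any choice is allowed; keeping the
        # prior is an arbitrary one made here for illustration.
        return dict(prior)
    return {h: w / total for h, w in joint.items()}


# Hypothetical toy data.
prior = {"h1": Fraction(1, 2), "h2": Fraction(1, 2)}
posterior = bayes_update(prior, {"h1": Fraction(1), "h2": Fraction(1, 3)})
unchanged = bayes_update(prior, {"h1": Fraction(0), "h2": Fraction(0)})
```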

We now construct the -policy s.t. for any , s.t. and

That is, we perform Thompson sampling at time intervals of size , moderated by the delegation routine , and discard from our belief state hypotheses whose probability is below and hypotheses whose sampling resulted in recommending “unsafe” actions, i.e. actions that refused to perform.
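The informal description above can be sketched in code. This is a heavily simplified illustration under stated assumptions: hypotheses are plain labels, `accepts` is a hypothetical stand-in for the delegation routine (it returns `True` when the routine agrees to perform the action recommended under the sampled hypothesis), and the Bayesian update on observations between resampling times is omitted. None of the names below appear in the essay.

```python
import random
from fractions import Fraction


def prune(belief, threshold, rejected=()):
    """Drop hypotheses with probability below threshold or previously rejected, then renormalize."""
    kept = {h: p for h, p in belief.items() if p >= threshold and h not in rejected}
    total = sum(kept.values())
    if total == 0:
        # Nothing survives; fall back to the unpruned belief (an arbitrary choice).
        return dict(belief)
    return {h: p / total for h, p in kept.items()}


def thompson_sample(belief, rng):
    """Sample one hypothesis with probability proportional to its weight in belief."""
    hypotheses = list(belief)
    weights = [float(belief[h]) for h in hypotheses]
    return rng.choices(hypotheses, weights=weights, k=1)[0]


def run(belief, threshold, interval, horizon, accepts, rng):
    """Resample a hypothesis every `interval` steps, rejecting those the delegate refuses."""
    rejected = set()
    history = []
    for _ in range(0, horizon, interval):
        current = prune(belief, threshold, rejected)
        h = thompson_sample(current, rng)
        if not accepts(h):
            # The delegation routine refused: treat the hypothesis as recommending
            # an "unsafe" action and exclude it from future sampling.
            rejected.add(h)
        history.append(h)
    return history, rejected


# Hypothetical toy belief state.
belief = {"a": Fraction(3, 5), "b": Fraction(7, 20), "c": Fraction(1, 20)}
history, rejected = run(belief, Fraction(1, 10), interval=5, horizon=50,
                        accepts=lambda h: h != "b", rng=random.Random(0))
```

In this toy run the low-probability hypothesis `"c"` is never sampled, and `"b"` can be sampled at most once before it is excluded.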

In order to prove has the desired property, we will define the stochastic processes , , , , and , each process of the same type as its shriekless counterpart (thus is constructed to accommodate them). These processes are required to satisfy the following:

For any , we construct the -policy s.t. for any , s.t. and

Given any -policy and -policy we define by

Here, is a constant defined s.t. the probabilities sum to 1. We define the -policy by

Condition iii of Proposition C.1 and condition i of Definition A.1 imply that for any

This means we can apply Proposition B.2 and get

Here, the -policy is defined as in Proposition B.2. We also define the -policies and by

Denote

For each , denote

We have

Condition iv of Proposition C.1 and condition ii of Definition A.1 imply that, given s.t.

Therefore, , and we remain with

We have

Since , it follows that

Using condition i of Proposition C.1, we conclude

Define the random variables by

Averaging the previous inequality over , we get

We apply Proposition C.2 to each term in the sum over .

Condition ii of Proposition C.1 implies that

Here, the factor of 2 comes from the difference between the equations for and (we can construct an intermediate policy between and and use the triangle inequality for ). We conclude

Now we set

Without loss of generality, we can assume that (because of the form of the bound we are proving), which implies that and . We get

##Appendix C

The following is a simple special case of what appeared as “Proposition A.2” in the previous essay, where we restrict to be single-valued (the more general case isn’t needed).

#Proposition C.1

Fix an interface , , , . Consider some . Then, there exist and with the following properties. Given , we denote its projection to . Thus, . Given an -environment, , and , we can define as follows

We require that for every , and as above, the following conditions hold

i.

ii.

iii. For all , if then

iv. For all , if then


The following appeared in the previous essay as “Proposition A.1”.

#Proposition C.2

Consider a probability space , , a finite set and random variables , and . Assume that and . Then