Simplified explanation of stratification

A putative new idea for AI control; index here.

I’ve previously talked about stratified indifference/learning. In this short post, I’ll try to present the idea as simply and clearly as possible.


The truth is yet written

Assume the AI is maximising a utility $u$, but is unsure whether $u$ is equal to utility $v$ or instead utility $w$. Humans have written the true utility on a paper somewhere, and, in an hour, the AI will look at that paper to check what $u$ is. Let $I_v$ be the indicator function that $v$ is written on that paper in an hour; $I_w$ is the obvious converse.

Thus $u = I_v v + I_w w$.

Here’s the problem. There is nothing stopping the AI from looking up the paper right now. And there is nothing stopping the AI from rewriting the paper, or substituting another ($u$ is determined by the paper in an hour’s time, however that paper got there). In terms of causal graphs, we have the following setup: $\mathbf{W_0} \to \mathbf{W_1} \leftarrow \mathbf{A}$ and $\mathbf{W_1} \to \mathbf{U}$.

Here $\mathbf{U}$ is the node that represents the utility that $u$ is ultimately equal to, $\mathbf{W_0}$ is what is currently written on the paper, $\mathbf{W_1}$ is what will be written on the paper at decision time, and $\mathbf{A}$ is the AI’s action. Note that $\mathbf{W_1}$ determines $\mathbf{U}$ directly, while it itself is determined by $\mathbf{W_0}$ (the past paper) and $\mathbf{A}$ (the AI’s actions).

Default policy

Now assume that there is a default action or policy $\emptyset$ where the AI simply does nothing. In that case, the value of the current-paper node $\mathbf{W_0}$ determines the decision-time paper node $\mathbf{W_1}$ completely. The AI has another action $O$ (short for overwrite) in which it overwrites the paper’s choice (this subdivides into $O_v$, overwrite with utility $v$, and $O_w$, overwrite with utility $w$). Given $O$, $\mathbf{W_1}$ is then independent of $\mathbf{W_0}$.

Generally speaking, the AI will always choose $O$: one of $v$ and $w$ will be easier to maximise, so the AI will overwrite the paper with that one.
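
To make the incentive concrete, here is a minimal sketch of the classical expected-utility calculation in a toy version of this setup. The prior, the payoff numbers and the action names are all illustrative assumptions, not part of the original argument; the only structural assumption is that the AI ends up maximising whatever the paper says in an hour.

```python
# Toy model of the classical calculation. All numbers are illustrative assumptions.
P_W0 = {"v": 0.5, "w": 0.5}        # prior over what the paper currently says (W_0)
BEST = {"v": 1.0, "w": 10.0}       # value reachable when optimising each utility (w is "easier")
OVERWRITE = {"overwrite_v": "v", "overwrite_w": "w"}
ACTIONS = ["noop", "overwrite_v", "overwrite_w"]


def paper_at_decision_time(w0, action):
    """W_1: equal to W_0 under the default action, otherwise whatever was overwritten."""
    return OVERWRITE.get(action, w0)


def classical_expected_utility(action):
    """E[u | A = action], where u is whatever W_1 says and the AI then maximises it."""
    return sum(p * BEST[paper_at_decision_time(w0, action)] for w0, p in P_W0.items())


scores = {a: classical_expected_utility(a) for a in ACTIONS}
best_action = max(scores, key=scores.get)
print(scores, best_action)
```

In this toy model the overwrite action for the easier utility scores highest, which is the failure described above.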

So, informally, what do we really want to happen? We want the AI to maximise $v$ if $v$ would have been chosen absent its intervention (and similarly for $w$). “Absent its intervention” seems to designate equality between the nodes $\mathbf{W_0}$ and $\mathbf{W_1}$. More generally, it designates what happens when $\mathbf{A}=\emptyset$.

Stratification

So let’s modify the AI’s expected utility calculation to meet that informal desideratum. First, let’s look at the universe where, absent AI intervention, $u=v$. What that means is that, conditional on $\mathbf{A}=\emptyset$, the utility node $\mathbf{U}$ is set to $v$.

In this universe, $\mathbf{W_0}$ and $\mathbf{W_1}$ are both set to $v$. Now, it makes no sense to keep $\mathbf{W_1}$ set to $v$ for other actions, since that node is a consequence of $\mathbf{A}$ as well as $\mathbf{W_0}$.

So we’ll define a probability distribution $P^{\emptyset}_v$ (and corresponding expectation $E^{\emptyset}_v$) over all nodes that are not descendants of $\mathbf{A}$ (to do so, we’ll assume that all nodes that are descendants of $\mathbf{A}$ and ancestors of $\mathbf{U}$ are endogenous variables that are deterministic functions of their inputs; the stochasticity is put into exogenous variables, see this paper for more details). Then we will make the AI choose its actions by calculating the expectation of $v$, given $P^{\emptyset}_v$.

So, more precisely, this is the expectation of $v$ in the universes where $\mathbf{U}$ would have been $v$ if the AI had chosen to do nothing; thus, for an action $a$, it is $E^{\emptyset}_v(v \mid \mathbf{A}=a)$.

So if the AI chooses action $\emptyset$, this is $E^{\emptyset}_v(v \mid \mathbf{A}=\emptyset)$. Note that, in this instance, this is equal to the classical expectation $E(v \mid \mathbf{A}=\emptyset, \mathbf{U}=v)$.

What happens if the AI chooses $O_w$, to overwrite the paper with $w$? This is irrelevant for the expected utility calculation, since we’re using $E^{\emptyset}_v(v \mid \mathbf{A}=O_w)$: this is the expectation of $v$ (the utility that would have been chosen, if $\mathbf{A}=\emptyset$). The actual value of node $\mathbf{U}$ becomes irrelevant; only the counterfactual value it would have had matters. So changing $\mathbf{W_1}$ does not benefit the AI. In fact, if the AI is time-inconsistent, it makes things worse, as it means its future self will be maximising $w$ instead of $v$.

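Then we do the same thing for $w$, resulting in the formula for the stratified expectation $E_s$ (writing $E^{\emptyset}_c$ for the expectation in the universes where $\mathbf{U}$ would have been $c$ under $\mathbf{A}=\emptyset$):

  • $E_s(u \mid \mathbf{A}=a) = P(\mathbf{U}=v \mid \mathbf{A}=\emptyset)\, E^{\emptyset}_v(v \mid \mathbf{A}=a) + P(\mathbf{U}=w \mid \mathbf{A}=\emptyset)\, E^{\emptyset}_w(w \mid \mathbf{A}=a)$.

In a sense, the AI should maximise $u \mid (\mathbf{A}=\emptyset)$: the utility that $u$ would have been equal to had it done nothing.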
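
Here is a minimal sketch of that stratified calculation in a toy model. The prior, the payoff table and the action names are illustrative assumptions. The sketch also assumes the time-inconsistent case mentioned above: the AI’s future self optimises whatever the paper actually says at decision time, while the score counts the utility that would have been chosen under the default action.

```python
# Toy model of the stratified expectation. All numbers are illustrative assumptions.
P_W0 = {"v": 0.5, "w": 0.5}   # prior over the current paper W_0 (an exogenous variable)
# ACHIEVED[c][d]: amount of utility c obtained when the future self optimises utility d.
ACHIEVED = {"v": {"v": 1.0, "w": 0.0}, "w": {"v": 0.0, "w": 10.0}}
OVERWRITE = {"overwrite_v": "v", "overwrite_w": "w"}
ACTIONS = ["noop", "overwrite_v", "overwrite_w"]


def paper_at_decision_time(w0, action):
    """W_1: equal to W_0 under the default action, otherwise whatever was overwritten."""
    return OVERWRITE.get(action, w0)


def stratified_expected_utility(action):
    """Weight each stratum by its probability under the default action,
    and score the utility that WOULD have been chosen under the default."""
    total = 0.0
    for w0, p in P_W0.items():
        counterfactual_u = paper_at_decision_time(w0, "noop")  # U under A = noop
        actual_w1 = paper_at_decision_time(w0, action)         # what the future self optimises
        total += p * ACHIEVED[counterfactual_u][actual_w1]
    return total


scores = {a: stratified_expected_utility(a) for a in ACTIONS}
best_action = max(scores, key=scores.get)
print(scores, best_action)
```

With these numbers, overwriting never beats the default action: it cannot change which utility is being scored, and it can only misdirect the future self.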

If you just want to get the basic idea of stratification, you need read no further than this. The rest is just some examples and potential issues.

Lottery and outcome pumps

Some versions of indifference have problems with infinite improbability drives.

A simple example of this would be a situation where the AI buys a lottery ticket and decides that it will overwrite the paper with $v$ if the ticket wins, and with $w$ if the ticket loses. Call this action $O_L$.

Presumably, it can use the money gained to increase utility, so $E(v \mid \mathbf{A}=O_L, \mathbf{U}=v)$ is rather high (as $\mathbf{U}=v$ now implies that the lottery was won).

This gives the following causal graph, with an extra node $\mathbf{L}$ for the lottery outcome:

Now let’s look at the stratified expectation. Given $\mathbf{A}=\emptyset$ and $\mathbf{U}=v$, we can compute $P^{\emptyset}_v$. This simply sets $\mathbf{W_0}=v$, with the lottery node $\mathbf{L}$ keeping the distribution it had before, independent of $\mathbf{W_0}$.

Thus $E^{\emptyset}_v$ will continue to treat the node $\mathbf{L}$ as an independent variable with the same probability as under $P$. Choosing $O_L$ thus does not help the AI’s expected stratified utility.
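
The contrast can be sketched in a toy model. Everything here is an illustrative assumption: the probabilities, the payoffs, the jackpot bonus, and in particular `conditioning_score`, which is only one naive way of conditioning on the realised value of the utility node and is not meant to reproduce any specific earlier indifference proposal.

```python
from itertools import product

# Toy lottery model. All numbers and names are illustrative assumptions.
P_W0 = {"v": 0.5, "w": 0.5}               # prior over the current paper W_0
P_LOTTERY = {"win": 0.01, "lose": 0.99}   # exogenous lottery outcome
BASE = {"v": 1.0, "w": 10.0}              # value reachable when optimising each utility
JACKPOT_BONUS = 100.0                     # extra utility bought with the winnings


def paper_at_decision_time(w0, lottery, action):
    """W_1 under the default action ("noop") or the lottery-overwrite action."""
    if action == "overwrite_by_lottery":
        return "v" if lottery == "win" else "w"
    return w0


def achieved(utility, w1, lottery, action):
    """Amount of `utility` obtained when the future self optimises whatever W_1 says."""
    if w1 != utility:
        return 0.0
    won = action == "overwrite_by_lottery" and lottery == "win"
    return BASE[utility] + (JACKPOT_BONUS if won else 0.0)


def worlds():
    """Enumerate the exogenous variables with their joint probability."""
    for (w0, p0), (lot, pl) in product(P_W0.items(), P_LOTTERY.items()):
        yield w0, lot, p0 * pl


def conditioning_score(action):
    """Naive variant: sum over c of P(U=c | noop) * E[c | A=action, U=c].
    Assumes each value of U has non-zero probability under the action."""
    score = 0.0
    for c in ("v", "w"):
        mass = value = 0.0
        for w0, lot, p in worlds():
            w1 = paper_at_decision_time(w0, lot, action)
            if w1 == c:  # condition on the ACTUAL value of U
                mass += p
                value += p * achieved(c, w1, lot, action)
        score += P_W0[c] * value / mass
    return score


def stratified_score(action):
    """Stratified variant: score the utility U would have had under the default action."""
    score = 0.0
    for w0, lot, p in worlds():
        counterfactual_u = paper_at_decision_time(w0, lot, "noop")
        w1 = paper_at_decision_time(w0, lot, action)
        score += p * achieved(counterfactual_u, w1, lot, action)
    return score


for a in ("noop", "overwrite_by_lottery"):
    print(a, conditioning_score(a), stratified_score(a))
```

In this sketch the naive conditioning score rewards the lottery trick, because conditioning on the realised utility selects the worlds where the ticket won. The stratified score keeps the lottery at its prior probability, so the trick scores slightly worse than doing nothing.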

Humans changing their minds

Another situation to consider is one where humans have written the paper initially, but decided to change their minds at a later date. This can be captured in the following diagram:

Here the node $\mathbf{H}$ determines how the humans change their minds and rewrite the paper. What is the dotted arrow between $\mathbf{A}$ and $\mathbf{H}$? That represents whether or not the change is a consequence of the AI’s actions. If $\mathbf{H}$ is independent of the AI’s actions (if we would have changed our minds even if $\mathbf{A}=\emptyset$), then stratification treats $\mathbf{H}$ like any other non-descendant of $\mathbf{A}$ and keeps it at whatever its prior was. Thus $\mathbf{H}$ determines $\mathbf{U}$, overriding $\mathbf{W_0}$, the humans’ initial decision.

So far, so good. But what if the change in policy is a consequence of the AI’s actions? What if the humans only change the utility because they see the AI behaving badly? Then $\mathbf{H}$ is a descendant node of $\mathbf{A}$, and stratification doesn’t apply to it. Now, we could decompose that node further as:

Here $\mathbf{HP}$ represents the fundamentals of human personalities. Since it is not a descendant node of $\mathbf{A}$, stratification will put a probability distribution over it. However, this still doesn’t allow us to intervene, given the AI’s action. Under stratification, the AI’s sole concern is still “would $\mathbf{HP}$ have led to $v$ (or $w$) under $\mathbf{A}=\emptyset$”, rather than correcting itself under a different choice of actions.

Modelling humans learning from the AI’s actions is hard. Indeed, it might be fundamentally hard: you want to distinguish humans learning from the AI (legit) from the AI manipulating humans (non-legit), and this likely can’t be done by just looking at the causal graph.

Notes on default policy

Note that the default policy $\emptyset$ might be a really bad policy (take people apart to see what they prefer). Then, if it’s pretty clear that taking people apart is a bad idea according to the preferences that will emerge from that process, the AI will not choose $\emptyset$, but will instead leave people together and try to deduce what the outcome of that process would have been.

Alternatively, if doing a logical counterfactual over its own action is bad, we could have the AI take $\emptyset$ to not be its own action, but some other causal process that would have prevented the AI from being turned on in the first place.