# Decision Theory

*(A longer text-based version of this post is also available on MIRI’s blog here, and the bibliography for the whole sequence can be found here.)*

*The next post in this sequence, ‘Embedded Agency’, will come out on Friday, November 2nd.*

*Tomorrow’s AI Alignment Forum sequences post will be ‘What is Ambitious Value Learning?’ in the sequence ‘Value Learning’.*

Cross-posting some comments from the MIRI Blog:

Konstantin Surkov:

Abram Demski:

Sure, one can imagine hypothetically taking $5, even if in reality they would take $10. That’s a spurious output from a different algorithm altogether. It assumes the world where you are not the same person who takes $10. So, it would make sense to examine which of the two you are, if you don’t yet know that you will take $10, but not if you already know it. Which of the two is it?

I’m not convinced that an inconsequential grain of uncertainty couldn’t handle this 5-10 problem. Consider an agent whose actions are probability distributions on {5,10} that are nowhere 0. We can call these points in the open affine space spanned by the points 5 and 10. U is then a linear function from this affine space to utilities. The agent would search for proofs that U is some particular such linear function. Once it finds one, it uses that linear function to compute the optimal action. To ensure that there is an optimum, we can adjoin infinitesimal values to the possible probabilities and utilities.

If the agent were to find a proof that the linear function is the one induced by mapping 5 to 5 and 10 to 0, it would return (1-ε)⋅5+ε⋅10 and get utility 5+5ε instead of the expected 5-5ε, so Löb’s theorem wouldn’t make this self-fulfilling.
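As a rough numerical check of the arithmetic in the comment above, here is a minimal sketch. It uses a small positive rational as a stand-in for ε (the comment uses an infinitesimal, which plain Python numbers cannot represent), and the names `believed_utility` and `actual_utility` are illustrative, not part of the original proposal.

```python
from fractions import Fraction

# Stand-in for the infinitesimal ε; an assumption made only so the arithmetic runs.
EPS = Fraction(1, 1000)

def believed_utility(p_ten):
    """Spurious linear function: taking 5 is worth 5, taking 10 is worth 0."""
    return (1 - p_ten) * 5 + p_ten * 0

def actual_utility(p_ten):
    """True payoffs: taking 5 is worth 5, taking 10 is worth 10."""
    return (1 - p_ten) * 5 + p_ten * 10

# Under the spurious belief, the best allowed action puts as little weight on 10
# as the open space permits, i.e. probability ε.
chosen = EPS

expected = believed_utility(chosen)  # 5 - 5ε
received = actual_utility(chosen)    # 5 + 5ε
print(expected, received)
```

The gap between `expected` and `received` is the mismatch the comment relies on: the spurious proof predicts 5−5ε, but the agent observes 5+5ε.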

So, your suggestion is not just an inconsequential grain of uncertainty, it is a grain of exploration. The agent actually does take 10 with some small probability. If you try to do this with just uncertainty, things would be worse, since that uncertainty would not be justified.

One problem is that you actually do explore a bunch, and since you don’t get a reset button, you will sometimes explore into irreversible actions, like shutting yourself off. However, if the agent has a source of randomness, and also the ability to simulate worlds in which that randomness went another way, you can have an agent that with probability 1−ε does not explore ever, and learns from the other worlds in which it does explore. So, you can either explore forever, and shut yourself off, or you can explore very very rarely and learn from other possible worlds.

The problem with learning from other possible worlds is that, to get good results out of it, you have to assume that the environment does not also learn from other possible worlds, which is not very embedded.

But you are suggesting actually exploring a bunch, and there is a problem other than just shutting yourself off. You are getting past this problem in this case by only allowing linear functions, but that is not an accurate assumption. Let’s say you are playing matching pennies with Omega, who has the ability to predict what probability you will pick but not what action you will pick.

(In matching pennies, you each choose H or T, you win if they match, they win if they don’t.)

Omega will pick H if your probability of H is less than 1/2 and T otherwise. Your utility as a function of probability is piecewise linear with two parts. Trying to assume that it will be linear will make things messy.

There is this problem where sometimes the outcome of exploring into taking 10, and the outcome of actually taking 10 because it is good are different. More on this here.
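The matching-pennies setup described above can be sketched as follows. This is a minimal illustration that assumes a win is worth 1 and a loss is worth 0; the function names are hypothetical.

```python
def omega_choice(p_heads):
    """Omega sees only your probability of H and plays the side you are less likely to pick."""
    return "H" if p_heads < 0.5 else "T"

def expected_utility(p_heads):
    """Probability that your coin matches Omega's choice (assumed payoff: 1 for a match)."""
    return p_heads if omega_choice(p_heads) == "H" else 1 - p_heads

# Evaluate on a coarse grid of probabilities 0.0, 0.1, ..., 1.0.
grid = [i / 10 for i in range(11)]
best_p = max(grid, key=expected_utility)

# A linear function would satisfy U(0.5) == (U(0) + U(1)) / 2; this one does not.
midpoint_gap = expected_utility(0.5) - (expected_utility(0.0) + expected_utility(1.0)) / 2
print(best_p, midpoint_gap)
```

On this grid the utility peaks at p = 1/2, and the nonzero `midpoint_gap` shows that utility is not linear in the chosen probability.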

I am talking about the surreal number ε, which is smaller than any positive real. Events of likelihood ε do not actually happen, we just keep them around so the counterfactual reasoning does not divide by 0.

Within the simulation, the AI might be able to conclude that it just made an ε-likelihood decision and must therefore be in a counterfactual simulation. It should of course carry on as it were, in order to help the simulating version of itself.

Why shouldn’t the environment be learning?

To the Omega scenario I would say that since we have an Omega-proof random number generator, we get new strategic options that should be included in the available actions. The linear function then goes from the ε-adjoined open affine space generated by {Pick H with probability p | p real, non-negative and at most 1} to the ε-adjoined utilities, and we correctly solve Omega’s problem by using p=1/2.

Yeah, so it’s like you have this private data, which is an infinite sequence of bits, and if you see all 0s you take an exploration action. I think that by giving the agent these private bits and promising that the bits do not change the rest of the world, you are essentially giving the agent access to a causal counterfactual that you constructed. You don’t even have to mix with what the agent actually does, you can explore with every action and ask if it is better to explore and take 5 or explore and take 10. By doing this, you are essentially giving the agent access to a causal counterfactual, because conditioning on these infinitesimals is basically like coming in and changing what the agent does. I think giving the agent a true source of randomness actually does let you implement CDT.

If the environment learns from the other possible worlds, it might punish or reward you in one world for stuff that you do in the other world, so you can’t just ask which world is best to figure out what to do.

I agree that that is how you want to think about the matching pennies problem. However the point is that your proposed solution assumed linearity. It didn’t empirically observe linearity. You have to be able to tell the difference between the situations in order to know not to assume linearity in the matching pennies problem. The method for telling the difference is how you determine whether or not and in what ways you have logical control over Omega’s prediction of you.

I posit that linearity always holds. In a deterministic universe, the linear function is between the ε-adjoined open affine space generated by our primitive set of actions and the ε-adjoined utilities. (Like in my first comment.)

In a probabilistic universe, the linear function is between the ε-adjoined open affine space generated by (the set of points in) the closed affine space generated by our primitive set of actions and the ε-adjoined utilities. (Like in my second comment.)

I got from one of your comments that assuming linearity wards off some problem. Does it come back in the probabilistic-universe case?

My point was that I don’t know where to assume the linearity is. Whenever I have private randomness, I have linearity over what I end up choosing with that randomness, but not linearity over what probability I choose. But I think this is not getting at the disagreement, so I pivot to:

In your model, what does it mean to prove that U is some linear affine function? If I prove that my probability p is 1/2 and that U=7.5, have I proven that U is the constant function 7.5? If there is only one value of p, it is not defined what the utility function is, unless I successfully carve the universe in such a way as to let me replace the action with various things and see what happens. (Or, assuming linearity, replace the probability with enough linearly independent things (in this case 2) to define the function.)

In the matching pennies game, U() would be proven to be ∫A()(p)∗min(p,1−p)dp. A could maximize this by returning ε when p isn’t 1/2, and 1−∫ ε dp (where ε is so small that this is still infinitesimally close to 1) when p is 1/2.

The linearity is always in the function between ε-adjoined open affine spaces. Whether the utilities also end up linear in the closed affine space (ie nobody cares about our reasoning process) is for the object-level information gathering process to deduce from the environment.

You never prove that you will with certainty decide p=1/2. You always leave a so-you’re-saying-there’s-a-chance of exploration, which produces a grain of uncertainty. To execute the action, you inspect the ceremonial Boltzmann Bit (which is implemented by being constantly set to “discard the ε”), but which you treat as having an ε chance of flipping.

The self-modification module could note that inspecting that bit is a no-op, see that removing it would make the counterfactual reasoning module crash, and leave up the Chesterton fence.

But how do you avoid proving with certainty that p=1/2?

Since your proposal does not say what to do if you find inconsistent proofs that the linear function is two different things, I will assume that if it finds multiple different proofs, it defaults to 5 for the following.

Here is another example:

You are in a 5 and 10 problem. You have a twin that is also in a 5 and 10 problem. You have exactly the same source code. There is a consistency checker, and if you and your twin do different things, you both get 0 utility.

You can prove that you and your twin do the same thing. Thus you can prove that the function is 5+5p. You can also prove that your twin takes 5 by Löb’s theorem. (You can also prove that you take 5 by Löb’s theorem, but you ignore that proof, since “there is always a chance”.) Thus, you can prove that the function is 5-5p. Your system doesn’t know what to do with two functions, so it defaults to 5. (If it is provable that you both take 5, you both take 5, completing the proof by Löb.)

I am doing the same thing as before, but because I put it outside of the agent, it does not get flagged with the “there is always a chance” module. This is trying to illustrate that your proposal takes advantage of a separation between the agent and the environment that was snuck in, and could be done incorrectly.

Two possible fixes:

1) You could say that the agent, instead of taking 5 when finding inconsistency, takes some action that exhibits the inconsistency (something for which the two functions give different values). This is very similar to the chicken rule, and if you add something like this, you don’t really need the rest of your system. Take an agent that, whenever it proves it does something, does something else. This agent will prove (given enough time) that if it takes 5 it gets 5, and if it takes 10 it gets 10.

2) I had one proof system, and just ignored the proofs that I found that I did a thing. I could instead give the agent a special proof system that is incapable of proving what it does, but how do you do that? Chicken rule seems like the place to start.

One problem with the chicken rule is that it was developed in a system that was deductively closed, so you can’t prove something that passes through a proof of P without proving P. If you violate this, by having a random theorem prover, you might have a system that fails to prove “I take 5” but proves “I take 5 and 1+1=2” and uses this to complete the Löb loop.

I can’t prove what I’m going to do and I can’t prove that I and the twin are going to do the same thing, because of the Boltzmann Bits in both of our decision-makers that might turn out different ways. But I can prove that we have a 1−2ε+2ε² chance of doing the same thing, and my expected utility is (1−ε)²⋅10+ε²⋅5, rounding to 10 once it actually happens.
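The two quantities in the comment above can be checked with a short sketch. It assumes both twins intend to take 10, that each Boltzmann Bit flips independently with chance ε, and it uses a small rational stand-in for ε; `twin_outcomes` is a hypothetical helper name.

```python
from fractions import Fraction

def twin_outcomes(eps):
    """Both agents intend to take 10; each independently flips to 5 with chance eps."""
    both_ten = (1 - eps) ** 2
    both_five = eps ** 2
    p_same = both_ten + both_five
    # Mismatched actions pay 0 because of the consistency checker.
    expected_utility = both_ten * 10 + both_five * 5
    return p_same, expected_utility

eps = Fraction(1, 100)  # stand-in for the infinitesimal ε (an assumption)
p_same, eu = twin_outcomes(eps)
print(p_same, eu)
```

Expanding (1−ε)² + ε² gives 1−2ε+2ε², which is the chance of matching stated in the comment.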

Content feedback: the inferential distance between Löb’s theorem and spurious counterfactuals seems larger than that of the other points. Maybe that’s because I haven’t internalised the theorem, not being a logician and all.

Unnecessary nitpick: the gears in the robot’s brain would turn just fine as drawn: since the outer gears are both turning anticlockwise, the inner gear would just turn clockwise. (I think my inner engineer is showing)

If you know your own actions, why would you reason about taking different actions? Wouldn’t you reason about someone who is almost like you, but just different enough to make a different choice?

Sure. How do you do that?

Notice (well, you already know that) that accepting that identical agents make identical decisions (superrationality, as it were), and that to make different decisions in identical circumstances the agents must necessarily be different, gets you out of many pickles. For example, in the 5&10 game an agent would examine its own algorithm, see that it leads to taking $10 and stop there. There is no “what would happen if you took a different action”, because the agent taking a different action would not be you, not exactly. So, no Löbian obstacle. In return, you give up something a lot more emotionally valuable: the delusion of making conscious decisions. Pick your poison.

Why do even that much if this reasoning could not be used? The question is about the reasoning that could contribute to the decision, that could describe the algorithm, and so has the option to not “stop there”. What if you see that your algorithm leads to taking the $10 and instead of stopping there, you take the $5?

Nothing stops you. This is the “chicken rule” and it solves some issues, but more importantly illustrates the possibility in how a decision algorithm can function. The fact that this is a thing is evidence that there may be something wrong with the “stop there” proposal. Specifically, you usually don’t know that your reasoning is actual, that it’s even logically possible and not part of an impossible counterfactual, but this is not a hopeless hypothetical where nothing matters. Nothing compels you to affirm what you know about your actions or conclusions, this is not a necessity in a decision making algorithm, but different things you do may have an impact on what happens, because the situation may be actual after all, depending on what happens or what you decide, or it may be predicted from within an actual situation and influence what happens there. This motivates learning to reason in and about possibly impossible situations.

What if you examine your algorithm and find that it takes the $5 instead? It could be the same algorithm that takes the $10, but you don’t know that, instead you arrive at the $5 conclusion using reasoning that could be impossible, but that you don’t know to be impossible, that you haven’t decided yet to make impossible. One way to solve the issue is to render the situation where that holds impossible, by contradicting the conclusion with your action, or in some other way. To know when to do that, you should be able to reason about and within such situations that could be impossible, or could be made impossible, including by the decisions made in them. This makes the way you reason in them relevant, even when in the end these situations don’t occur, because you don’t a priori know that they don’t occur.

(The 5-and-10 problem is not specifically about this issue, and explicit reasoning about impossible situations may be avoided, perhaps should be avoided, but my guess is that the crux in this comment thread is about things like usefulness of reasoning from within possibly impossible situations, where even your own knowledge arrived at by pure computation isn’t necessarily correct.)

Thank you for your explanation! Still trying to understand it. I understand that there is no point examining one’s algorithm if you already execute it and see what it does.

I don’t understand that point. You say “nothing stops you”, but that is only possible if you could act contrary to your own algorithm, no? Which makes no sense to me, unless the same algorithm gives different outcomes for different inputs, e.g. “if I simply run the algorithm, I take $10, but if I examine the algorithm before running it and then run it, I take $5”. But it doesn’t seem like the thing you mean, so I am confused.

How can it be possible? If your examination of your algorithm is accurate, it gives the same outcome as mindlessly running it, which is taking $10, no?

So your reasoning is inaccurate, in that you arrive at a wrong conclusion about the algorithm output, right? You just don’t know where the error lies, or even that there is an error to begin with. But in this case you would arrive at a wrong conclusion about the same algorithm run by a different agent, right? So there is nothing special about it being your own algorithm and not someone else’s. If so, the issue is reduced to finding an accurate algorithm analysis tool, for an algorithm that demonstrably halts in a very short time, producing one of the two possible outcomes. This seems to have little to do with decision theory issues, so I am lost as to how this is relevant to the situation.

I am clearly missing some of your logic here, but I still have no idea what the missing piece is, unless it’s the libertarian free will thing, where one can act contrary to one’s programming. Any further help would be greatly appreciated.

Rather there is no point if you are not going to do anything with the results of the examination. It may be useful if you make the decision based on what you observe (about how you make the decision).

You can, for a certain value of “can”. It won’t have happened, of course, but you may still decide to act contrary to how you act, two different outcomes of the same algorithm. The contradiction proves that you didn’t face the situation that triggers it in actuality, but the contradiction results precisely from deciding to act contrary to the observed way in which you act, in a situation that a priori could be actual, but is rendered counterlogical as a result of your decision. If instead you affirm the observed action, then there is no contradiction and so it’s possible that you have faced the situation in actuality. Thus the “chicken rule”, playing chicken with the universe, making the present situation impossible when you don’t like it.

You don’t know that it’s inaccurate, you’ve just run the computation and it said $5. Maybe this didn’t actually happen, but you are considering this situation without knowing if it’s actual. If you ignore the computation, then why run it? If you run it, you need responses to all possible results, and all possible results except one are not actual, yet you should be ready to respond to them without knowing which is which. So I’m discussing what you might do for the result that says that you take the $5. And in the end, the use you make of the results is by choosing to take the $5 or the $10.

This map from predictions to decisions could be anything. It’s trivial to write an algorithm that includes such a map. Of course, if the map diagonalizes, then the predictor will fail (won’t give a prediction), but the map is your reasoning in these hypothetical situations, and the fact that the map may say anything corresponds to the fact that you may decide anything. The map doesn’t have to be identity, decision doesn’t have to reflect prediction, because you may write an algorithm where it’s not identity.
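A minimal sketch of such maps from predictions to decisions, using the two actions of the 5-and-10 problem. The function names are illustrative, and the "predictor" here is just a brute-force search for predictions that the map does not contradict, not a proof search.

```python
ACTIONS = (5, 10)

def affirm(prediction):
    """Identity map: do whatever was predicted."""
    return prediction

def diagonalize(prediction):
    """Chicken-rule style map: do the opposite of whatever was predicted."""
    return 10 if prediction == 5 else 5

def constant_ten(prediction):
    """Ignore the prediction and take 10."""
    return 10

def consistent_predictions(decision_map):
    """Predictions p that the map does not contradict, i.e. decision_map(p) == p."""
    return [p for p in ACTIONS if decision_map(p) == p]

print(consistent_predictions(affirm))        # every prediction is self-consistent
print(consistent_predictions(diagonalize))   # no prediction survives
print(consistent_predictions(constant_ten))  # only the prediction of 10 survives
```

With the diagonalizing map no prediction can be accurate, which corresponds to the predictor failing to give a prediction. With the other two maps, the set of surviving predictions depends entirely on how the map was written.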

This confuses me even more. You can imagine acting contrary to your own algorithm, but imagining different possible outcomes is a side effect of running the main algorithm that takes $10. It is never the outcome of it. Or an outcome. Since you know you will end up taking $10, I also don’t understand the idea of playing chicken with the universe. Are there any references for it?

Wait, what? We started with the assumption that examining the algorithm, or running it, shows that you will take $10, no? I guess I still don’t understand how

is even possible, or worth considering.

Hmm, maybe this is where I miss some of the logic. If the predictions are accurate, the map is bijective. If the predictions are inaccurate, you need a better algorithm analysis tool.

To me this screams “get a better algorithm analyzer!” and has nothing to do with whether it’s your own algorithm, or someone else’s. Can you maybe give an example where one ends up in a situation where there is no obvious algorithm analyzer one can apply?

Content feedback:

The Preface to the Sequence on Value Learning contains the following advice on research directions for that sequence:

This provides specific direction on what to look at and what work needs done. If such a statement for this sequence is possible, I think it would be valuable to include.

It was not until reading this that I really understood that I am in the habit of reasoning about myself as just a part of the environment.

The kicker is that we don’t reason directly about ourselves as such, we use a simplified model of ourselves. And we’re REALLY GOOD at using that model for causal reasoning, even when it is reflective, and involves multiple levels of self-reflection and counterfactuals—at least when we bother to try. (We try rarely because explicit modelling is cognitively demanding, and we usually use defaults / conditioned reasoning. Sometimes that’s OK.)

Example: It is 10PM. A 5-page report is due in 12 hours, at 10AM.

Default: Go to sleep at 1AM, set alarm for 8AM. Result: Don’t finish report tonight, have too little time to do so tomorrow.

Conditioned reasoning: Stay up to finish the report first. 5 hours of work, and stay up until 3AM. Result? Write bad report, still feel exhausted the next day

Counterfactual reasoning: I should nap / get some amount of sleep so that I am better able to concentrate, which will outweigh the lost time. I could set my alarm for any amount of time; what amount does my model of myself imply will lead to an optimal well-rested / sufficient time trade-off?

Self-reflection problem, second use of mini-self model: I’m worse at reasoning at 1AM than I am at 10PM. I should decide what to do now, instead of delaying until then. I think going to sleep at 12AM and waking at 3AM gives me enough rest and time to do a good job on the report.

Consider counterfactual and impact: How does this impact the rest of my week’s schedule? 3 hours is locally optimal, but I will crash tomorrow and I have a test to study for the next day. Decide to work a bit, go to sleep at 12:30 and set alarm for 5:30AM. Finish the report, turn it in by 10AM, then nap another 2 hours before studying.

We built this model based on not only small samples of our own history, but learning from others, incorporating data about seeing other people’s experiences. We don’t consider staying up all night and then driving to hand in the report, because we realize exhausted driving is dangerous—because we heard stories of people doing so, and know that we would be similarly unsteady. Is a person going to explore and try different strategies by staying up all night and driving? If you die, you can’t learn from the experience—so you have good ideas about what parts of the exploration space are safe to try. You might use Adderall because it’s been tried before and is relatively safe, but you don’t ingest arbitrary drugs to see if they help you think.

BUT an AI doesn’t (at first) have that sample data to reason from, nor does a singleton have observation of other near-structurally identical AI systems and the impacts of their decisions, nor does it have a fundamental understanding about what is safe to explore.

## Thoughts on counterfactual reasoning

These examples of counterfactuals are presented as equivalent, but they seem meaningfully distinct:

Specifically, they don’t seem equally difficult for me to evaluate. I can easily imagine the sun going out, but I’m not even sure what it would mean if 2+2=3. It confuses me that these two different examples are presented as equivalent, because they seem to be instances of meaningfully distinct classes of something. I spent some time trying to characterize why the sun example is intuitively easy for me and the math example is intuitively difficult for me. I came up with some ideas, but I won’t go into details yet because they seem like the obvious sorts of things that anyone who has read The Sequences (a.k.a., Rationality: A-Z) would have thought of. I strongly suspect there’s prior work. It is also possible that I don’t fully understand the problem yet.

## Questions about counterfactual reasoning

The two counterfactual reasoning examples above (and others) are presented as equivalent, but they seem like they are not.

1. Is this an intentional simplification for the benefit of new readers?

2. If so, can someone point me to the prior work exploring the omitted nuances of counterfactuals? I don’t want to re-invent the wheel.

3. If not, would exploration of the characteristics of different kinds of counterfactuals be a fruitful area of research?