No mathematical decision theory requires verbal explanations to be part of the model that it operates on. (It’s true that when learning a causal model from data, you need causal assumptions; but when a problem provides the model rather than the data, this is not necessary.)
What I’m saying is that the only way to solve any decision theory problem is to learn a causal model from data. It just doesn’t make sense to postulate particular correlations between an EDT agent’s decisions and other things before you even know what EDT decides! The only reason you get away with assuming graphs like lesion -> (CDT Agent) -> action for CDT is because the first thing CDT does when calculating a decision is break all connections to parents by means of do(...).
Take Jiro’s example. The lesion makes people jump into volcanoes. 100% of them, and no-one else. Furthermore, I’ll postulate that all of them are using decision theory “check if I have the lesion, if so, jump into a volcano, otherwise don’t”. Should you infer the causal graph lesion -> (EDT decision: jump?) -> die with a perfect correlation between lesion and jump? (Hint: no, that would be stupid, since we’re not using jump-based-on-lesion-decision-theory, we’re using EDT.)
There is a mindset that programming cultivates, which is that the system does exactly what you tell it to, with the corollary that your intentions have no weight.
In programming, we also say “garbage in, garbage out”. You are feeding EDT garbage input by giving it factually wrong joint probability distributions.
Ok, what about cases where there are multiple causal hypotheses that are observationally indistinguishable:
a → b → c
vs
a ← b ← c
Both models imply the same joint probability distribution p(a,b,c) with a single conditional independence (a independent of c given b) and cannot be told apart without experimentation. That is, you cannot call p(a,b,c) “factually wrong” because the correct causal model implies it. But the wrong causal model implies it too! To figure out which is which requires causal information. You can give it to EDT and it will work—but then it’s not EDT anymore.
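To make the indistinguishability claim concrete, here is a minimal sketch. The probability tables are hypothetical numbers chosen only for illustration. It builds the joint from the chain a → b → c, refits the same joint as the reversed chain a ← b ← c, and checks that the two agree.

```python
from itertools import product

# Hypothetical parameters for the forward chain a -> b -> c (binary variables).
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # indexed [a][b]
p_c_given_b = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}  # indexed [b][c]

# Joint distribution implied by the forward chain: p(a) p(b|a) p(c|b).
forward = {
    (a, b, c): p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]
    for a, b, c in product((0, 1), repeat=3)
}


def marginal(joint, keep):
    """Sum the joint over every variable whose index is not in `keep`."""
    out = {}
    for assignment, prob in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out


# Read the reversed-chain parameters p(c), p(b|c), p(a|b) off the same joint.
p_c = marginal(forward, (2,))
p_bc = marginal(forward, (1, 2))
p_b = marginal(forward, (1,))
p_ab = marginal(forward, (0, 1))

# Joint distribution implied by the reversed chain: p(c) p(b|c) p(a|b).
backward = {
    (a, b, c): p_c[(c,)] * (p_bc[(b, c)] / p_c[(c,)]) * (p_ab[(a, b)] / p_b[(b,)])
    for a, b, c in product((0, 1), repeat=3)
}

# Largest disagreement between the two joints (zero up to rounding).
max_gap = max(abs(forward[k] - backward[k]) for k in forward)
print(max_gap)
```

Both factorizations reproduce the same p(a,b,c), so no amount of observational data drawn from it can say which arrow direction is correct.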
I can give you a graph which implies the same independences as my HAART example but has a completely different causal structure, and the procedure you propose here (http://lesswrong.com/lw/hwq/evidential_decision_theory_selection_bias_and/9d6f) will give the right answer in one case and the wrong answer in another.
The point is, EDT lacks a rich enough input language to avoid getting garbage inputs in lots of standard cases. Or, more precisely, EDT lacks a rich enough input language to tell when input is garbage and when it isn’t. This is why EDT is a terrible decision theory.
What I’m saying is that the only way to solve any decision theory problem is to learn a causal model from data.
I think there are a couple of confusions this sentence highlights.
First, there are approaches to solving decision theory problems that don’t use causal models. Part of what has made this conversation challenging is that there are several different ways to represent the world, and so even if CDT is the best / natural one, it needs to be distinguished from other approaches. EDT is not CDT in disguise; the two are distinct formulas / approaches.
Second, there are good reasons to modularize the components of the decision theory, so that you can treat learning a model from data separately from making a decision given a model. An algorithm to turn models into decisions should be able to operate on an arbitrary model, where it sees a → b → c as isomorphic to Drunk → Fall → Death.
To tell an anecdote, when my decision analysis professor would teach that subject to petroleum engineers, he quickly learned not to use petroleum examples. Say something like “suppose the probability of striking oil by drilling a well here is 40%” and an engineer’s hand will shoot up, asking “what kind of rock is it?”. The kind of rock is useful for determining whether or not the probability is 40% or something else, but the question totally misses the point of what the professor is trying to teach. The primary example he uses is choosing a location for a party subject to the uncertainty of the weather.
It just doesn’t make sense to postulate particular correlations between an EDT agent’s decisions and other things before you even know what EDT decides!
I’m not sure how to interpret this sentence.
The way EDT operates is to perform the following three steps for each possible action in turn:
1. Assume that I saw myself doing X.
2. Perform a Bayesian update on this new evidence.
3. Calculate and record my utility.
It then chooses the possible action which had the highest calculated utility.
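The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `edt_choose`, the lesion/smoking joint distribution, and the utility numbers are all hypothetical, and whether such a joint is a legitimate input for an EDT agent is exactly what this thread disputes.

```python
def edt_choose(joint, utility, actions):
    """Pick the action with the highest conditional expected utility.

    joint:   dict mapping (action, state) -> probability (hypothetical input)
    utility: function (action, state) -> number
    """
    scores = {}
    for x in actions:
        # Step 1: assume I saw myself doing x.
        p_x = sum(p for (a, _), p in joint.items() if a == x)
        if p_x == 0:
            continue  # conditioning on a zero-probability action is undefined
        # Step 2: Bayesian update on that evidence, giving P(state | action = x).
        posterior = {s: p / p_x for (a, s), p in joint.items() if a == x}
        # Step 3: calculate and record the expected utility.
        scores[x] = sum(p * utility(x, s) for s, p in posterior.items())
    # Choose the action with the highest recorded utility.
    return max(scores, key=scores.get), scores


# Hypothetical joint in which the lesion is correlated with smoking.
joint = {
    ("smoke", "lesion"): 0.4,
    ("smoke", "no_lesion"): 0.1,
    ("abstain", "lesion"): 0.1,
    ("abstain", "no_lesion"): 0.4,
}


def utility(action, state):
    # Hypothetical payoffs: smoking is worth +1, having the lesion costs 10.
    return (1 if action == "smoke" else 0) - (10 if state == "lesion" else 0)


best_action, scores = edt_choose(joint, utility, ["smoke", "abstain"])
print(best_action, scores)
```

With these numbers the procedure prefers abstaining, because observing "smoke" raises the posterior probability of the lesion. Nothing in the function knows or cares which way the causal arrows point.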
One interpretation is that you are saying EDT doesn’t make sense, but I’m not sure I agree with what seems to be the stated reason. It looks to me like you’re saying “it doesn’t make sense to assume that you do X until you know what you decide!”, when I think that does make sense, but the problem is using that assumption as Bayesian evidence as if it were an observation.
The way EDT operates is to perform the following three steps for each possible action in turn:
1. Assume that I saw myself doing X.
2. Perform a Bayesian update on this new evidence.
3. Calculate and record my utility.
Ideal Bayesian updates assume logical omniscience, right? Including knowledge of the logical fact of what EDT would do for any given input. If you know that you are an EDT agent, and condition on all of your past observations and also on the fact that you do X, but X is not in fact what EDT does given those inputs, then as an ideal Bayesian you will know that you’re conditioning on something impossible. More generally, what update you perform in step 2 depends on EDT’s input-output map, thus making the definition circular.
So, is EDT really underspecified? Or are you supposed to search for a fixed point of the circular definition, if there is one? Or does it use some method other than Bayes for the hypothetical update? Or does an EDT agent really break if it ever finds out its own decision algorithm? Or did I totally misunderstand?
Ideal Bayesian updates assume logical omniscience, right? Including knowledge about logical fact of what EDT would do for any given input.
Note that step 1 is “Assume that I saw myself doing X,” not “Assume that EDT outputs X as the optimal action.” I believe that excludes any contradictions along those lines. Does logical omniscience preclude imagining counterfactual worlds?
If I already know “I am EDT”, then “I saw myself doing X” does imply “EDT outputs X as the optimal action”. Logical omniscience doesn’t preclude imagining counterfactual worlds, but imagining counterfactual worlds is a different operation than performing Bayesian updates. CDT constructs counterfactuals by severing some of the edges in its causal graph and then assuming certain values for the nodes that no longer have any causes. TDT does too, except with a different graph and a different choice of edges to sever.
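The difference between the two operations can be shown on a tiny example. This is a sketch assuming a hypothetical graph in which a lesion influences the action, with made-up probabilities; the function names are illustrative only.

```python
# Hypothetical model: lesion -> action (smoke or not), binary variables.
p_lesion = 0.5
p_smoke_given_lesion = {True: 0.8, False: 0.2}


def p_lesion_given_observed(smoke: bool) -> float:
    """Bayesian update: treat the action as evidence about its causes."""
    like_lesion = p_smoke_given_lesion[True] if smoke else 1 - p_smoke_given_lesion[True]
    like_none = p_smoke_given_lesion[False] if smoke else 1 - p_smoke_given_lesion[False]
    numerator = p_lesion * like_lesion
    return numerator / (numerator + (1 - p_lesion) * like_none)


def p_lesion_given_do(smoke: bool) -> float:
    """Intervention: sever the lesion -> action edge, then set the action.

    With the edge cut, the action carries no information about the lesion,
    so the lesion stays at its prior whatever value the action is given.
    """
    return p_lesion


print(p_lesion_given_observed(True), p_lesion_given_do(True))
```

Conditioning moves the probability of the lesion away from its prior; severing the incoming edge leaves it where it was. That gap is the sense in which a counterfactual is not a Bayesian update.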
I don’t know how I can fail to communicate so consistently.
Yes, you can technically apply “EDT” to any causal model or (more generally) joint probability distribution containing an “EDT agent decision” node. But in practice this freedom is useless, because to derive an accurate model you generally need to take account of a) the fact that the agent is using EDT and b) any observations the agent does or does not make. To be clear, the input EDT requires is a probabilistic model describing the EDT agent’s situation (not describing historical data of “similar” situations).
There are people here trying to argue against EDT by taking a model describing historical data (such as people following dumb decision theories jumping into volcanoes) and feeding this model directly into EDT. Which is simply wrong. A model that describes the historical behaviour of agents using some other decision theory does not in general accurately describe an EDT agent in the same situation.
The fact that this egregious mistake looks perfectly normal is an artifact of the fact that CDT doesn’t care about causal parents of the “CDT decision” node.
I don’t know how I can fail to communicate so consistently.
I suspect it’s because what you are referring to as “EDT” is not what experts in the field use that technical term to mean.
nsheppard-EDT is, as far as I can tell, the second half of CDT. Take a causal model and use the do() operator to create the manipulated subgraph that would result from taking a possible action (as an intervention). Determine the joint probability distribution from the manipulated subgraph. Condition on observing that action with the joint probability distribution, and calculate the probabilistically-weighted mean utility of the possible outcomes. This is isomorphic to CDT, and so referring to it as EDT leads to confusion.
Whatever. I give up.