This is part of a sequence on decision analysis.
Decision-making under certainty is pretty boring. You know exactly what each choice will do, and so you order the outcomes based on your preferences, and pick the action that leads to the best outcome.
Human decisions, though, are made in the presence of uncertainty. Decision analysis (careful decision making) is all about coping with that uncertainty.
Some terminology: a distinction is something uncertain; an event is one of the possible outcomes of that distinction; a prospect is an event that you have a personal stake in; and a deal is a distinction over prospects. This post will focus on distinctions and events. If you’re comfortable with probability, just jump to the four questions posed in this post and make sure you get the answers right. Deals are the interesting part, but require this background.
I should say from the very start that I am quantifying uncertainty as “probability.” There is only one 800th digit of Pi (in base 10), other people already know it, and it’s not going to change. I don’t know what it is, though, and so when I talk about the probability that the 800th digit of Pi is a particular number, what I’m describing is what’s going on in my head. Right now, my map is mostly blank (I assign probability .1 to each digit from 0 to 9); once I look it up, the map will change but the territory will not. I’ll use uncertainty and probability interchangeably throughout this post.
The 800th digit of Pi (in base 10) is a distinction with 10 possible events, 0 through 9. To be sensible, distinctions should be clear and unambiguous. A distinction like “the temperature tomorrow” is unclear: the temperature where, and at what time tomorrow? A distinction like “the maximum temperature recorded by the National Weather Service at the Austin-Bergstrom International Airport in the 24 hours before midnight (EST) on 11/30/2011” is unambiguous. Think of it like PredictionBook: you want to be able to create this distinction such that anyone could come across it and know what you’re referring to.
Possibilities can be discrete or continuous. There are only a finite number of possible digits for the 800th digit of Pi, but the temperature is continuous and unbounded.[1] A biased coin has a continuous parameter p that refers to how likely it is to land on heads in certain conditions; while that’s bounded by 0 and 1, there are an infinite number of possibilities in between.
For now, let’s focus on distinctions with discrete possibilities. Suppose we have four cards: two blue and two red. We shuffle the cards and draw two of them. What is the probability that both drawn cards will be red?
This is a simple problem, but one that many people get wrong, so let’s step through it as carefully as possible. There are two distinctions here: the color of the first drawn card, and the color of the second drawn card. For each distinction, the possible events are blue (B) and red (R). The probability that the first card is red we’ll express as P(R|&). That should be read as “probability of drawing a red card given background knowledge.” The “&” refers to all the knowledge the problem has given us; sometimes it’s left off and we just talk about P(R). There are four possible cards, two of which are red, and so P(R|&)=2/4=1/2.
Now we need to figure out the probability that the second card is red. We’ll express that as P(R|R&), which means “the probability of drawing a red card given background knowledge and a drawn red card.” There are three cards left, one of which is red, and so the probability is now 1⁄3.
But what we’re really interested in is P(RR|&), “the probability of drawing two red cards given background knowledge.” We can divide this single distinction into two distinctions: P(RR|&)=P(R|&)*P(R|R&)=1/2*1/3=1⁄6. Probabilities are conjoined by multiplication.
Notice that, for the first two cards drawn, there are four events: RR, RB, BR, and BB. Those events have different probabilities: 1⁄6, 1⁄3, 1⁄3, and 1⁄6. Those represent the joint probability distribution of the first two cards, and the joint probability distribution contains all the information we need. If you’re interested in the chance that the second card is blue with no information about the first (P(*B|&)), you add up RB and BB to get 1/3+1/6=1/2 (which is what you should have expected it to be).
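The joint distribution above can be checked by brute-force enumeration. The following is a minimal sketch, assuming we label the four cards individually (the labels `R1`, `R2`, `B1`, `B2` are illustrative and not part of the original problem) so that every ordered pair of distinct cards is equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

# Four distinguishable cards: two red, two blue (labels are illustrative).
cards = ["R1", "R2", "B1", "B2"]

# Every ordered pair of distinct cards is an equally likely first-two draw.
draws = list(permutations(cards, 2))

# Reduce each draw to its colors, e.g. ("R1", "B2") -> "RB", and count.
counts = Counter(first[0] + second[0] for first, second in draws)

# Joint distribution over the events RR, RB, BR, BB.
joint = {event: Fraction(n, len(draws)) for event, n in counts.items()}

# Marginal P(*B|&): add up every event whose second card is blue.
p_second_blue = sum(p for event, p in joint.items() if event[1] == "B")

for event in ("RR", "RB", "BR", "BB"):
    print(event, joint[event])
print("P(*B|&) =", p_second_blue)
```

Using `Fraction` keeps the results exact, so they can be compared directly with the 1/6, 1/3, 1/3, 1/6 quoted in the text.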
Bayes’ Rule, by the way, is easy to see when discussing events. If I wanted to figure out P(RB|*B&), what I want to do is take the event RB (probability 1⁄3) and make it more likely by dividing out the probability of my current state of knowledge (that the second card was blue, probability 1⁄2): (1/3)/(1/2)=2/3. Alternatively, I could consider the event RB as a fraction of the set of events that fit my knowledge, which is both RB and BB: (1/3)/(1/3+1/6)=2/3.
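The "fraction of the events that fit my knowledge" view translates almost directly into code. This is a small sketch; the helper name `condition` is my own choice for illustration, and the joint probabilities are the ones quoted in the text.

```python
from fractions import Fraction

# Joint distribution of the first two cards, as given in the text.
joint = {
    "RR": Fraction(1, 6),
    "RB": Fraction(1, 3),
    "BR": Fraction(1, 3),
    "BB": Fraction(1, 6),
}


def condition(joint, fits):
    """Keep only the events that fit our knowledge, then renormalize."""
    kept = {event: p for event, p in joint.items() if fits(event)}
    total = sum(kept.values())  # probability of our state of knowledge
    return {event: p / total for event, p in kept.items()}


# Knowledge: the second card was blue.
posterior = condition(joint, lambda event: event[1] == "B")
print(posterior["RB"])  # 2/3
```

Dividing by `total` is the "dividing out the probability of my current state of knowledge" step; the filter is the "set of events that fit my knowledge" step.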
Most people who get the question about cards wrong get it wrong because they square 1⁄2 to get 1⁄4, forgetting that the second card depends on the first. Since there’s a limited supply of cards, as soon as you draw one you can be more certain that the next card isn’t that color.
Dependence is distinct from causality. If I hear the weatherman claim that it will rain with 50% probability, that will adjust my certainty that it will rain, even though the weatherman can’t directly influence whether or not it will rain. Some people use the word relevance instead, as it’s natural to think that the weatherman’s prediction is relevant to the likelihood of rain but may not be natural to think that the chance of rain depends on the weatherman’s prediction.
Relevance goes both ways. If the weatherman’s prediction gives me knowledge about whether or not it will rain, then knowing whether or not it rained gives me knowledge about what the weatherman’s prediction was. Bayes’ Rule is critical for maneuvering through relevant distinctions. Suppose the weatherman could give only two predictions: Sunny or Rainy. If he predicts Sunny, it will rain with 10% probability. If he predicts Rainy, it will rain with 50% probability. If it rains 20% of the time, how often does he predict Rainy?
Suppose it rains. What’s the chance that the weatherman predicted Rainy?
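This is a simple application of Bayes’ Rule: P(Rainy|Rain)=P(Rain|Rainy)P(Rainy)/P(Rain). The overall rain frequency pins down P(Rainy): .5*P(Rainy)+.1*(1-P(Rainy))=.2 gives P(Rainy)=.25, and so P(Rainy|Rain)=(.5)(.25)/(.2)=.625.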
Alternatively, we can figure out the probabilities of the four elementary events: P(Rainy,Rain)=.125, P(Rainy,Sun)=.125, P(Sunny,Rain)=.075, P(Sunny,Sun)=.675. If we know it rained and want to know if he predicted Rainy, we care about P(Rainy,Rain)/(P(Rainy,Rain)+P(Sunny,Rain)).
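The weatherman numbers can be reproduced in a few lines. This is a sketch under the problem's stated assumptions (10% rain after a Sunny prediction, 50% after a Rainy prediction, 20% rain overall); the variable names are mine.

```python
from fractions import Fraction

p_rain_given_rainy = Fraction(1, 2)   # P(Rain | he predicted Rainy)
p_rain_given_sunny = Fraction(1, 10)  # P(Rain | he predicted Sunny)
p_rain = Fraction(1, 5)               # overall frequency of rain

# Total probability: p_rain = P(Rain|Rainy)*q + P(Rain|Sunny)*(1 - q).
# Solving that for q = P(Rainy):
p_rainy = (p_rain - p_rain_given_sunny) / (p_rain_given_rainy - p_rain_given_sunny)
p_sunny = 1 - p_rainy

# The four elementary events, keyed as (prediction, weather).
joint = {
    ("Rainy", "Rain"): p_rainy * p_rain_given_rainy,
    ("Rainy", "Sun"): p_rainy * (1 - p_rain_given_rainy),
    ("Sunny", "Rain"): p_sunny * p_rain_given_sunny,
    ("Sunny", "Sun"): p_sunny * (1 - p_rain_given_sunny),
}

# Bayes' Rule as a ratio of elementary events.
p_rainy_given_rain = joint[("Rainy", "Rain")] / (
    joint[("Rainy", "Rain")] + joint[("Sunny", "Rain")]
)

print("P(Rainy) =", p_rainy)
print("P(Rainy|Rain) =", p_rainy_given_rain)
```

The four joint values come out as 1/8, 1/8, 3/40, and 27/40, which match the decimals .125, .125, .075, and .675 listed above.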
This can get very complicated if there are a large number of events or relevant distinctions, but software exists to solve that problem.
Suppose, though, that you don’t have just two events to assign probability to. Instead of being uncertain about whether or not it will rain, I might be uncertain about how much it will rain, conditioned on it raining.[2] If I try to elicit a probability for every possible amount, that’ll take me a long time (unless I bin the heights, making it discrete, which still might take far longer or be far harder than I can deal with, if there are lots of bins).
In that case, I would express my uncertainty as a probability density function (pdf) or cumulative distribution function (cdf). The first is the probability density at a particular value, whereas the second is the density integrated from the beginning of the domain to that value. To get a probability from a density, you have to integrate. A pdf can have any non-negative value and any shape over the domain, though it has to integrate to 1, while a cdf has a minimum of 0, a maximum of 1, and is non-decreasing.
Let’s take the example of the biased coin. To make it more precise, since coin flips are messy and physical, suppose I have some random number generator that uniformly generates any real number between 0 and 1, and a device hooked up to it with an unknown threshold value p between 0 and 1.[3] When I press a button, the generator generates a random number and hands it to the device, which then shows a picture of heads if the number is below or equal to the threshold and a picture of tails if the number is above the threshold. I don’t get to see the number that was generated, just a head or tail every time I press the button.
I begin by being uncertain about the threshold value, except knowing its domain. I assign a uniform prior: I think it’s equally likely that the threshold value is at every point between 0 and 1. Mathematically, that means my pdf is P(p=x)=1. I can integrate that from 0 to y to get a cdf of C(p≤y)=∫1dx=y, with the integral running over x from 0 to y. Like we needed, the pdf integrates to 1, the cdf has a minimum of 0 and maximum of 1, and is non-decreasing. From those, we can calculate my certainty that the threshold value is in a particular range (by integrating the pdf over that range) or at any particular point (0, because it’s an integral of 0 width).
Now we press the button, see something, and need to update our uncertainty (probability distribution). How should we do that?
Well, by Bayes’ rule of course! But I’ll do it in a somewhat roundabout way, to give you some more intuition why the rule works. Suppose we saw heads. For each possible threshold value, we know how likely that was: p, the threshold value. We can now compute the probability density of (heads if p) and (p) by multiplying those together, and x times 1 = x. So my pdf is now P(p=x)=x and cdf is C(p≤y)=.5y^2.
Well, not quite. My pdf doesn’t integrate to 1, and my cdf, while it does have a min at 0, doesn’t have a max of 1. I need to renormalize, that is, divide by the chance that I saw heads in the first place. That was 1⁄2, and so I get P(p=x)=2x and C(p≤y)=y^2 and everything works out. If I saw tails, my likelihood is instead 1-p, and that propagates through to P(p=x)=2-2x and C(p≤y)=2y-y^2.
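The multiply-then-renormalize recipe can be sketched numerically by representing the pdf on a grid of threshold values. This is an illustrative approximation, not part of the original derivation: the grid size `N`, the trapezoid-rule integrator, and the function names are all my assumptions.

```python
def trapezoid(values, grid):
    """Integrate sampled values over the grid with the trapezoid rule."""
    return sum(
        (grid[i + 1] - grid[i]) * (values[i] + values[i + 1]) / 2
        for i in range(len(grid) - 1)
    )


def update(prior, likelihood, grid):
    """Multiply the prior density by the likelihood, then renormalize."""
    unnormalized = [d * likelihood(x) for d, x in zip(prior, grid)]
    evidence = trapezoid(unnormalized, grid)  # chance of what we saw
    posterior = [u / evidence for u in unnormalized]
    return posterior, evidence


N = 1000  # number of grid intervals (an arbitrary choice)
grid = [i / N for i in range(N + 1)]
uniform = [1.0] * len(grid)  # the uniform prior, P(p=x)=1

# Seeing heads has likelihood p; seeing tails has likelihood 1-p.
after_heads, p_heads = update(uniform, lambda x: x, grid)
after_tails, p_tails = update(uniform, lambda x: 1 - x, grid)

# The cdf at y=0.5 is the posterior pdf integrated from 0 to 0.5.
half = N // 2
cdf_heads_at_half = trapezoid(after_heads[: half + 1], grid[: half + 1])

print(p_heads, after_heads[half], cdf_heads_at_half)
```

Because `update` takes the likelihood as a function, the same sketch can be reused for other observations by passing in a different likelihood.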
Suppose my setup were even less helpful. Instead of showing heads or tails, it generates two numbers, computes heads or tails for each number separately, and then prints out either “S” if both results were the same or “D” if the results were different. If I start with a uniform prior, what will my pdf and cdf on the threshold value p be after I see S? If I saw D instead? (If you don’t know calculus, don’t worry: most of the rest of this sequence will deal with discrete events.)
That’s a lot of work to do every time you get information, though. If you pick what’s called a conjugate prior, updating is simple, whereas it requires multiplication and integration for an arbitrary prior. The uniform prior is a conjugate prior for the simple biased coin problem, because uniform is a special case of the beta distribution. You can use Be(heads+1,tails+1) as your posterior probability for any number of heads and tails that you see, and the math is already done for you. Conjugate priors are a big part of doing continuous Bayesian analysis in practice, but won’t be too relevant to the rest of this sequence.
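To see the conjugate shortcut at work, here is a small sketch that writes out the beta density by hand and confirms it reproduces the posteriors derived earlier (2x after one head, 2-2x after one tail). The function names are mine; the posterior mean formula a/(a+b) is a standard property of the beta distribution rather than something stated in this post.

```python
from math import gamma


def beta_pdf(x, a, b):
    """Density of the beta distribution Be(a, b) at x, for 0 <= x <= 1."""
    normalizer = gamma(a + b) / (gamma(a) * gamma(b))
    return normalizer * x ** (a - 1) * (1 - x) ** (b - 1)


def posterior_params(heads, tails):
    """Starting from the uniform prior Be(1, 1), each result adds one to a count."""
    return heads + 1, tails + 1


# One head: Be(2, 1), whose density is 2x.
a, b = posterior_params(heads=1, tails=0)
print(beta_pdf(0.3, a, b))  # approximately 2 * 0.3

# One tail: Be(1, 2), whose density is 2 - 2x.
a, b = posterior_params(heads=0, tails=1)
print(beta_pdf(0.3, a, b))  # approximately 2 - 2 * 0.3

# After 7 heads and 3 tails the posterior is Be(8, 4); its mean is a / (a + b).
a, b = posterior_params(heads=7, tails=3)
posterior_mean = a / (a + b)
print(posterior_mean)
```

No integration happens anywhere in the update itself: the observed counts simply become the parameters of the posterior.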
1. The temperature as recorded by the National Weather Service is not continuous and is, in practice, bounded. (The NWS will only continue existing for some temperature range, and even if a technical error caused the NWS to record a bizarre temperature, they’re limited by how their system stores numbers.)
2. I would probably narrow my prediction down to the height of the water in a graduated cylinder set in a representative location.
3. In case you’re wondering, this sort of thing is fairly easy to create with a two-level quantum system and thus get “genuine” randomness.