Uncertainty

This is part of a se­quence on de­ci­sion anal­y­sis.

De­ci­sion-mak­ing un­der cer­tainty is pretty bor­ing. You know ex­actly what each choice will do, and so you or­der the out­comes based on your prefer­ences, and pick the ac­tion that leads to the best out­come.

Hu­man de­ci­sion-mak­ing, though, is made in the pres­ence of un­cer­tainty. De­ci­sion anal­y­sis—care­ful de­ci­sion mak­ing—is all about cop­ing with the ex­is­tence of un­cer­tainty.

Some ter­minol­ogy: a dis­tinc­tion is some­thing un­cer­tain; an event is each of the pos­si­ble out­comes of that dis­tinc­tion; a prospect is an event that you have a per­sonal stake in, and a deal is a dis­tinc­tion over prospects. This post will fo­cus on dis­tinc­tions and events. If you’re com­fortable with prob­a­bil­ity just jump to the four bolded ques­tions and make sure you get the an­swers right. Deals are the in­ter­est­ing part, but re­quire this back­ground.

I should say from the very start that I am quantifying uncertainty as “probability.” There is only one 800th digit of Pi (in base 10), other people already know it, and it’s not going to change. I don’t know what it is, though, and so when I talk about the probability that the 800th digit of Pi is a particular number what I’m describing is what’s going on in my head. Right now, my map is mostly blank (I assign .1 probability to each digit from 0 to 9); once I look it up, the map will change but the territory will not. I’ll use uncertainty and probability interchangeably throughout this post.

The 800th digit of Pi (in base 10) is a distinction with 10 possible events, 0 through 9. To be sensible, distinctions should be clear and unambiguous. A distinction like “the temperature tomorrow” is unclear: the temperature where, and at what time tomorrow? A distinction like “the maximum temperature recorded by the National Weather Service at the Austin-Bergstrom International Airport in the 24 hours before midnight (EST) on 11/30/2011” is unambiguous. Think of it like PredictionBook: you want to be able to create this distinction such that anyone could come across it and know what you’re referring to.

Possibilities can be discrete or continuous. There are only a finite number of possible digits for the 800th digit of Pi, but the temperature is continuous and unbounded.[1] A biased coin has a continuous parameter p that refers to how likely it is to land on heads in certain conditions; while that’s bounded by 0 and 1, there are an infinite number of possibilities in between.

For now, let’s focus on distinctions with discrete possibilities. Suppose we have four cards: two blue and two red. We shuffle the cards and draw two of them. What is the probability that both drawn cards will be red? (answer below)

This is a simple problem, but one that many people get wrong, so let’s step through it as carefully as possible. There are two distinctions here: the color of the first drawn card, and the color of the second drawn card. For each distinction, the possible events are blue (B) and red (R). The probability that the first card is red we’ll express as P(R|&). That should be read as “probability of drawing a red card given background knowledge.” The “&” refers to all the knowledge the problem has given us; sometimes it’s left off and we just talk about P(R). There are four possible cards, two of which are red, and so P(R|&)=2/4=1/2.

Now we need to figure out the probability that the second card is red. We’ll express that as P(R|R&), which means “the probability of drawing a red card given background knowledge and a drawn red card.” There are three cards left, one of which is red, and so the probability is now 1/3.

But what we’re really interested in is P(RR|&), “the probability of drawing two red cards given background knowledge.” We can divide this single distinction into two distinctions: P(RR|&)=P(R|R&)*P(R|&)=1/3*1/2=1/6. Probabilities are conjoined by multiplication.

Notice that, for the first two cards drawn, there are four events: RR, RB, BR, and BB. Those events have different probabilities: 1/6, 1/3, 1/3, and 1/6. Those represent the joint probability distribution of the first two cards, and the joint probability distribution contains all the information we need. If you’re interested in the chance that the second card is blue with no information about the first (P(*B|&)), you add up RB and BB to get 1/3+1/6=1/2 (which is what you should have expected it to be).

Bayes’ Rule, by the way, is easy to see when discussing events. If I wanted to figure out P(RB|*B&), what I want to do is take the event RB (probability 1/3) and make it more likely by dividing out the probability of my current state of knowledge (that the second card was blue, probability 1/2), which gives (1/3)/(1/2)=2/3. Alternatively, I could consider the event RB as a fraction of the set of events that fit my knowledge, which is both RB and BB: (1/3)/(1/3+1/6)=2/3.
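The joint distribution and the Bayes’ Rule step above can be checked by brute-force enumeration. This is only a sketch: the card labels (R1, R2, B1, B2) and variable names are my own, introduced so the two cards of each color can be told apart while counting.

```python
from fractions import Fraction
from itertools import permutations

# Four distinguishable cards: two red and two blue (labels are illustrative).
cards = ["R1", "R2", "B1", "B2"]

# Every ordered draw of two distinct cards is equally likely (12 in total).
draws = list(permutations(cards, 2))

# Joint distribution over the colors of (first card, second card).
joint = {}
for first, second in draws:
    event = first[0] + second[0]  # e.g. "RB" = red first, blue second
    joint[event] = joint.get(event, Fraction(0)) + Fraction(1, len(draws))

# Marginal P(*B|&): second card is blue, first card unknown.
p_second_blue = joint["RB"] + joint["BB"]

# Bayes' Rule: P(RB|*B&) = P(RB|&) / P(*B|&).
p_rb_given_second_blue = joint["RB"] / p_second_blue

for event in ("RR", "RB", "BR", "BB"):
    print(event, joint[event])
print("P(*B|&) =", p_second_blue)
print("P(RB|*B&) =", p_rb_given_second_blue)
```

Using exact fractions rather than floats keeps the output directly comparable with the numbers in the text.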

Rele­vance

Most people who get the question about cards wrong get it wrong because they square 1/2 to get 1/4, forgetting that the second card depends on the first. Since there’s a limited supply of cards, as soon as you draw one you can be more certain that the next card isn’t that color.
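One way to see where the 1/4 comes from: it is the right answer to a different question, namely drawing with replacement, where the two draws really are independent. The sketch below counts both cases exactly; the helper name p_both_red is hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product

cards = ["R", "R", "B", "B"]
positions = range(len(cards))

def p_both_red(index_pairs):
    """Fraction of equally likely (first, second) index pairs where both cards are red."""
    pairs = list(index_pairs)
    hits = sum(1 for i, j in pairs if cards[i] == "R" and cards[j] == "R")
    return Fraction(hits, len(pairs))

# Without replacement: the second draw cannot reuse the first card.
p_without_replacement = p_both_red(permutations(positions, 2))

# With replacement: the first card goes back, so the draws are independent.
p_with_replacement = p_both_red(product(positions, repeat=2))

print("without replacement:", p_without_replacement)
print("with replacement:", p_with_replacement)
```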

Depen­dence is dis­tinct from causal­ity. If I hear the weath­er­man claim that it will rain with 50% prob­a­bil­ity, that will ad­just my cer­tainty that it will rain, even though the weath­er­man can’t di­rectly in­fluence whether or not it will rain. Some peo­ple use the word rele­vance in­stead, as it’s nat­u­ral to think that the weath­er­man’s pre­dic­tion is rele­vant to the like­li­hood of rain but may not be nat­u­ral to think that the chance of rain de­pends on the weath­er­man’s pre­dic­tion.

Relevance goes both ways. If the weatherman’s prediction gives me knowledge about whether or not it will rain, then knowing whether or not it rained gives me knowledge about what the weatherman’s prediction was. Bayes’ Rule is critical for maneuvering through relevant distinctions. Suppose the weatherman could give only two predictions: Sunny or Rainy. If he predicts Sunny, it will rain with 10% probability. If he predicts Rainy, it will rain with 50% probability. If it rains 20% of the time, how often does he predict Rainy?

Suppose it rains. What’s the chance that the weatherman predicted Rainy? (answer below)

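This is a simple application of Bayes’ Rule: P(Rainy|Rain)=P(Rain|Rainy)P(Rainy)/P(Rain). Here P(Rainy)=.25, since .5*P(Rainy)+.1*(1-P(Rainy)) has to equal the overall .2 chance of rain, and so P(Rainy|Rain)=.5*.25/.2=.625.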

Alter­na­tively, we can figure out the prob­a­bil­ities of the four el­e­men­tary events: P(Rainy,Rain)=.125, P(Rainy,Sun)=.125, P(Sunny,Rain)=.075, P(Sunny,Sun)=.675. If we know it rained and want to know if he pre­dicted Rainy, we care about P(Rainy,Rain)/​(P(Rainy,Rain)+P(Sunny,Rain)).
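Both weatherman questions can be worked through in a few lines. This is a sketch with variable names of my own choosing; the only inputs are the three numbers given in the problem, and the first step solves P(Rain)=P(Rain|Rainy)*q+P(Rain|Sunny)*(1-q) for q=P(Rainy).

```python
from fractions import Fraction

# Numbers given in the problem statement.
p_rain_given_rainy = Fraction(1, 2)   # rain chance when he predicts Rainy
p_rain_given_sunny = Fraction(1, 10)  # rain chance when he predicts Sunny
p_rain = Fraction(1, 5)               # overall chance of rain

# Solve p_rain = p_rain_given_rainy*q + p_rain_given_sunny*(1 - q) for q.
p_rainy = (p_rain - p_rain_given_sunny) / (p_rain_given_rainy - p_rain_given_sunny)
p_sunny = 1 - p_rainy

# The four elementary events: (prediction, weather).
joint = {
    ("Rainy", "Rain"): p_rainy * p_rain_given_rainy,
    ("Rainy", "Sun"): p_rainy * (1 - p_rain_given_rainy),
    ("Sunny", "Rain"): p_sunny * p_rain_given_sunny,
    ("Sunny", "Sun"): p_sunny * (1 - p_rain_given_sunny),
}

# Condition on rain: keep only the events where it rained.
p_rainy_given_rain = joint[("Rainy", "Rain")] / (
    joint[("Rainy", "Rain")] + joint[("Sunny", "Rain")]
)

print("P(Rainy) =", float(p_rainy))
for event, prob in joint.items():
    print(event, float(prob))
print("P(Rainy|Rain) =", float(p_rainy_given_rain))
```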

This can get very com­pli­cated if there are a large num­ber of events or rele­vant dis­tinc­tions, but soft­ware ex­ists to solve that prob­lem.

Con­tin­u­ous Distri­bu­tions

Suppose, though, that you don’t have just two events to assign probability to. Instead of being uncertain about whether or not it will rain, I might be uncertain about how much it will rain, conditioned on it raining.[2] If I try to elicit a probability for every possible amount, that’ll take me a long time (unless I bin the heights, making it discrete, which still might take far longer or be far harder than I can deal with, if there are lots of bins).

In that case, I would express my uncertainty as a probability density function (pdf) or cumulative distribution function (cdf). The first is the probability density at a particular value, whereas the second is the density integrated from the beginning of the domain to that value. To get a probability from a density, you have to integrate. A pdf can have any non-negative value and any shape over the domain, though it has to integrate to 1, while a cdf has a minimum of 0, a maximum of 1, and is non-decreasing.

Let’s take the example of the biased coin. To make it more precise, since coin flips are messy and physical, suppose I have some random number generator that uniformly generates any real number between 0 and 1, and a device hooked up to it with an unknown threshold value p between 0 and 1.[3] When I press a button, the generator generates a random number and hands it to the device, which then shows a picture of heads if the number is below or equal to the threshold and a picture of tails if the number is above the threshold. I don’t get to see the number that was generated, just a head or tail every time I press the button.

I begin by being uncertain about the threshold value, except knowing its domain. I assign a uniform prior: I think it’s equally likely that the threshold value is at every point between 0 and 1. Mathematically, that means my pdf is P(p=x)=1. I can integrate that from 0 to y to get a cdf of C(p≤y)=∫1dx=y. Like we needed, the pdf integrates to 1, the cdf has a minimum of 0 and maximum of 1, and is non-decreasing. From those, we can calculate my certainty that the threshold value is in a particular range (by integrating the pdf over that range) or at any particular point (0, because it’s an integral of 0 width).

Up­dat­ing

Now we press the but­ton, see some­thing, and need to up­date our un­cer­tainty (prob­a­bil­ity dis­tri­bu­tion). How should we do that?

Well, by Bayes’ rule of course! But I’ll do it in a somewhat roundabout way, to give you some more intuition why the rule works. Suppose we saw heads. For each possible threshold value, we know how likely that was: p, the threshold value. We can now compute the probability density of (heads if p) and (p) by multiplying those together, and x times 1 = x. So my pdf is now P(p=x)=x and cdf is C(p≤y)=.5y^2.

Well, not quite. My pdf doesn’t integrate to 1, and my cdf, while it does have a min at 0, doesn’t have a max of 1. I need to renormalize: that is, divide by the chance that I saw heads in the first place. That was 1/2, and so I get P(p=x)=2x and C(p≤y)=y^2 and everything works out. If I saw tails, my likelihood is instead 1-p, and that propagates through to P(p=x)=2-2x and C(p≤y)=2y-y^2.
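The multiply-then-renormalize recipe can be imitated numerically by chopping the interval from 0 to 1 into small bins. This is a rough sketch, not part of the original derivation: the grid size and the function names (update, cdf) are my own choices, and a midpoint grid is only one of several reasonable discretizations.

```python
n = 1000
dx = 1.0 / n
xs = [(i + 0.5) * dx for i in range(n)]  # bin midpoints between 0 and 1

prior = [1.0 for _ in xs]  # uniform prior: density 1 everywhere

def update(pdf, likelihood):
    """Multiply the density by the likelihood, then divide by the total so it integrates to 1."""
    unnormalized = [likelihood(x) * d for x, d in zip(xs, pdf)]
    evidence = sum(unnormalized) * dx  # chance of seeing this result at all
    return [u / evidence for u in unnormalized]

def cdf(pdf, y):
    """Integrate the density from 0 up to y (y should sit on a bin edge)."""
    return sum(d for x, d in zip(xs, pdf) if x <= y) * dx

after_heads = update(prior, lambda x: x)        # likelihood of heads is p
after_tails = update(prior, lambda x: 1.0 - x)  # likelihood of tails is 1-p

print(after_heads[499], 2 * xs[499])   # density near x = 0.5 versus 2x
print(cdf(after_heads, 0.5), 0.5 ** 2)  # cdf at 0.5 versus y^2
print(cdf(after_tails, 0.5), 2 * 0.5 - 0.5 ** 2)
```

Calling update again on after_heads with another likelihood would chain a second observation onto the first.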

Sup­pose my setup were even less helpful. In­stead of show­ing heads or tails, it in­stead gen­er­ates two num­bers, com­putes heads or tails for each num­ber sep­a­rately, and then prints out ei­ther “S” if both re­sults were the same or “D” if the re­sults were differ­ent. If I start with a uniform prior, what will my pdf and cdf on the thresh­old value p be af­ter I see S? If I saw D in­stead? (If you don’t know calcu­lus, don’t worry- most of the rest of this se­quence will deal with dis­crete events.)

I recom­mend giv­ing it a try be­fore check­ing, but the pdf is linked here for S and here for D. (cdfs: S and D)
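For readers who want to check their work, here is one way to get the answers exactly. The setup is my own reading of the problem: two independent flips with the same threshold p give P(S|p)=p^2+(1-p)^2 and P(D|p)=2p(1-p). With a uniform prior, the posterior pdf is just that likelihood divided by its integral from 0 to 1. The sketch stores polynomials as coefficient lists [c0, c1, c2, ...]; the helper names are hypothetical.

```python
from fractions import Fraction

def integrate(coeffs):
    """Antiderivative (zero constant term) of the polynomial c0 + c1*x + c2*x^2 + ..."""
    return [Fraction(0)] + [Fraction(c) / (k + 1) for k, c in enumerate(coeffs)]

def evaluate(coeffs, x):
    return sum(Fraction(c) * Fraction(x) ** k for k, c in enumerate(coeffs))

def posterior(likelihood):
    """Uniform prior on [0, 1] times a polynomial likelihood, renormalized."""
    evidence = evaluate(integrate(likelihood), 1)  # chance of seeing the result
    pdf = [Fraction(c) / evidence for c in likelihood]
    return pdf, integrate(pdf)

like_same = [1, -2, 2]  # p^2 + (1-p)^2 = 1 - 2p + 2p^2
like_diff = [0, 2, -2]  # 2p(1-p) = 2p - 2p^2

pdf_s, cdf_s = posterior(like_same)
pdf_d, cdf_d = posterior(like_diff)

print("S pdf coefficients:", pdf_s)  # 3/2 - 3x + 3x^2
print("S cdf coefficients:", cdf_s)  # (3/2)y - (3/2)y^2 + y^3
print("D pdf coefficients:", pdf_d)  # 6x - 6x^2
print("D cdf coefficients:", cdf_d)  # 3y^2 - 2y^3
```

Under these assumptions, seeing S gives P(p=x)=3x^2-3x+1.5 and C(p≤y)=y^3-1.5y^2+1.5y, while seeing D gives P(p=x)=6x-6x^2 and C(p≤y)=3y^2-2y^3. Both cdfs equal 1 at y=1, as they should.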

Con­ju­gate Priors

That’s a lot of work to do ev­ery time you get in­for­ma­tion, though. If you pick what’s called a con­ju­gate prior, up­dat­ing is sim­ple, whereas it re­quires mul­ti­pli­ca­tion and in­te­gra­tion for an ar­bi­trary prior. The uniform prior is a con­ju­gate prior for the sim­ple bi­ased coin prob­lem, be­cause uniform is a spe­cial case of the beta dis­tri­bu­tion. You can use Be(heads+1,tails+1) as your pos­te­rior prob­a­bil­ity for any num­ber of heads and tails that you see, and the math is already done for you. Con­ju­gate pri­ors are a big part of do­ing con­tin­u­ous Bayesian anal­y­sis in prac­tice, but won’t be too rele­vant to the rest of this se­quence.
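As a small illustration of how little work the conjugate update takes, the sketch below writes out the Beta density and the Be(heads+1,tails+1) rule from the paragraph above. The function names are my own; the density formula and the mean a/(a+b) are the standard ones for the beta distribution.

```python
from math import gamma

def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at x, for 0 <= x <= 1."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

def posterior_params(heads, tails):
    """Starting from a uniform prior, the posterior is Beta(heads+1, tails+1)."""
    return heads + 1, tails + 1

def posterior_mean(heads, tails):
    a, b = posterior_params(heads, tails)
    return a / (a + b)

# No data: Beta(1, 1) is the uniform prior, density 1 everywhere.
print(beta_pdf(0.3, *posterior_params(0, 0)))
# One head: Beta(2, 1) has density 2x, matching the hand calculation.
print(beta_pdf(0.3, *posterior_params(1, 0)))
# One tail: Beta(1, 2) has density 2-2x.
print(beta_pdf(0.3, *posterior_params(0, 1)))
# After 7 heads and 3 tails, the updated estimate of p:
print(posterior_mean(7, 3))
```

Updating on new flips is just a matter of incrementing the two counts, with no integration required.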


1. The tem­per­a­ture as recorded by the Na­tional Weather Ser­vice is not con­tin­u­ous and is, in prac­tice, bounded. (The NWS will only con­tinue ex­ist­ing for some tem­per­a­ture range, and even if a tech­ni­cal er­ror caused the NWS to record a bizarre tem­per­a­ture, they’re limited by how their sys­tem stores num­bers.)

2. I would prob­a­bly nar­row my pre­dic­tion down to the height of the wa­ter in a grad­u­ated cylin­der set in a rep­re­sen­ta­tive lo­ca­tion.

3. In case you’re won­der­ing, this sort of thing is fairly easy to cre­ate with a two-level quan­tum sys­tem and thus get “gen­uine” ran­dom­ness.