# Uncertainty

This is part of a sequence on decision analysis.

Decision-making under certainty is pretty boring. You know exactly what each choice will do, and so you order the outcomes based on your preferences, and pick the action that leads to the best outcome.

Human decision-making, though, happens in the presence of uncertainty. Decision analysis (careful decision making) is all about coping with the existence of uncertainty.

Some terminology: a distinction is something uncertain; an event is each of the possible outcomes of that distinction; a prospect is an event that you have a personal stake in; and a deal is a distinction over prospects. This post will focus on distinctions and events. If you're comfortable with probability, just jump to the four bolded questions and make sure you get the answers right. Deals are the interesting part, but they require this background.

I should say from the very start that I am quantifying uncertainty as "probability." There is only one 800th digit of Pi (in base 10), other people already know it, and it's not going to change. I don't know what it is, though, and so when I talk about the probability that the 800th digit of Pi is a particular number, what I'm describing is what's going on in my head. Right now, my map is mostly blank (I assign probability .1 to each of the digits 0 through 9); once I look it up, the map will change but the territory will not. I'll use uncertainty and probability interchangeably throughout this post.

The 800th digit of Pi (in base 10) is a distinction with 10 possible events, 0 through 9. To be sensible, distinctions should be clear and unambiguous. A distinction like "the temperature tomorrow" is unclear: the temperature where, and at what time tomorrow? A distinction like "the maximum temperature recorded by the National Weather Service at the Austin-Bergstrom International Airport in the 24 hours before midnight (EST) on 11/30/2011" is unambiguous. Think of it like PredictionBook: you want to be able to state the distinction so that anyone could come across it and know what you're referring to.

Possibilities can be discrete or continuous. There are only a finite number of possible digits for the 800th digit of Pi, but the temperature is continuous and unbounded.1 A biased coin has a continuous parameter p that refers to how likely it is to land on heads in certain conditions; while that's bounded by 0 and 1, there are an infinite number of possibilities in between.

For now, let's focus on distinctions with discrete possibilities. Suppose we have four cards: two blue and two red. We shuffle the cards and draw two of them. What is the probability that both drawn cards will be red? (answer below)

This is a simple problem, but one that many people get wrong, so let's step through it as carefully as possible. There are two distinctions here: the color of the first drawn card, and the color of the second drawn card. For each distinction, the possible events are blue (B) and red (R). The probability that the first card is red we'll express as P(R|&). That should be read as "probability of drawing a red card given background knowledge." The "&" refers to all the knowledge the problem has given us; sometimes it's left off and we just talk about P(R). There are four possible cards, two of which are red, and so P(R|&)=2/4=1/2.

Now we need to figure out the probability that the second card is red. We'll express that as P(R|R&), which means "the probability of drawing a red card given background knowledge and a drawn red card." There are three cards left, one of which is red, and so the probability is now 1/3.

But what we're really interested in is P(RR|&), "the probability of drawing two red cards given background knowledge." We can divide this single distinction into two distinctions: P(RR|&)=P(R|&)*P(R|R&)=1/2*1/3=1/6. Probabilities are conjoined by multiplication.

Notice that, for the first two cards drawn, there are four events: RR, RB, BR, and BB. Those events have different probabilities: 1/6, 1/3, 1/3, and 1/6. Those represent the joint probability distribution of the first two cards, and the joint probability distribution contains all the information we need. If you're interested in the chance that the second card is blue with no information about the first (P(*B|&)), you add up RB and BB to get 1/3+1/6=1/2 (which is what you should have expected it to be).
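The joint distribution above can be checked by brute-force enumeration. Here's a minimal sketch in Python (my illustration, not part of the original post), treating the four cards as distinguishable and counting every equally likely ordering:

```python
from itertools import permutations
from fractions import Fraction
from collections import Counter

# Four distinguishable cards: two red, two blue.
deck = ['R1', 'R2', 'B1', 'B2']
orders = list(permutations(deck))  # 24 equally likely orderings

# Tally the colors of the first two draws.
counts = Counter((a[0], b[0]) for a, b, _, _ in orders)
joint = {pair: Fraction(n, len(orders)) for pair, n in counts.items()}

print(joint[('R', 'R')])                       # 1/6
print(joint[('R', 'B')])                       # 1/3
print(joint[('R', 'B')] + joint[('B', 'B')])   # 1/2, P(second card blue)
```

Exact fractions make it easy to see that the four probabilities sum to 1.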

Bayes' Rule, by the way, is easy to see when discussing events. If I wanted to figure out P(RB|*B&), what I want to do is take the event RB (probability 1/3) and make it more likely by dividing out the probability of my current state of knowledge (that the second card was blue, probability 1/2). Alternatively, I could consider the event RB as a fraction of the set of events that fit my knowledge, which are RB and BB: (1/3)/(1/3+1/6)=2/3.
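Bayes-as-renormalization can be sketched the same way (again my own illustration): keep only the events consistent with the evidence and rescale so they sum to 1.

```python
from fractions import Fraction

# Joint distribution of the first two card colors.
joint = {'RR': Fraction(1, 6), 'RB': Fraction(1, 3),
         'BR': Fraction(1, 3), 'BB': Fraction(1, 6)}

# Evidence: the second card is blue.  Consistent events: RB and BB.
consistent = {e: p for e, p in joint.items() if e[1] == 'B'}
total = sum(consistent.values())          # P(*B|&) = 1/2

# Renormalize: divide each surviving event by the evidence probability.
posterior = {e: p / total for e, p in consistent.items()}
print(posterior['RB'])   # 2/3
```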

## Relevance

Most people who get the question about cards wrong get it wrong because they square 1/2 to get 1/4, forgetting that the second card depends on the first. Since there's a limited supply of cards, as soon as you draw one you can be more certain that the next card isn't that color.

Dependence is distinct from causality. If I hear the weatherman claim that it will rain with 50% probability, that will adjust my certainty that it will rain, even though the weatherman can't directly influence whether or not it will rain. Some people use the word relevance instead, as it's natural to think that the weatherman's prediction is relevant to the likelihood of rain but may not be natural to think that the chance of rain depends on the weatherman's prediction.

Relevance goes both ways. If the weatherman's prediction gives me knowledge about whether or not it will rain, then knowing whether or not it rained gives me knowledge about what the weatherman's prediction was. Bayes' Rule is critical for maneuvering through relevant distinctions. Suppose the weatherman could give only two predictions: Sunny or Rainy. If he predicts Sunny, it will rain with 10% probability. If he predicts Rainy, it will rain with 50% probability. If it rains 20% of the time, how often does he predict Rainy? (answer)

Suppose it rains. What's the chance that the weatherman predicted Rainy? (answer below)

This is a simple application of Bayes' Rule: P(Rainy|Rain)=P(Rain|Rainy)P(Rainy)/P(Rain).

Alternatively, we can figure out the probabilities of the four elementary events: P(Rainy,Rain)=.125, P(Rainy,Sun)=.125, P(Sunny,Rain)=.075, P(Sunny,Sun)=.675. If we know it rained and want to know whether he predicted Rainy, we care about P(Rainy,Rain)/(P(Rainy,Rain)+P(Sunny,Rain)).
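The elementary-event route can be ground through mechanically. A quick sketch (my code, using exact fractions instead of the decimals above):

```python
from fractions import Fraction

# From the setup: P(Rain|Rainy) = 1/2, P(Rain|Sunny) = 1/10, P(Rain) = 1/5.
# Solving 1/5 = (1/2)q + (1/10)(1-q) gives q = P(Rainy) = 1/4.
p_rainy = Fraction(1, 4)
p_rain_given_rainy = Fraction(1, 2)
p_rain_given_sunny = Fraction(1, 10)

# The four elementary events.
joint = {
    ('Rainy', 'Rain'): p_rainy * p_rain_given_rainy,              # 1/8  = .125
    ('Rainy', 'Sun'):  p_rainy * (1 - p_rain_given_rainy),        # 1/8  = .125
    ('Sunny', 'Rain'): (1 - p_rainy) * p_rain_given_sunny,        # 3/40 = .075
    ('Sunny', 'Sun'):  (1 - p_rainy) * (1 - p_rain_given_sunny),  # 27/40 = .675
}

p_rain = joint[('Rainy', 'Rain')] + joint[('Sunny', 'Rain')]      # 1/5
print(joint[('Rainy', 'Rain')] / p_rain)   # 5/8
```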

This can get very complicated if there are a large number of events or relevant distinctions, but software exists to solve that problem.

## Continuous Distributions

Suppose, though, that you don't have just two events to assign probability to. Instead of being uncertain about whether or not it will rain, I might be uncertain about how much it will rain, conditioned on it raining.2 If I try to elicit a probability for every possible amount, that'll take me a long time (unless I bin the heights, making it discrete, which still might take far longer or be far harder than I can deal with, if there are lots of bins).

In that case, I would express my uncertainty as a probability density function (pdf) or cumulative distribution function (cdf). The first is the probability density at a particular value, whereas the second is the density integrated from the beginning of the domain to that value. To get a probability from a density, you have to integrate. A pdf can have any non-negative value and any shape over the domain, though it has to integrate to 1, while a cdf has a minimum of 0, a maximum of 1, and is non-decreasing.

Let's take the example of the biased coin. To make it more precise, since coin flips are messy and physical, suppose I have some random number generator that uniformly generates any real number between 0 and 1, and a device hooked up to it with an unknown threshold value p between 0 and 1.3 When I press a button, the generator generates a random number and hands it to the device, which then shows a picture of heads if the number is below or equal to the threshold and a picture of tails if the number is above the threshold. I don't get to see the number that was generated, just a head or tail every time I press the button.

I begin by being uncertain about the threshold value, except for knowing its domain. I assign a uniform prior: I think the threshold value is equally likely to be at any point between 0 and 1. Mathematically, that means my pdf is P(p=x)=1. I can integrate that from 0 to y to get a cdf of C(p≤y)=∫1dx=y. Like we needed, the pdf integrates to 1, and the cdf has a minimum of 0 and a maximum of 1 and is non-decreasing. From those, we can calculate my certainty that the threshold value is in a particular range (by integrating the pdf over that range) or at any particular point (0, because it's an integral of zero width).
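Those facts about the uniform prior can be sanity-checked numerically. A minimal grid sketch (my own illustration, not from the post):

```python
# Discretize [0, 1] into N bins; the uniform pdf is f(x) = 1 on every bin.
N = 1000
dx = 1.0 / N
pdf = [1.0] * N

# The pdf integrates to 1 over the whole domain...
total = sum(density * dx for density in pdf)

# ...and the cdf C(p <= y) = y is the pdf integrated from 0 up to y.
def cdf(y):
    return sum(density * dx for density in pdf[:int(y * N)])

print(round(total, 6))      # 1.0
print(round(cdf(0.25), 6))  # 0.25
```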

## Updating

Now we press the button, see something, and need to update our uncertainty (probability distribution). How should we do that?

Well, by Bayes' rule of course! But I'll do it in a somewhat roundabout way, to give you some more intuition for why the rule works. Suppose we saw heads. For each possible threshold value, we know how likely that was: p, the threshold value. We can now compute the probability density of (heads if p) and (p) by multiplying those together, and x times 1 = x. So my pdf is now P(p=x)=x and my cdf is C(p≤y)=y^2/2.

Well, not quite. My pdf doesn't integrate to 1, and my cdf, while it does have a min at 0, doesn't have a max of 1. I need to renormalize; that is, divide by the chance that I saw heads in the first place. That was 1/2, and so I get P(p=x)=2x and C(p≤y)=y^2 and everything works out. If I saw tails, my likelihood is instead 1-p, and that propagates through to P(p=x)=2-2x and C(p≤y)=2y-y^2.
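The multiply-then-renormalize step can be mirrored on a grid. A sketch of the heads update (assumed code, not the author's):

```python
# Grid approximation of the update after seeing one heads.
N = 10000
xs = [(i + 0.5) / N for i in range(N)]        # bin midpoints on [0, 1]
prior = [1.0] * N                             # uniform pdf: P(p=x) = 1
likelihood = xs                               # P(heads | p=x) = x

# Multiply prior by likelihood, then divide by P(heads) to renormalize.
unnorm = [lik * f for lik, f in zip(likelihood, prior)]
p_heads = sum(unnorm) / N                     # normalizing constant, = 1/2
posterior = [u / p_heads for u in unnorm]

# Matches the closed form P(p=x | heads) = 2x at every grid point.
print(all(abs(post - 2 * x) < 1e-9 for post, x in zip(posterior, xs)))  # True
```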

Suppose my setup were even less helpful. Instead of showing heads or tails, it instead generates two numbers, computes heads or tails for each number separately, and then prints out either "S" if both results were the same or "D" if the results were different. If I start with a uniform prior, what will my pdf and cdf on the threshold value p be after I see S? If I saw D instead? (If you don't know calculus, don't worry; most of the rest of this sequence will deal with discrete events.)

I recommend giving it a try before checking, but the pdf is linked here for S and here for D. (cdfs: S and D)

## Conjugate Priors

That's a lot of work to do every time you get information, though. If you pick what's called a conjugate prior, updating is simple, whereas it requires multiplication and integration for an arbitrary prior. The uniform prior is a conjugate prior for the simple biased coin problem, because the uniform distribution is a special case of the beta distribution. You can use Be(heads+1, tails+1) as your posterior distribution for any number of heads and tails that you see, and the math is already done for you. Conjugate priors are a big part of doing continuous Bayesian analysis in practice, but won't be too relevant to the rest of this sequence.
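For integer counts, the Beta posterior's normalizing constant has a closed form, which makes "the math is already done" concrete. A sketch (my illustration):

```python
from math import comb

def beta_posterior_pdf(x, heads, tails):
    """pdf of Be(heads+1, tails+1): the posterior on the threshold p after
    starting from a uniform prior and seeing the given counts.  For integer
    counts the normalizing constant is (h+t+1)!/(h! t!) = (h+t+1)*C(h+t, h)."""
    norm = (heads + tails + 1) * comb(heads + tails, heads)
    return norm * x**heads * (1 - x)**tails

print(beta_posterior_pdf(0.5, 1, 0))  # 1.0, matching the 2x pdf derived above
print(beta_posterior_pdf(0.5, 2, 0))  # 0.75, matching 3x^2
```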

1. The temperature as recorded by the National Weather Service is not continuous and is, in practice, bounded. (The NWS will only continue existing for some temperature range, and even if a technical error caused the NWS to record a bizarre temperature, they're limited by how their system stores numbers.)

2. I would probably narrow my prediction down to the height of the water in a graduated cylinder set in a representative location.

3. In case you're wondering, this sort of thing is fairly easy to create with a two-level quantum system and thus get "genuine" randomness.

• I'm having difficulties with your terminology. You've given special meanings to "distinction", "prospect", and "deal" that IMO don't bear any obvious relationship to their common usage ("event" makes more sense). Hence, I don't find those terms helpful in evoking the intended concepts. Seeing "A deal is a distinction over prospects" is roughly as useful to me as seeing "A flim is a fnord over grungas". In both cases, I have to keep a cheat-sheet handy to understand what you mean, since I can't rely on an association between word and concept that I've already internalized. Maybe this is accepted terminology that I'm not aware of?

• I'm having difficulties with your terminology.

I'm not sure yet how much the terminology will pop up in future articles (one of the pitfalls of posting them as you go). I don't think it will matter much, but if future posts are unclear, point out where the language is problematic and I'll try to make things clearer.

• While the probabilistic reasoning employed in the card question is correct and fits in with your overall point, it's rather labor-intensive to actually think through.

In order to get two red cards, you need to pick the right pair of cards. Only one pair will do. There are six ways to pick a pair of cards out of a group of 4 (when, as here, order doesn't matter). Therefore, the odds are 1/6, as one out of the six possible pairs you'll pick will be the correct pair.

Similarly, we know the weatherperson correctly predicts 12.5% of days that will be rainy. We know that 20% of days will actually be raining. That gives us "12.5/20 = 5/8" pretty quickly. Grinding our way through all the P(X|&) representation makes a simple and intuitive calculation look really intimidating.

I'm not entirely sure of your purpose in this sequence, but it seems to be to improve people's probabilistic reasoning. Explaining probabilities through this long and detailed method seems guaranteed to fail. People who are perfectly comfortable with such complex explanations generally already get their application. People who are not so comfortable throw up their hands and stick with their gut. I suspect that a large part of the explanation of mathematical illiteracy is that people aren't actually taught how to apply mathematics in any practical sense; they're given a logically rigorous and formal proof in unnecessary detail which is too complex to use in informal reasoning.

• Speaking only for myself, I'm in that awkward middle stage—I understand probability well enough to solve toy problems, and to follow explanations of it in real problems, but not enough to be confident in my own probabilistic interpretation of new problem domains. I'm looking forward to this sequence as part of my education and definitely appreciate seeing the formality behind the applications.

• I'm glad this is intuitive for you!

The reason I spotlighted labor-intensive methods is because this post is targeted at people who don't find this intuitive. I'd rather give them a method that can be extended to other situations with low risk (applying Bayes' Rule, imagining the world after receiving an update and calculating new probabilities) rather than identifying symmetries in the problems and using those to quickly get answers.

The rest of the sequence uses this as background, but probability calculations play a secondary role. The techniques I'll discuss require a moderate level of comfort with probabilities, but not with probabilistic calculations; those can (and probably should) be offloaded to a calculator. The challenge is setting up the right problem, not solving a problem once you've set it up.

• Can you elaborate on the calculation for S? I think it should be this, but I'm not confident in my math.

• Yours was correct; editing the post. I skipped a step and that made my previous answer wrong.

• Maybe I'm missing something obvious here, but I'm unsure how to calculate P(S). I'd appreciate it if someone could post an explanation.

• Sure. S results from HH or from TT, so we'll calculate those independently and add them together at the end. We'll do that by this equation: P(p=x|S) = P(p=x|HH) P(HH|S) + P(p=x|TT) P(TT|S).

We start out with a uniform prior: P(p=x) = 1. After observing one H, by Bayes' rule, P(p=x|H) = P(H|p=x) P(p=x) / P(H). P(H|p=x) is just x. Our prior is 1. P(H) is our prior, multiplied by x, integrated from 0 to 1. That's 1/2. So P(p=x|H) = x*1/(1/2) = 2x.

Apply the same process again for the second H. Bayes' rule: P(p=x|HH) = P(H|p=x,H) P(p=x|H) / P(H|H). The first term is still just x. The second term is our updated belief, 2x. The denominator is our updated belief, multiplied by x, integrated from 0 to 1. That's 2/3 this time. So P(p=x|HH) = x*2x/(2/3) = 3x^2.

Calculating tails is similar, except we update with 1-x instead of x. So our belief goes from 1, to 2-2x, to 3x^2-6x+3. Then substitute both of these into the original equation: (3/2)(x^2) + (3/2)(x^2 - 2x + 1). From there it's just a bit of algebra to get it into the form I linked to.
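That final form can be double-checked numerically; a small sketch verifying it is a proper pdf:

```python
# Midpoint-rule check that the S posterior (3/2)x^2 + (3/2)(1-x)^2
# integrates to 1 over [0, 1], as a pdf must.
N = 100_000

def s_pdf(x):
    return 1.5 * x**2 + 1.5 * (1 - x)**2

total = sum(s_pdf((i + 0.5) / N) for i in range(N)) / N
print(round(total, 6))  # 1.0
```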

• I am really happy to see more formal Bayes on LW. Ditto for decision analysis. They get talked about frequently, but I don't usually see much math being used. That said, I was slightly confused: specifically, it's pretty clear what the cdf and pdf are in terms of how they are derived from the probability density. However, it's not quite clear what you mean by probability density. Am I overlooking/misunderstanding an explanation, or are we assumed to already know what it is?

• A probability density is just like any other kind of density; it's the amount of probability per unit volume. (In one dimension, the 'volume' equivalent is length.) You need it when you have a continuous belief space but not when you have a discrete belief space. If you're doing billiard ball physics with point masses, you don't need mass densities; likewise, if you're comparing billiard ball beliefs rather than real ones (the weatherman doesn't say "Rainy" or "Sunny" but expresses a percentage), you don't need probability densities.

• Ok, that makes sense.

• The Wolfram Alpha links in the article and previous comments seem to be broken in that the parentheses in the mathematical expression are missing, meaning that the links present readers with the wrong answer. It was rather confusing for me for a bit. You might want to update the links to something like this:

```
S pdf: http://www.wolframalpha.com/input/?i=integrate+3%2F2+*+%281+-+2*x+%2B+2*x^2%29+++from+0+to+1
D pdf: http://www.wolframalpha.com/input/?i=integrate+6+*+%28x+-+x^2%29+++from+0+to+1
```
• Also, the page says that there are two comments but it is only loading one. Before I posted, it similarly said that there was one comment but displayed no comments. Does anyone know what's going on?

• I have no clue, but I'm also seeing it as counting a phantom comment.