# Probability, knowledge, and meta-probability

This article is the first in a sequence that will consider situations where probability estimates are not, by themselves, adequate to make rational decisions. This one introduces a “meta-probability” approach, borrowed from E. T. Jaynes, and uses it to analyze a gambling problem. This situation is one in which reasonably straightforward decision-theoretic methods suffice. Later articles introduce increasingly problematic cases.

## A sur­pris­ing de­ci­sion anomaly

Let’s say I’ve recruited you as a subject in my thought experiment. I show you three cubical plastic boxes, about eight inches on a side. There are two green ones—identical as far as you can see—and a brown one. I explain that they are gambling machines: each has a faceplate with a slot that accepts a dollar coin, and an output slot that will return either two or zero dollars.

I unscrew the faceplates to show you the mechanisms inside. They are quite simple. When you put a coin in, a wheel spins. It has a hundred holes around the rim. Each can be blocked, or not, with a teeny rubber plug. When the wheel slows to a halt, a sensor checks the nearest hole, and dispenses either zero or two coins.

The brown box has 45 holes open, so it has probability p=0.45 of returning two coins. One green box has 90 holes open (p=0.9) and the other has none (p=0). I let you experiment with the boxes until you are satisfied these probabilities are accurate (or very nearly so).

Then, I screw the faceplates back on, and put all the boxes in a black cloth sack with an elastic closure. I squidge the sack around, to mix up the boxes inside, and you reach in and pull one out at random.

I give you a hundred one-dollar coins. You can put as many into the box as you like. You can keep as many coins as you don’t gamble, plus whatever comes out of the box.

If you pulled out the brown box, there’s a 45% chance of getting \$2 back, and the expected value of putting a dollar in is \$0.90. Rationally, you should keep the hundred coins I gave you, and not gamble.

If you pulled out a green box, there’s a 50% chance that it’s the one that pays two dollars 90% of the time, and a 50% chance that it’s the one that never pays out. So, overall, there’s a 45% chance of getting \$2 back.

Still, rationally, you should put some coins in the box. If it pays out at least once, you should gamble all the coins I gave you, because you know that you got the 90% box, and you’ll nearly double your money.

If you get nothing out after a few tries, you’ve probably got the never-pay box, and you should hold onto the rest of your money. (Exercise for readers: how many no-payouts in a row should you accept before quitting?)
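One way to attack the exercise (my own back-of-envelope analysis, not from the article: it assumes that after each failure you decide myopically whether to risk one more probe coin, quitting on a further failure and going all-in after any success):

```python
# After k consecutive failures on a green box, the posterior that it is the
# 90% box is p_k = 0.1**k / (0.1**k + 1), starting from a 50/50 prior.
# Probing one more coin costs $1, returns $2 with probability 0.9 * p_k,
# and, if it pays, reveals the good box, so every remaining coin is then
# worth an expected gain of $0.80.

def posterior_good(k):
    """P(90% box | k consecutive failures), from a 50/50 prior."""
    return 0.1**k / (0.1**k + 1.0)

def probe_value(k, coins_left):
    """Net expected value of inserting one more coin after k failures,
    assuming you go all-in on a success and quit on a failure."""
    p_pay = 0.9 * posterior_good(k)
    return -1.0 + 2.0 * p_pay + p_pay * (coins_left - 1) * 0.8

# Find the first k at which another probe has negative expected value.
k = 0
while probe_value(k, 100 - k) > 0:
    k += 1
print(f"quit after {k} consecutive failures")
```

Note how the information value term (`p_pay * (coins_left - 1) * 0.8`) dominates the immediate expected loss on the early probes; that is exactly why gambling the first coin is rational even though its direct expected return is only \$0.90.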

What’s interesting is that, when you have to decide whether or not to gamble your first coin, the probability is exactly the same in the two cases (p=0.45 of a \$2 payout). However, the rational course of action is different. What’s up with that?

Here, a single probability value fails to capture everything you know about an uncertain event. And, it’s a case in which that failure matters.

Such limitations have been recognized almost since the beginning of probability theory. Dozens of solutions have been proposed. In the rest of this article, I’ll explore one. In subsequent articles, I’ll look at the problem more generally.

## Meta-probability

To think about the green box, we have to reason about the probabilities of probabilities. We could call this meta-probability, although that’s not a standard term. Let’s develop a method for it.

Pull a penny out of your pocket. If you flip it, what’s the probability it will come up heads? 0.5. Are you sure? Pretty darn sure.

What’s the probability that my local junior high school sportsball team will win its next game? I haven’t a ghost of a clue. I don’t know anything even about professional sportsball, and certainly nothing about “my” team. In a match between two teams, I’d have to say the probability is 0.5.

My girlfriend asked me today: “Do you think Raley’s will have dolmades?” Raley’s is our local supermarket. “I don’t know,” I said. “I guess it’s about 50/50.” But unlike sportsball, I know something about supermarkets. A fancy Whole Foods is very likely to have dolmades; a 7-11 almost certainly won’t; Raley’s is somewhere in between.

How can we model these three cases? One way is by assigning probabilities to each possible probability between 0 and 1. In the case of a coin flip, 0.5 is much more probable than any other probability:

We can’t be absolutely sure the probability is 0.5. In fact, it’s almost certainly not exactly that, because coins aren’t perfectly symmetrical. And, there’s a very small probability that you’ve been given a trick penny that comes up tails only 10% of the time. So I’ve illustrated this with a tight Gaussian centered around 0.5.

In the sportsball case, I have no clue what the odds are. They might be anything from 0 to 1:

In the Raley’s case, I have some knowledge, and extremely high and extremely low probabilities seem unlikely. So the curve looks something like this:

Each of these curves averages to a probability of 0.5, but they express different degrees of confidence in that probability.
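One concrete way to realize these three curves (my parameter choices; the article’s plots are qualitative) is as beta distributions that all share the mean 0.5 but differ in concentration:

```python
# Discretized Beta(a, b) densities over (0, 1). Beta(a, a) has mean 0.5 for
# any a; a larger a means a tighter peak around 0.5.

def beta_pdf_grid(a, b, n=1001):
    """Beta(a, b) density evaluated on a midpoint grid, normalized to sum to 1."""
    grid = [(i + 0.5) / n for i in range(n)]
    w = [p ** (a - 1) * (1 - p) ** (b - 1) for p in grid]
    total = sum(w)
    return grid, [x / total for x in w]

def mean(grid, weights):
    return sum(p * w for p, w in zip(grid, weights))

curves = {
    "coin (tight around 0.5)":  beta_pdf_grid(50, 50),
    "sportsball (no clue)":     beta_pdf_grid(1, 1),
    "Raley's (some knowledge)": beta_pdf_grid(4, 4),
}
for name, (grid, w) in curves.items():
    print(f"{name}: mean = {mean(grid, w):.3f}")
```

All three means come out to 0.5; only the spread of the curve distinguishes the penny from sportsball from Raley’s.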

Now let’s consider the gambling machines in my thought experiment. The brown box has a curve like this:

Whereas, when you’ve chosen one of the two green boxes at random, the curve looks like this:

Both these curves give an average probability of 0.45. However, a rational decision theory has to distinguish between them. Your optimal strategy in the two cases is quite different.
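These two curves can be written down exactly as discrete meta-distributions, and a quick check confirms they share a mean but differ wildly in spread:

```python
# Brown box: all weight on p = 0.45. Green box: half on p = 0, half on p = 0.9.
brown = {0.45: 1.0}
green = {0.0: 0.5, 0.9: 0.5}

def mean(dist):
    return sum(w * p for p, w in dist.items())

def variance(dist):
    m = mean(dist)
    return sum(w * (p - m) ** 2 for p, w in dist.items())

print(f"means:     {mean(brown):.4f} {mean(green):.4f}")      # both 0.4500
print(f"variances: {variance(brown):.4f} {variance(green):.4f}")  # 0.0000 vs 0.2025
```

The variance (or any other measure of spread) is what the single number 0.45 throws away, and it is exactly the information the optimal strategy depends on.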

With this framework, we can consider another box—a blue one. It has a fixed payout probability somewhere between 0 and 0.9. I put a random number of plugs in the holes in the spinning disk—leaving between 0 and 90 holes open. I used a noise diode to choose; but you don’t get to see what the odds are. Here the probability-of-probability curve looks rather like this:

This isn’t quite right, because 0.23 and 0.24 are much more likely than 0.235—the plot should look like a comb—but for strategy choice the difference doesn’t matter.

What is your optimal strategy in this case?

As with the green box, you ought to spend some coins gathering information about what the odds are. If your estimate of the probability is less than 0.5, when you get confident enough in that estimate, you should stop. If you’re confident enough that it’s more than 0.5, you should continue gambling.
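Here is a sketch (not the exact optimal algorithm, and the run of outcomes is hypothetical) of the bookkeeping that strategy needs: a uniform prior over the 91 possible machines, updated by Bayes’ rule after each coin:

```python
# Blue box: between 0 and 90 holes open, each count equally likely a priori.

def update(posterior, paid):
    """Bayes update over open-hole counts given one payout (True) or miss (False)."""
    new = {}
    for holes, w in posterior.items():
        p = holes / 100.0
        new[holes] = w * (p if paid else 1.0 - p)
    total = sum(new.values())
    return {h: w / total for h, w in new.items()}

posterior = {h: 1.0 / 91 for h in range(91)}   # uniform prior over machines
for paid in [True, False, True, True]:          # a hypothetical run of outcomes
    posterior = update(posterior, paid)

est = sum(h / 100.0 * w for h, w in posterior.items())
print(f"estimated payout probability: {est:.3f}")
# keep gambling while you are confident the estimate exceeds 0.5; stop otherwise
```

Note that a single success eliminates the zero-holes machine entirely—the posterior, unlike a lone probability estimate, records which machines remain possible.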

If you enjoy this sort of thing, you might like to work out what the exact optimal algorithm is.

In the next article in this sequence, we’ll look at some more complicated and interesting cases.

The “meta-probability” approach I’ve taken here is the A_p distribution of E. T. Jaynes. I find it highly intuitive, but it seems to have had almost no influence or application in practice. We’ll see later that it has some problems, which might explain this.

The green and blue boxes are related to “multi-armed bandit problems.” A “one-armed bandit” is a casino slot machine, which has defined odds of payout. A multi-armed bandit is a hypothetical generalization with several arms, each of which may have different, unknown odds. In general, you ought to pull each arm several times, to gain information. The question is: what is the optimal algorithm for deciding which arms to pull, and how many times, given the payments you have received so far?

If you read the Wikipedia article and follow some links, you’ll find the concepts you need to find the optimal green and blue box strategies. But it might be more fun to try on your own first! The green box is simple. The blue box is harder, but the same general approach applies.

Wikipedia also has an accidental list of formal approaches for problems where ordinary probability theory fails. This is far from complete, but a good starting point for a browser tab explosion.

## Acknowledgements

Thanks to Rin’dzin Pamo, St. Rev., Matt_Simpson, Kaj_Sotala, and Vaniver for helpful comments on drafts. Of course, they may disagree with my analyses, and aren’t responsible for my mistakes!

• Ordinary probability theory and expected utility are sufficient to handle this puzzle. You just have to calculate the expected utility of each strategy before choosing a strategy. In this puzzle a strategy is more complicated than simply putting some number of coins in the machine: it requires deciding what to do after each coin either succeeds or fails to succeed in releasing two coins.

In other words, a strategy is a choice of what you’ll do at each point in the game tree—just like a strategy in chess.

We don’t expect to do well at chess if we decide on a course of action that ignores our opponent’s moves. Similarly, we shouldn’t expect to do well in this probabilistic game if we only consider strategies that ignore what the machine does. If we consider all strategies, compute their expected utility based on the information we have, and choose the one that maximizes this, we’ll do fine.

I’m saying essentially the same thing Jeremy Salwen said.

• So, let me try again to explain why I think this is missing the point… I wrote “a single probability value fails to capture everything you know about an uncertain event.” Maybe “simple” would have been better than “single”?

The point is that you can’t solve this problem without somehow reasoning about probabilities of probabilities. You can solve it by reasoning about the expected value of different strategies. (I said so in the OP; I constructed the example to make this the obviously correct approach.) But those strategies contain reasoning about probabilities within them. So the “outer” probabilities (about strategies) are meta-probabilistic.

[Added:] Evidently, my OP was unclear and failed to communicate, since several people missed the same point in the same way. I’ll think about how to revise it to make it clearer.

• The exposition of meta-probability is well done, and shows an interesting way of examining and evaluating scenarios. However, I would take issue with the first section of this article, in which you establish single probability (expected utility) calculations as insufficient for the problem, and present meta-probability as the solution.

In particular, you say

What’s interesting is that, when you have to decide whether or not to gamble your first coin, the probability is exactly the same in the two cases (p=0.45 of a \$2 payout). However, the rational course of action is different. What’s up with that?

Here, a single probability value fails to capture everything you know about an uncertain event. And, it’s a case in which that failure matters.

I do not believe that this is a failure of applying a single probability to the situation, but merely a case of calculating the probability wrongly, by ignoring the future effects of your choice. I think this is most clearly illustrated by scaling the problem down to the case where you are handed a green box and only two coins. In this simplified problem, we can clearly examine all possible strategies.

• Strategy 1 would be to hold on to your two dollar coins. There is a 100% chance of a \$2.00 payout.

• Strategy 2 would be to insert both of your coins into the box. There is a 50.5% chance of a \$0.00 payout, a 40.5% chance of a \$4.00 payout, and a 9% chance of a \$2.00 payout.

• Strategy 3 would be to insert one coin, and then insert the second only if the first pays out. There is a 55% chance of a \$1.00 payout, a 4.5% chance of a \$2.00 payout, and a 40.5% chance of a \$4.00 payout.

• Strategy 4 would be to insert one coin, and then insert the second only if the first doesn’t pay out. There is a 50.5% chance of a \$0.00 payout, a 4.5% chance of a \$2.00 payout, and a 45% chance of a \$3.00 payout.

When put in these terms, it seems quite obvious that your choice to open the box would depend on more than the expected payoff from only the first box, because quite clearly your choice to open the first box pays off (or doesn’t pay off) when opening (or not opening) the other boxes as well. This seems like an error in calculating the payoff matrix rather than a flaw with the technique of single probability values itself. It ignores the fact that opening the first box not only pays you off immediately, but also pays you off in the future by giving you information about the other boxes.

This problem easily succumbs to standard expected value calculations if all actions are considered. The steps remain the same as always:

1. Assign a utility to each dollar-amount outcome

2. Calculate the expected utility of all possible strategies

3. Choose the strategy with the highest expected utility

In the case of two coins, we were able to trivially calculate the outcomes of all possible strategies, but in larger instances of the problem, it might be advisable to use shortcuts in the calculations. However, it remains true that the best choice will be the one you would have gotten from the full expected value calculation.
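The four strategies above can be checked by brute force over the game tree (a sketch: utility is taken as linear in dollars, and a `policy` function decides from the current posterior over boxes and the number of coins left):

```python
# Green box: equally likely to be the p=0 or the p=0.9 machine.

def pmean(dist):
    """Predictive payout probability: expectation of p under the posterior."""
    return sum(w * p for p, w in dist.items())

def bayes(dist, paid):
    """Posterior over boxes after one payout (True) or miss (False)."""
    new = {p: w * (p if paid else 1 - p) for p, w in dist.items()}
    z = sum(new.values())
    return {p: w / z for p, w in new.items()}

def ev(coins_left, cash, posterior, policy):
    """Expected final wealth (kept coins count as cash) under `policy`."""
    if coins_left == 0 or policy(posterior, coins_left) == "stop":
        return cash + coins_left
    p_pay = pmean(posterior)
    win = ev(coins_left - 1, cash + 2, bayes(posterior, True), policy) if p_pay > 0 else 0.0
    lose = ev(coins_left - 1, cash, bayes(posterior, False), policy) if p_pay < 1 else 0.0
    return p_pay * win + (1 - p_pay) * lose

GREEN = {0.0: 0.5, 0.9: 0.5}
s1 = lambda post, n: "stop"    # Strategy 1: hold both coins
s2 = lambda post, n: "insert"  # Strategy 2: insert both
s3 = lambda post, n: "insert" if n == 2 or pmean(post) > 0.5 else "stop"  # continue on success
s4 = lambda post, n: "insert" if n == 2 or pmean(post) < 0.5 else "stop"  # continue on failure

for name, s in [("hold both", s1), ("insert both", s2),
                ("continue on success", s3), ("continue on failure", s4)]:
    print(f"{name}: EV = ${ev(2, 0.0, GREEN, s):.2f}")
```

This reproduces the payout tables above: expected values of \$2.00, \$1.80, \$2.26, and \$1.44 respectively, so Strategy 3 is best.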

I think the confusion arises because a lot of the time problems are presented in a way that screens them off from the rest of the world. For example, you are given a box, and it either has \$10.00 or \$100.00 in it. Once you open the box, the only effect it has on you is the amount of money you got. After you get the money, the box does not matter to the rest of the world. Problems are presented this way so that it is easy to factor out the decisions and calculations you have to make from every other decision you have to make. However, decisions are not necessarily this way (in fact, in real life, very few decisions are). In the choice of inserting the first coin or not, this is simply not the case, despite superficial similarities to standard “box” problems.

Although you clearly understand that the payoffs from the boxes are entangled, you only apply this knowledge in your informal approach to the problem. The failure to consider the full effects of your actions in opening the first box may be psychologically encouraged by the technique of “single probability calculations”, but it is certainly not a failure of the technique itself to capture such situations.

• The substantive point here isn’t about EU calculations per se. Running a full analysis of everything that might happen and doing an EU calculation on that basis is fine, and I don’t think the OP disputes this.

The subtlety is about what numerical data can formally represent your full state of knowledge. The claim is that a mere probability of getting the \$2 payout does not. It’s the case that on the first use of a box, the probability of the payout given its colour is 0.45 regardless of the colour.

However, if you merely hold onto that probability, then when you put in a coin and so learn something about the boxes, you can’t update that probability to figure out what the probability of payout for the second attempt is. You need to go back and also remember whether the box is green or brown. The point of Jaynes and the A_p distribution is that it actually does screen off all other information. If you keep track of it, you never need to worry about remembering the colour of the box, or the setup of the experiment. Just this “meta-distribution”.
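Concretely (a sketch, with the two boxes written as discrete distributions over p): carrying the whole distribution means the second-attempt prediction falls out of a Bayes update, with no separate memory of the box’s colour.

```python
def predict(dist):
    """Probability the next coin pays out, given a distribution over p."""
    return sum(w * p for p, w in dist.items())

def update_on_payout(dist):
    """Bayes update of the distribution over p after observing one payout."""
    new = {p: w * p for p, w in dist.items()}
    z = sum(new.values())
    return {p: w / z for p, w in new.items()}

brown = {0.45: 1.0}
green = {0.0: 0.5, 0.9: 0.5}

print(predict(brown), predict(green))  # both 0.45 on the first attempt
print(predict(update_on_payout(brown)),
      predict(update_on_payout(green)))  # 0.45 vs 0.9 after a payout
```

The two boxes agree on the first attempt and diverge on the second, and nothing beyond the distribution itself had to be remembered to get there.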

• The subtlety is about what numerical data can formally represent your full state of knowledge. The claim is that a mere probability of getting the \$2 payout does not.

However, a single probability for each outcome given each strategy is all the information needed. The problem is not with using single probabilities to represent knowledge about the world; it’s the straw math that was used to represent the technique. To me, this reasoning is equivalent to the following:

“You work at a store where management is highly disorganized. Although they precisely track the number of days you have worked since the last payday, they never remember when they last paid you, and thus every day of the work week has a 1/5 chance of being a payday. For simplicity’s sake, let’s assume you earn \$100 a day.

You wake up on Monday and do the following calculation: if you go in to work, you have a 1/5 chance of being paid. Thus the expected payoff of working today is \$20, which is too low for it to be worth it. So you skip work. On Tuesday, you make the same calculation, and decide that it’s not worth it to work again, and so you continue forever.

I visit you and immediately point out that you’re being irrational. After all, a salary of \$100 a day clearly is worth it to you, yet you are not working. I look at your calculations, and immediately find the problem: you’re using a single probability to represent your expected payoff from working! I tell you that using a meta-probability distribution fixes this problem, and so you excitedly scrap your previous calculations and set about using a meta-probability distribution instead. We decide that a Gaussian sharply peaked at 0.2 best represents our meta-probability distribution, and I send you on your way.”

Of course, in this case, the meta-probability distribution doesn’t change anything. You still continue skipping work, because I have devised the hypothetical situation to illustrate my point (evil laugh). The point is that in this problem the meta-probability distribution solves nothing, because the problem is not a lack of meta-probability, but rather a lack of considering future consequences.

In both the OP’s example and mine, the problem is that the math was done incorrectly, not that you need meta-probabilities. As you said, meta-probabilities are a method of screening off additional labels on your probability distributions for a particular class of problems where you are taking repeated samples that are entangled in a very particular sort of way. As I said above, I appreciate the exposition of meta-probabilities as a tool, and your comment as well has helped me better understand their instrumental nature, but I take issue with what sort of tool they are presented as.

If you do the calculations directly with the probabilities, your calculation will succeed if you do the math right, and fail if you do the math wrong. Meta-probabilities are a particular way of representing a certain calculation, and they succeed or fail in their own right. If you use them to represent the correct direct probabilities, you will get the right answer, but they are only an aid in the calculation; they never fix any problem with direct probability calculations. The fixing of the calculation and the use of meta-probabilities are orthogonal issues.

To make a blunt analogy, this is like someone trying to plug an Ethernet cable into a phone jack, then saying “when Ethernet fails, wifi works”, while conveniently plugging in the wifi adapter correctly.

The key of the dispute, in my eyes, is not whether wifi can work in certain situations, but whether there’s anything actually wrong with Ethernet in the first place.

• So, my observation is that without meta-distributions (or A_p), or conditioning on a pile of past information (and thus tracking *more* than just a probability distribution over current outcomes), you don’t have the room in your state of knowledge to even talk about sensitivity to new information coherently. Once you can talk about a complete state of knowledge, you can begin to talk about the utility of long-term strategies.

For example, in your example, one would have the same probability of being paid today if 20% of employers actually paid you every day, whilst 80% of employers never paid you. But in such an environment, it would not make sense to work a second day in 80% of cases. The optimal strategy depends on what you know, and to represent that in general requires more than a straight probability.
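The two payday models in that comparison can be sketched the same way as the boxes (my encoding, not the commenter’s): identical single-day probability, very different information dynamics.

```python
# Distributions over "probability my employer pays on any given day".
mixture = {1.0: 0.2, 0.0: 0.8}   # 20% of employers pay every day, 80% never
iid     = {0.2: 1.0}             # every day independently has a 1/5 chance

def pay_prob(model):
    return sum(w * p for p, w in model.items())

def after_unpaid_day(model):
    """Bayes update after working a day and not being paid."""
    new = {p: w * (1 - p) for p, w in model.items()}
    z = sum(new.values())
    return {p: w / z for p, w in new.items()}

print(pay_prob(mixture), pay_prob(iid))        # both 0.2 on day one
print(pay_prob(after_unpaid_day(mixture)),     # 0.0: daily payers ruled out
      pay_prob(after_unpaid_day(iid)))         # 0.2: unchanged
```

One unpaid day collapses the mixture model (working again is worthless) while leaving the iid model untouched, which is exactly the point about sensitivity to new information.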

There are different problems coming from the distinction between choosing a long-term policy to follow and choosing a one-shot action. But we can’t even approach this question in general unless we can talk sensibly about a sufficient set of information to keep track of. These are two distinct problems, one prior to the other.

Jaynes does discuss a problem which is closer to your concerns: that of estimating neutron multiplication in a 1-d experiment (section 18.15, p. 579). He’s comparing two approaches, which for my purposes differ in their prior A_p distribution.

• Jeremy, I think the apparent disagreement here is due to unclarity about what the point of my argument was. The point was not that this situation can’t be analyzed with decision theory; it certainly can, and I did so. The point is that different decisions have to be made in two situations where the probabilities are the same.

Your discussion seems to equate “probability” with “utility”, and the whole point of the example is that, in this case, they are not the same.

• I guess my position is thus:

While there are sets of probabilities which by themselves are not adequate to capture the information about a decision, there always is a set of probabilities which is adequate to capture the information about a decision.

In that sense I do not see your article as an argument against using probabilities to represent decision information, but rather a reminder to use the correct set of probabilities.

• In that sense I do not see your article as an argument against using probabilities to represent decision information, but rather a reminder to use the correct set of probabilities.

My understanding of Chapman’s broader point (which may differ wildly from his understanding) is that determining which set of probabilities is correct for a situation can be rather hard, and so it deserves careful and serious study from people who want to think about the world in terms of probabilities.

• It may be helpful to read some related posts (linked by lukeprog in a comment on this post): Estimate stability, and Model Stability in Intervention Assessment, which comments on Why We Can’t Take Expected Value Estimates Literally (Even When They’re Unbiased). The first of those motivates the A_p (meta-probability) approach, the second uses it, and the third explains intuitively why it’s important in practice.

• Thanks, Jonathan, yes, that’s how I understand it.

Jaynes’ discussion motivates A_p as an efficiency hack that allows you to save memory by forgetting some details. That’s cool, although not the point I’m trying to make here.

• I do not believe that this is a failure of applying a single probability to the situation, but merely calculating the probability wrongly

A single probability cannot sum up our knowledge.

Before we talk about plans, as you went on to, we must talk about the world as it stands. We know there is a 50% chance of a 0% machine and a 50% chance of a 90% machine. Saying 45% does not encode this information. No other number does either.

Scalar probabilities of binary outcomes are such a useful hammer that we need to stop and remember sometimes that not all uncertainties are nails.

• Jeremy, thank you for this. To be clear, I wasn’t suggesting that meta-probability is the solution. It’s a solution. I chose it because I plan to use this framework in later articles, where it will (I hope) be particularly illuminating.

I would take issue with the first section of this article in which you establish single probability (expected utility) calculations as insufficient for the problem.

I don’t think it’s correct to equate probability with expected utility, as you seem to do here. The probability of a payout is the same in the two situations. The point of this example is that the probability of a particular event does not determine the optimal strategy. Because utility is dependent on your strategy, that also differs.

This problem easily succumbs to standard expected value calculations if all actions are considered.

Yes, absolutely! I chose a particularly simple problem, in which the correct decision-theoretic analysis is obvious, in order to show that probability does not always determine optimal strategy. In this case, the optimal strategies are clear (except for the exact stopping condition), and clearly different, even though the probabilities are the same.

I’m using this as an introductory wedge example. I’ve opened a Pandora’s Box: probability by itself is not a fully adequate account of rationality. Many odd things will leap and creep out of that box so long as we leave it open.

• I don’t think it’s correct to equate probability with expected utility, as you seem to do here. The probability of a payout is the same in the two situations. The point of this example is that the probability of a particular event does not determine the optimal strategy. Because utility is dependent on your strategy, that also differs.

Hmmm. I was equating them as part of the standard technique of calculating the probability of outcomes from your actions, then multiplying by the utilities of the outcomes and summing to find the expected utility of a given action.

I think it’s just a question of what you think the error is in the original calculation. I find the error to be the conflation of “payout” (as in the immediate reward from inserting the coin) with “payout” (as in the expected reward from your action, including short-term and long-term rewards). It seems to me that you are saying that you can’t look at the immediate probability of payout:

The point of this example is that the probability of a particular event does not determine the optimal strategy. Because utility is dependent on your strategy, that also differs.

which I agree with. But you seem to ignore the obvious solution of considering the probability of total payout, including considerations about your strategy. In that case, you really do have a single probability representing the likelihood of a single outcome, and you do get the correct answer. So I don’t see where the issue with using a single probability comes from. It seems to me an issue with using the wrong single probability.

And especially troubling is that you seem to agree that using direct probabilities to calculate the single probability of each outcome and then weighing them by desirability will give you the correct answer, but then you say

probability by itself is not a fully adequate account of rationality.

which may be true, but I don’t think is demonstrated at all by this example.

Thank you for further explaining your thinking.

• I don’t think is demonstrated at all by this example.

Yes, I see your point (although I don’t altogether agree). But, again, what I’m doing here is setting up analytical apparatus that will be helpful for more difficult cases later.

In the meantime, the LW posts I pointed to here may motivate more strongly the claim that probability alone is an insufficient guide to action.

• I think a much better approach is to assign models to the problem (e.g. “it’s a box that has 100 holes, 45 open and 55 plugged; the machine picks one hole, and you get 2 coins if the hole is open and nothing if it’s plugged”), and then have a probability distribution over models. This is better because it keeps probabilities assigned to facts about the world.

It’s true that probabilities-of-probabilities are just an abstraction of this (when used correctly), but I’ve found that people get confused really fast if you ask them to think in terms of probabilities-of-probabilities. (See every confused discussion of “what’s the standard deviation of the standard deviation?”)

• I think a much better approach is to assign models to the problem … and then have a probability distribution over models … It’s true that probabilities-of-probabilities are just an abstraction of this

Aren’t Chapman’s approach and your approach completely identical?

As per the OP’s graphs, each point on the X axis represents a model, and the height of the blue line is the probability assigned to that model.

Or did you just mean that your way is a better way to phrase it for not confusing everyone?

• Or did you just mean that your way is a better way to phrase it for not confusing everyone?

Right. It’s good for not confusing new people, and sometimes also good for not confusing yourself.

• Oh ok.

I misinterpreted because you said “better” (implying a difference), and “abstraction” is not necessarily the same as “identical”.

• Suppose we’re using Laplace’s Rule of Succession on a coin. On the zeroth round, before we have seen any evidence, we assign probability 0.5 to the first coinflip coming up heads. We also assign marginal probability 0.5 to the second flip coming up heads, the third flip coming up heads, and so on. What distinguishes the Laplace epistemic state from the ‘certainty of a fair coin’ epistemic state is that they represent different probability distributions over sequences of coinflips.

Since the probabilities of some events are correlated, we must represent our states of knowledge by assigning probabilities to sequences or sets of events; our states of knowledge cannot be represented by stating marginal probabilities for all events independently.

We could also try to summarize some features of such epistemic states by talking about the instability of estimates—the degree to which they are easily updated by knowledge of other events—though of course this will be a derived feature of the probability distribution, rather than an ontologically extra feature of probability.
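The contrast between the two epistemic states above can be written as predictive rules: both assign 0.5 to the first flip, but only the Laplace state moves with the evidence.

```python
def laplace_next(heads, flips):
    """Laplace's Rule of Succession: P(next flip heads | heads seen in flips)."""
    return (heads + 1) / (flips + 2)

def fair_next(heads, flips):
    """Certainty of a fair coin: the evidence is ignored."""
    return 0.5

print(laplace_next(0, 0), fair_next(0, 0))  # both 0.5 before any evidence
print(laplace_next(3, 3), fair_next(3, 3))  # 0.8 vs 0.5 after three heads
```

The instability of the Laplace estimate—how far it moves after three heads—is derived entirely from the distribution over sequences, with no extra machinery.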

I reject that this is a good reason for probability theorists to panic.

On the meta level I remark that panic represents a failure of reductionist effort; that is, it would be possible to reduce things to simple probabilities by putting in an effort, but there is a temptation to not put in this effort and instead complicate our view of probability. After seeing this reduction work a few dozen times, however, one begins to acquire (by Laplace’s Rule of Succession) some degree of confidence that it can be carried out on the next occasion as well, even if the manner of doing so is not immediately obvious, and a hasty assertion of a fake reduction would not be helpful.

• We could also try to sum­ma­rize some fea­tures of such epistemic states by talk­ing about the in­sta­bil­ity of es­ti­mates—the de­gree to which they are eas­ily up­dated by knowl­edge of other events

Yes, this is Jaynes’ A_p ap­proach.

this will be a de­rived fea­ture of the prob­a­bil­ity dis­tri­bu­tion, rather than an on­tolog­i­cally ex­tra fea­ture of prob­a­bil­ity.

I’m not sure I fol­low this. There is no prior dis­tri­bu­tion for the per-coin pay­out prob­a­bil­ities that can ac­cu­rately re­flect all our knowl­edge.

I re­ject that this is a good rea­son for prob­a­bil­ity the­o­rists to panic.

Yes, it’s clear from com­ments that my OP was some­what mis­lead­ing as to its pur­pose. Over­all, the se­quence in­tends to dis­cuss cases of un­cer­tainty in which prob­a­bil­ity the­ory is the wrong tool for the job, and what to do in­stead.

How­ever, this par­tic­u­lar ar­ti­cle in­tended only to in­tro­duce the idea that one’s con­fi­dence in a prob­a­bil­ity es­ti­mate is in­de­pen­dent from that es­ti­mate, and to de­velop the A_p (meta-prob­a­bil­ity) ap­proach to ex­press­ing that con­fi­dence.

• I’m not sure I fol­low this. There is no prior dis­tri­bu­tion for the per-coin pay­out prob­a­bil­ities that can ac­cu­rately re­flect all our knowl­edge.

Are we talk­ing about the Laplace vs. fair coins? Are you claiming there’s no prior dis­tri­bu­tion over se­quences which re­flects our knowl­edge? If so I think you are wrong as a mat­ter of math.

• Are you claiming there’s no prior dis­tri­bu­tion over se­quences which re­flects our knowl­edge?

No. Well, not so long as we’re al­lowed to take our own ac­tions into ac­count!

I want to em­pha­size—since many com­menters seem to have mis­taken me on this—that there’s an ob­vi­ous, cor­rect solu­tion to this prob­lem (which I made ex­plicit in the OP). I de­liber­ately made the prob­lem as sim­ple as pos­si­ble in or­der to pre­sent the A_p frame­work clearly.

Are we talk­ing about the Laplace vs. fair coins?

Not sure what you are ask­ing here, sorry...

• Are you claiming there’s no prior dis­tri­bu­tion over se­quences which re­flects our knowl­edge?

No. Well, not so long as we’re al­lowed to take our own ac­tions into ac­count!

Heh! Yes, tra­di­tional causal mod­els have struc­ture be­yond what is pre­sent in the cor­re­spond­ing prob­a­bil­ity dis­tri­bu­tion over those mod­els, though this has to do with com­put­ing coun­ter­fac­tu­als rather than meta-prob­a­bil­ity or es­ti­mate in­sta­bil­ity. Work con­tinues at MIRI de­ci­sion the­ory work­shops on the search for ways to turn some of this back into prob­a­bil­ity, but yes, in my world causal mod­els are things we as­sign prob­a­bil­ities to, over and be­yond prob­a­bil­ities we as­sign to joint col­lec­tions of events. They are still mod­els of re­al­ity to which a prob­a­bil­ity is as­signed, though. (See Judea Pearl’s “Why I Am Only A Half-Bayesian”.)

• I don’t really understand what “being Bayesian about causal models” means. What makes the most sense (e.g. what people typically do) is:

(a) “be Bayesian about statis­ti­cal mod­els”, and

(b) Use ad­di­tional as­sump­tions to in­ter­pret the out­put of (a) causally.

(a) makes sense because I understand how evidence helps me select among sets of statistical alternatives.

(b) also makes sense, but then no one will ac­cept your an­swer with­out ac­tu­ally ver­ify­ing the causal model by ex­per­i­ment—be­cause your as­sump­tions link­ing the statis­ti­cal model to a causal one may not be true. And this game of ver­ify­ing these as­sump­tions doesn’t seem like a Bayesian kind of game at all.

I don’t know what it means to use Bayes the­o­rem to se­lect among causal mod­els di­rectly.

• It means that you figure out which causal mod­els look more or less like what you ob­served.

More gen­er­ally: There’s a lan­guage of causal mod­els which, we think, al­lows us to de­scribe the ac­tual uni­verse, and many other uni­verses be­sides. Some of these mod­els are sim­pler than oth­ers. Any given se­quence of ex­pe­riences has some prob­a­bil­ity of be­ing en­coun­tered in a given causal uni­verse.

• Thanks for writ­ing this up! I’ve been want­ing to write some­thing on the Ap dis­tri­bu­tion since April, but hadn’t got­ten around to it. I look for­ward to your forth­com­ing posts.

I find [the Ap dis­tri­bu­tion] highly in­tu­itive, but it seems to have had al­most no in­fluence or ap­pli­ca­tion in prac­tice.

There aren’t many cita­tions of Jaynes on the Ap dis­tri­bu­tion, but model un­cer­tainty gets dis­cussed a lot, and is mod­el­ing the same kind of thing in a Bayesian way.

On the sub­ject of ap­plied ra­tio­nal­ity be­ing a lot more than prob­a­bil­ity es­ti­mates, see also When Not to Use Prob­a­bil­ities, Ex­plicit and tacit ra­tio­nal­ity, and… well, The Se­quences.

On the Ap dis­tri­bu­tion and model un­cer­tainty more gen­er­ally, see also Model Sta­bil­ity in In­ter­ven­tion Assess­ment, Model Com­bi­na­tion and Ad­just­ment, Why We Can’t Take Ex­pected Value Es­ti­mates Liter­ally, and The Op­ti­mizer’s Curse and How to Beat It.

• Luke, thank you for these poin­t­ers! I’ve read some of them, and have the rest open in tabs to read soon.

• What’s in­ter­est­ing is that, when you have to de­cide whether or not to gam­ble your first coin, the prob­a­bil­ity is ex­actly the same in the two cases (p=0.45 of a \$2 pay­out). How­ever, the ra­tio­nal course of ac­tion is differ­ent. What’s up with that?

That’s pretty triv­ial.

The ex­pected pay­out of putting a coin into a brown box is 0.90.

The ex­pected pay­out of putting a coin into a green box is 0.90 plus valuable in­for­ma­tion about what kind of a green box it is. It is a *differ­ent pay­out*.

• The term “metaprob­a­bil­ity” strikes me as adding con­fu­sion. The two lay­ers are not the same thing ap­plied to it­self, but are in fact differ­ent ques­tions. “What frac­tion of the time does this box pay out?” is a differ­ent ques­tion from “Is this box go­ing to pay out on the next coin?”.

Often it takes a lot of ques­tions to fully de­scribe a situ­a­tion. Us­ing the term “prob­a­bil­ity” for all of them hides the dis­tinc­tion.

• But it is—you’re answering the questions “what is the probability that this box will pay out next time?” and “what is the probability that my probability assignment was correct?”

• What does it mean for a prob­a­bil­ity as­sign­ment to be cor­rect, as op­posed to well-cal­ibrated? Real­ity is or is not.

• I mostly meant well cal­ibrated, but...

There is some­thing-like-cor­rect­ness in that, given the ev­i­dence available to you, there is a cor­rect way to up­date your prior. That is strictly not a fact about your pos­te­rior, but I think it’s a le­gi­t­i­mate thing to talk about in terms of ‘cor­rect­ness’.

• Here, a sin­gle prob­a­bil­ity value fails to cap­ture ev­ery­thing you know about an un­cer­tain event.

There’s more than one event. If you as­sign a sin­gle prob­a­bil­ity to win­ning the first, third, and sev­enth times and failing the sec­ond, fourth, fifth, and sixth times given that you put in seven coins, etc. that cap­tures ev­ery­thing you need to know and does not in­volve meta-prob­a­bil­ities.

More suc­cinctly, the prob­a­bil­ity of win­ning on the sec­ond try given that you win on the first try is differ­ent de­pend­ing on the color of the ma­chine.
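
A quick numeric check of that claim (the helper function and the 50/50 prior over the two green boxes are mine):

```python
def p_seq(wins, box):
    """P(observing this exact win/loss sequence) for a box, where `box`
    is a list of (prior, per-coin payout probability) hypotheses."""
    total = 0.0
    for prior, p in box:
        likelihood = 1.0
        for won in wins:
            likelihood *= p if won else (1 - p)
        total += prior * likelihood
    return total

green = [(0.5, 0.9), (0.5, 0.0)]   # drew one of the two green boxes
brown = [(1.0, 0.45)]

# P(win on the first coin) is 0.45 for both colors...
print(p_seq([True], green), p_seq([True], brown))
# ...but P(win on the second | won the first) is 0.9 vs 0.45:
print(p_seq([True, True], green) / p_seq([True], green),
      p_seq([True, True], brown) / p_seq([True], brown))
```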

• Right: a game where you re­peat­edly put coins in a ma­chine and de­cide whether or not to put in an­other based on what oc­curred is not a sin­gle ‘event’, so you can’t sum up your in­for­ma­tion about it in just one prob­a­bil­ity.

• What’s in­ter­est­ing is that, when you have to de­cide whether or not to gam­ble your first coin, the prob­a­bil­ity is ex­actly the same in the two cases (p=0.45 of a \$2 pay­out). How­ever, the ra­tio­nal course of ac­tion is differ­ent. What’s up with that?

Why on earth should we expect the long-term expected value of all future consequences of a choice to be equal to the immediate payoff? They are two different things. Learning is the most obvious example of when these can be expected to differ: in this case learning information, in other cases learning skills.

• The state­ment “prob­a­bil­ity es­ti­mates are not, by them­selves, ad­e­quate to make ra­tio­nal de­ci­sions” could ap­par­ently have been re­placed with the state­ment “my defi­ni­tion of the phrase ‘prob­a­bil­ity es­ti­mates’ is less in­clu­sive than yours”—what you call a “meta-prob­a­bil­ity” I would have just called a “prob­a­bil­ity”. In a world where both epistemic and aleatory un­cer­tainty ex­ist, your ex­pec­ta­tion of events in that world is go­ing to look like a prob­a­bil­ity dis­tri­bu­tion over a space of prob­a­bil­ity dis­tri­bu­tions over out­puts; this is still a prob­a­bil­ity dis­tri­bu­tion, just a much more ex­pen­sive one to do ap­prox­i­mate calcu­la­tions with.

• Yes, meta-prob­a­bil­ities are prob­a­bil­ities, al­though some­what odd ones; they obey the nor­mal rules of prob­a­bil­ity. Jaynes dis­cusses this in his Chap­ter 18; his dis­cus­sion there is worth a read.

The state­ment “prob­a­bil­ity es­ti­mates are not, by them­selves, ad­e­quate to make ra­tio­nal de­ci­sions” was meant to de­scribe the en­tire se­quence, not this ar­ti­cle.

I’ve re­vised the first para­graph of the ar­ti­cle, since it seems to have mis­led many read­ers. I hope the point is clearer now!

• I’m look­ing for­ward to the rest of your se­quence, thanks!

I was re­cently read­ing through a month-old blog post where one lousy com­ment was ar­gu­ing against a straw­man of Bayesian rea­son­ing wherein you deal with prob­a­bil­ities by “mush­ing them all into a sin­gle num­ber”. I im­me­di­ately rec­ol­lected that the lat­est thing I saw on LessWrong was a fan­tas­tic sum­mary of how you can treat mixed un­cer­tainty as a prob­a­bil­ity-dis­tri­bu­tion-of-prob­a­bil­ity-dis­tri­bu­tions. I con­sid­ered post­ing a be­lated link in re­ply, un­til I dis­cov­ered that the lousy com­ment was writ­ten by David Chap­man and the fan­tas­tic sum­mary was writ­ten by David_Chap­man.

I’m not sure if later you’re go­ing to go off the rails or change my mind or what, but so far this looks like one of the great­est at­tempts at “steel­man­ning” that I’ve ever seen on the in­ter­net.

• Thanks, that’s re­ally funny! “On the other hand” is my gen­eral ap­proach to life, so I’m happy to ar­gue with my­self.

And yes, I’m steel­man­ning. I think this ap­proach is an ex­cel­lent one in some cases; it will break down in oth­ers. I’ll pre­sent a first one in the next ar­ti­cle. It’s an­other box you can put coins in that (I’ll claim) can’t use­fully be mod­eled in this way.

Here’s the quote from Jaynes, by the way:

What are we do­ing here? It seems al­most as if we are talk­ing about the ‘prob­a­bil­ity of a prob­a­bil­ity’. Pend­ing a bet­ter un­der­stand­ing of what that means, let us adopt a cau­tious no­ta­tion that will avoid giv­ing pos­si­bly wrong im­pres­sions. We are not claiming that P(Ap|E) is a ‘real prob­a­bil­ity’ in the sense that we have been us­ing that term; it is only a num­ber which is to obey the math­e­mat­i­cal rules of prob­a­bil­ity the­ory.

• Thanks for post­ing this! :D I’m cu­ri­ous to see where you go next.

Whereas, when you’ve cho­sen one of the two green boxes at ran­dom, the curve looks like this:

It seems odd to me that the mode for the left mix­ture is to the right of 0. I would have put it at 0, and made that mix­ture twice as tall so the area un­der­neath would still be the same.

• Yup, it’s definitely wrong! I was hop­ing no one would no­tice. I thought it would be a dis­trac­tion to ex­plain why the two are differ­ent (if that’s not ob­vi­ous), and also I didn’t want to figure out ex­actly what the right math was to feed to my plot­ting pack­age for this case. (Is the cor­rect form of the curve for the p=0 case ob­vi­ous to you? It wasn’t ob­vi­ous to me, but this isn’t my area of ex­per­tise...)

• I thought it would be a dis­trac­tion to ex­plain why the two are differ­ent (if that’s not ob­vi­ous)

I would have left it unexplained in the post, and then explained it in the comments when the first person asked about it. In my experience, casually remarked semi-obvious true facts like that (“why are these two not equally tall?” “Because the area underneath is what matters”) are useful at convincing people of technical ability.

Is the cor­rect form of the curve for the p=0 case ob­vi­ous to you? It wasn’t ob­vi­ous to me, but this isn’t my area of ex­per­tise...

I probably would have gone with the point mass approximation, i.e. a big circle at (0, .5), a line down to (0, 0), a line over to (.9, 0), and then a line up to a big circle at (.9, .5), then also a line from (.9, 0) to (1, 0). Using the Gaussian mixtures, though, I’d probably give them the same variance and just give the left one twice the weight of the right one, center them at 0 and .9, and then display only between 0 and 1. Using the pure functional form, that would look something like 2exp(-x^2/v) + exp(-(x-.9)^2/v).

Now, this is as­sum­ing we have some sort of Gaus­sian prior. We could also have a beta prior, which is con­ju­gate to the bino­mial dis­tri­bu­tion, which is nice be­cause that fits our testbed. Gaus­sian might be ap­pro­pri­ate be­cause we’ve ac­tu­ally opened the sys­tem up and we think the mea­sure­ment sys­tem it uses has Gaus­sian noise.

I’m not sure I agree with the claim that the variance is the same; you could probably assert that the chance the left one will pay out is 0 to arbitrarily high precision, and it seems likely the variance would depend on the number of plugs filled. That said, this doesn’t have much impact, and saying “we’ll approximate away the meta-meta-probability to simplify this example” seems like it goes against your general point, and is thus inadvisable.

• Here, a sin­gle prob­a­bil­ity value fails to cap­ture ev­ery­thing you know about an un­cer­tain event. And, it’s a case in which that failure mat­ters.

Of course it doesn’t. Who ever said it does? De­ci­sions are made on the ba­sis of ex­pected value, not prob­a­bil­ity. And your anal­y­sis of the first bet ig­nores the value of the in­for­ma­tion gained from it in ex­e­cut­ing your op­tions for fur­ther play there­after.

I think you’re just fun­da­men­tally con­fus­ing the prob­a­bil­ity of a win on the first coin with the ex­pected long run fre­quency of wins for the differ­ent boxes. En­tirely differ­ent things.

We can’t be ab­solutely sure the prob­a­bil­ity is 0.5.

This statement indicates a lack of understanding of Jaynes, or at least a departure from his foundations. Probability is assigned by an agent based on information; there is no value that the probability is besides what the agent assigns.

Jaynes specifically analyzes coin flipping, correctly asserting that the probability of the outcome of a coin flip will depend on your knowledge of the initial state of the coin, the force applied to it, and their relation to the outcome. He even describes a method of controlling the outcome, and I believe he shared his own data on executing that method, showing how the frequency of heads/tails could be made to deviate appreciably from 0.5.

Having said that, I’ve always found Jaynes’ “inner robot” interesting, and have the feeling the idea has real potential.

• De­ci­sions are made on the ba­sis of ex­pected value, not prob­a­bil­ity.

Yes, that’s the point here!

your anal­y­sis of the first bet ig­nores the value of the in­for­ma­tion gained from it in ex­e­cut­ing your op­tions for fur­ther play there­after.

By “the first bet” I take it that you mean “your first op­por­tu­nity to put a coin in a green box” (rather than mean­ing “brown box”).

My anal­y­sis of that was “you should put some coins in the box”, ex­actly be­cause of the in­for­ma­tion gain.

This state­ment in­di­cates a lack of un­der­stand­ing of Jaynes, or at least an ad­her­ence to his foun­da­tions.

This post was based closely on the Chap­ter 18 of Jaynes’ book, where he writes:

Suppose you have a penny and you are allowed to examine it carefully, and convince yourself that it is an honest coin; i.e. accurately round, with head and tail, and a center of gravity where it ought to be. Then you’re asked to assign a probability that this coin will come up heads on the first toss. I’m sure you’ll say 1/2. Now, suppose you are asked to assign a probability to the proposition that there was once life on Mars. Well, I don’t know what your opinion is there, but on the basis of all the things that I have read on the subject, I would again say about 1/2 for the probability. But, even though I have assigned the same ‘external’ probabilities to them, I have a very different ‘internal’ state of knowledge about those propositions.

Do you think he’s say­ing some­thing differ­ent from me here?

• I don’t like your use of the word “prob­a­bil­ity”. Some­times, you use it to de­scribe sub­jec­tive prob­a­bil­ities, but some­times you use it to de­scribe the fre­quency prop­er­ties of putting a coin in a given box.

When you say, “The brown box has 45 holes open, so it has probability p=0.45 of returning two coins,” you are really saying that, knowing I have the brown box in front of me and I put a coin in it, I would assign a 0.45 probability to that coin yielding 2 coins. And, as far as I know, the coin tosses are all independent: no number of coin tosses would ever tell me anything about the next one. Simply put, a box, along with the way we toss coins into it, has rather definite frequency properties.

Then you talk about “as­sign­ing prob­a­bil­ities to each pos­si­ble prob­a­bil­ity be­tween 0 and 1”. What you re­ally wanted to say is as­sign­ing a prob­a­bil­ity dis­tri­bu­tion over the pos­si­ble fre­quency prop­er­ties.

I know it sounds pedan­tic, but I cringe ev­ery time some­one talks about “prob­a­bil­ities” be­ing some prop­er­ties of a real ob­ject out there in the ter­ri­tory (like am­pli­tudes in QM). Prob­a­bil­ity is in the mind. Us­ing the word any other way is con­fus­ing.

• So per­haps this is for the next post, but are these ‘metaprob­a­bil­ities’ just reg­u­lar hy­per­pa­ram­e­ters?

• I was won­der­ing this too. I haven’t looked at this A_p dis­tri­bu­tion yet (nor have I read all the com­ments here), but hav­ing dis­tri­bu­tions over dis­tri­bu­tions is, like, the core of Bayesian meth­ods in ma­chine learn­ing. You don’t just keep a sin­gle es­ti­mate of the prob­a­bil­ity; you keep a dis­tri­bu­tion over pos­si­ble prob­a­bil­ities, ex­actly like David is say­ing. I don’t even know how up­dat­ing your prob­a­bil­ity dis­tri­bu­tion in light of new ev­i­dence (aka a “Bayesian up­date”) would work with­out this.

Am I miss­ing some­thing about David’s post? I did go through it rather quickly.

• I’m sure you know more about this than I do! Based on a quick Wiki check, I suspect that formally the A_p are one type of hyperprior, but not all hyperpriors are A_p (a/k/a metaprobabilities).

Hyperparameters are used in Bayesian sensitivity analysis, a/k/a “Robust Bayesian Analysis”, which I recently accidentally reinvented here. I might write more about that later in this sequence.

• When you use an un­der­score in a name, make sure to es­cape it first, like so:

```
I suspect that formally the A\_p are one type of [hyperprior](http://en.wikipedia.org/wiki/Hyperprior), but not all hyperpriors are A\_p (a/k/a metaprobabilities).
```

(This is nec­es­sary be­cause un­der­scores are yet an­other way to make things italic, and only ap­plies to com­ments, as posts use differ­ent for­mat­ting.)

• Thanks! Fixed.

• Yeah—from what I’ve seen, some­thing math­e­mat­i­cally equiv­a­lent to A_p dis­tri­bu­tions are com­monly used, but that’s not what they’re called.

Like, I think you might call the case in this prob­lem “a Bernoulli ran­dom vari­able with an un­known pa­ram­e­ter”. (The Bernoulli ran­dom vari­able be­ing 1 if it gives you \$2, 0 if it gives you \$0). And then the hy­per­prior would be the prob­a­bil­ity dis­tri­bu­tion of that pa­ram­e­ter, I guess? I haven’t re­ally heard that word be­fore.

ET Jaynes, of course, would never talk like this be­cause the idea of a ran­dom quan­tity ex­ist­ing in the real world is a mind pro­jec­tion fal­lacy. Thus, no “ran­dom vari­ables”. So he uses the A_p dis­tri­bu­tion as a way of think­ing about the same math with­out the idea of ran­dom­ness. Jaynes’s A_p in this case cor­re­sponds ex­actly to the more tra­di­tional “the pa­ram­e­ter of the Bernoulli ran­dom vari­able is p”.
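
For the record, the traditional version is just a conjugate update on that parameter. A sketch (the Beta(1,1) starting point is the uniform/Laplace choice, my assumption rather than anything in Jaynes):

```python
# Beta(a, b) hyperprior on the Bernoulli parameter p; observing
# wins and losses gives the posterior Beta(a + wins, b + losses).
def update(a, b, wins, losses):
    return a + wins, b + losses

def posterior_mean(a, b):
    """Expected payout probability; with a = b = 1 this reproduces
    Laplace's Rule of Succession, (wins + 1) / (trials + 2)."""
    return a / (a + b)

a, b = 1, 1                  # uniform prior over p
a, b = update(a, b, 3, 1)    # say we saw 3 payouts and 1 miss
print(posterior_mean(a, b))  # 4/6 ~ 0.667
```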

(btw I have a purely mathematical question about the A_p distribution chapter, which I posted to the open thread: http://lesswrong.com/lw/ii6/open_thread_september_28_2013/9pbn ; if you know the answer I’d really appreciate it if you told me)

• If you en­joy this sort of thing, you might like to work out what the ex­act op­ti­mal al­gorithm is.

I guess this is a joke. From Wikipedia: “Originally considered by Allied scientists in World War II, it proved so intractable that, according to Peter Whittle, it was proposed the problem be dropped over Germany so that German scientists could also waste their time on it.[10]” (Note that your Wikipedia link is broken.)

• Thank you very much—link fixed!

That’s a re­ally funny quote!

Multi-armed ban­dit prob­lems were in­tractable dur­ing WWII prob­a­bly mainly be­cause com­put­ers weren’t available yet. In many cases, the best ap­proach is brute force simu­la­tion. That’s the way I would ap­proach the “blue box” prob­lem (be­cause I’m lazy).

But ex­act ap­proaches have also been found: “Bur­ne­tas AN and Kate­hakis MN (1996) also pro­vided an ex­plicit solu­tion for the im­por­tant case in which the dis­tri­bu­tions of out­comes fol­low ar­bi­trary (i.e., non­para­met­ric) dis­crete, uni­vari­ate dis­tri­bu­tions.” The blue box prob­lem is within that class.

• Yeah, but that was 60 years ago, and the sin­gle-armed ban­dit prob­lem is eas­ier than the multi-armed ban­dit.

• See Judea Pearl’s Prob­a­blilis­tic Rea­son­ing in In­tel­li­gent Sys­tems, sec­tion 7.3, for a dis­cus­sion of “metaprob­a­bil­ities” in the con­text of graph­i­cal mod­els.

Although it’s true that you could com­pute the cor­rect de­ci­sion by di­rectly putting a dis­tri­bu­tion on all pos­si­ble fu­tures, the com­pu­ta­tional com­plex­ity of this strat­egy grows com­bi­na­to­ri­ally as the sce­nario gets longer. This isn’t a minor point; gen­er­al­iz­ing the brute force method gets you AIXI. That is why you need some­thing like the A_p dis­tri­bu­tion or Pearl’s “con­tin­gen­cies” to store ev­i­dence and rea­son effi­ciently.

• The “meta-prob­a­bil­ity” ap­proach I’ve taken here is the Ap dis­tri­bu­tion of E. T. Jaynes. I find it highly in­tu­itive, but it seems to have had al­most no in­fluence or ap­pli­ca­tion in prac­tice. We’ll see later that it has some prob­lems, which might ex­plain this.

I don’t see how this differs from how any­one else ever han­dles this prob­lem. I hope you ex­plain the differ­ence in this ex­am­ple, be­fore go­ing on to other ex­am­ples.

• Can you point me at some other similar treat­ments of the same prob­lem? Thanks!

• I ask you for a differ­ent treat­ment, so you ask me for a similar treat­ment?
No, I don’t see the point. Doesn’t my re­quest make sense, re­gard­less of whether we agree on what is similar or differ­ent?

• FWIW, I un­der­stood David to be re­quest­ing some spe­cific ex­am­ples of how mem­bers of the set “ev­ery­one else ever” han­dle this prob­lem, which on your ac­count is the same as how Jaynes han­dles it, in or­der to more clearly see the similar­ity you refer­ence.

• Thanks, yes! I.e. who is this “ev­ery­one else,” and where do they treat it the same way Jaynes does? I’m not aware of any ex­am­ples, but I have only a ba­sic knowl­edge of prob­a­bil­ity the­ory.

It’s cer­tainly pos­si­ble that this ap­proach is com­mon, but Jaynes wasn’t ig­no­rant, and he seemed to think it was a new and un­usual and maybe con­tro­ver­sial idea, so I kind of doubt it.

Also, I should say that I have no dog in this fight at all; I’m not ad­vo­cat­ing “Jaynes is the great­est thing since sliced bread”, for ex­am­ple. (Although that does seem to be the opinion of some LW writ­ers.)

• I re­ally liked the ar­ti­cle. So al­low me to miss the for­est for a mo­ment; I want to chop down this tree:

Let’s solve the green box prob­lem:

Try zero coins: EV: 100 coins.

Try one coin, give up if no pay­out: 45% of 180.2 + 55% of 99= c. 135.5 (I hope.)

(I think this is right, but wel­come cor­rec­tions; 90%x50%x178, +.2 for first coin win­ning (EV of that 2 not 1.8), + keeper coins. I definitely got this wrong the first time I wrote it out, so I’m less con­fi­dent I got it right this time. Edit be­fore post­ing: Not just once.)

Try two coins, give up if no pay­out:

45% of 180.2 (pays off first time), 4.5% of 178.2 (second time)

50.5% of 98. To­tal: c.138.6

I used to be quite good at things like this. I also used to watch Hill Street Blues. I make the third round very close:

45% of 180.2, 4.5% of 178.2, .45% of 176.2

50.05% of 97

Or c. 138.45.

So, I pick two as the an­swer.

Quib­ble with the sports­ball graph:

You have lit­tle con­fi­dence, for sure, but chance of win­ning doesn’t fol­low that graph, and there’s just no rea­son it should. If the Pig­gers are play­ing the Oat­meals, and you know noth­ing about them, I’d guess at the ju­nior high level the curve would be fairly flat, but not that flat. If they are pro­fes­sional sports­ballers of the Elite Sports­ballers League, the curve is go­ing to have a higher peak at 50; the Hyper­boles are not go­ing to be 100% to lose or win to the Break­fast Ce­real­ers in higher level play. At the ju­nior high level, there will be some c. 100%ers, but I think the flatline is un­likely, and I think the im­pres­sion that it should be a flat line is mis­taken.

Once again, I liked the ar­ti­cle. It was en­gag­ing and in­ter­est­ing. (And I hope I got the prob­lem right.)

• I also get “stop after two losses,” although my numbers come out slightly differently. However, I suck at this sort of problem, so it’s quite likely I’ve got it wrong.

My temp­ta­tion would be to solve it nu­mer­i­cally (by brute force), i.e. code up a simu­la­tion and run it a mil­lion times and get the an­swer by see­ing which strat­egy does best. Often that’s the right ap­proach. How­ever, some­times you can’t simu­late, and an an­a­lyt­i­cal (ex­act, a pri­ori) an­swer is bet­ter.
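
In this case the analytical answer is cheap enough. Here is a sketch of the exact EV of the strategy “give up after k straight misses, otherwise play everything” (my assumptions: 100 coins, the two green boxes equally likely, winnings not re-inserted):

```python
def ev(k, n=100, p_good=0.9):
    """Expected final wealth for: put in up to k coins; at the first
    payout conclude it's the good box and play all remaining coins."""
    total = 0.5 * (n - k)                  # dead box: lose the k coins tried
    miss = 1.0                             # P(no payout so far | good box)
    for i in range(1, k + 1):
        win_i = miss * p_good              # first payout on coin i
        payoff = 2 + 2 * p_good * (n - i)  # then play the remaining n-i coins
        total += 0.5 * win_i * payoff
        miss *= 1 - p_good
    total += 0.5 * miss * (n - k)          # good box, but k misses anyway
    return total

for k in range(5):
    print(k, round(ev(k), 2))  # 100.0, 135.54, 138.61, 138.46, 138.0
```

This agrees with the calculation above (using the corrected 178.4 and 176.6 payoffs) and confirms that stopping after two misses is best.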

I think you are right about the sports­ball case! I’ve up­dated my meta-meta-prob­a­bil­ity curve ac­cord­ingly :-)

Can you think of a bet­ter ex­am­ple, in which the curve ought to be dead flat?

Jaynes uses “the prob­a­bil­ity that there was once life on Mars” in his dis­cus­sion of this. I’m not sure that’s such a great ex­am­ple ei­ther.

• I think you are right about the sports­ball case! I’ve up­dated my meta-meta-prob­a­bil­ity curve ac­cord­ingly :-)

The wikipe­dia ar­ti­cle on the Beta dis­tri­bu­tion has a good dis­cus­sion of pos­si­ble pri­ors to use. The Jeffreys prior is prob­a­bly the one I’d use for Sports­ball, but the Bayes-Laplace prior is gen­er­ally ac­cept­able as a rep­re­sen­ta­tion of ig­no­rance.

The ex­am­ple I like to give is the un­cer­tain digi­tal coin- I gen­er­ate some dou­ble p be­tween 0 and 1 us­ing a ran­dom num­ber gen­er­a­tor, and then write a func­tion “flip” which gen­er­ates an­other dou­ble, and com­pares it to p. This is analo­gous to your blue box, and if you’re con­fi­dent in the RNG means you have a tight meta-meta-prob­a­bil­ity curve, which jus­tifies the uniform prior.
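
That example is easy to make concrete; a sketch of the digital coin (seeded so it is repeatable):

```python
import random

random.seed(0)
p = random.random()          # hidden per-flip payout probability

def flip():
    """Compare a fresh uniform draw against the hidden bias."""
    return random.random() < p

# With a trusted RNG the uniform prior over p is exactly right, and the
# observed frequency converges on the hidden p (tight meta-meta curve).
n = 100_000
freq = sum(flip() for _ in range(n)) / n
print(p, freq)               # the two should agree to a couple of decimals
```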

Jaynes uses “the prob­a­bil­ity that there was once life on Mars” in his dis­cus­sion of this. I’m not sure that’s such a great ex­am­ple ei­ther.

Yeah, that seems like a good can­di­date for the Hal­dane prior to me.

• 178.2 should be 178.4 (180.2 − 1.8) and 176.2 should be 176.6 (178.4 − 1.8)

This doesn’t change the re­sult, though:

After 2 failed tries, even if you do have the good box, the most your net gain rel­a­tive to stand­ing pat can be is 98 ad­di­tional coins.

But, the odds ra­tio of good box to bad box af­ter 2 failed coins is 1:100 or less than 1% prob­a­bil­ity of good box.

So your ex­pected gain from en­ter­ing the third coin is up­per bounded by (98 x 0.01) - (1 x 0.99) which is less than 0.
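
The same bound in code (using the exact figures rather than the rounded ones):

```python
# Posterior that the green box is the good one, after two straight misses.
prior_good = 0.5
like_good, like_dead = 0.1 ** 2, 1.0   # P(two misses | each hypothesis)
posterior_good = (prior_good * like_good) / (
    prior_good * like_good + (1 - prior_good) * like_dead)
print(posterior_good)        # 1/101, a bit under 1%

# Even the best case (good box, win with every remaining coin) nets at
# most 98 extra coins, so the gain from a third coin is bounded by:
bound = 98 * posterior_good - 1 * (1 - posterior_good)
print(bound)                 # -2/101, about -0.02: stand pat
```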

• The an­swer I got also was to give up af­ter putting in two coins and los­ing both times (as­sum­ing risk neu­tral­ity), if you get a green box.

• 7 Jan 2015 20:56 UTC

Your link to Ap is broken :( Overall, this was really interesting and understandable. Thank you.

• Glad you liked the post! Thanks for point­ing out the link prob­lem. I’ve fixed it, for now. It links to a PDF of a file that’s found in many places on the in­ter­net, but any one of them might be taken down at any time.

• We could call this meta-prob­a­bil­ity, al­though that’s not a stan­dard term.

Then why use it instead of learning the standard terms and using those? This might sound pedantic, but it matters, because this kind of thing leads to the proliferation of unnecessary jargon and sometimes to reinventing the wheel.

Are we talk­ing about con­di­tional prob­a­bil­ity? Joint prob­a­bil­ity?

Also, a minor nit­pick about your next-to-last figure: given what’s said about the boxes, it’s not two bell curves cen­tered at 0 and 0.9. It should be a point mass (ver­ti­cal line) at 0 and a bell curve cen­tered at 0.9.

• Then why use it in­stead of learn­ing the stan­dard terms and us­ing those?

The stan­dard term is A_p, which seemed un­nec­es­sar­ily ob­scure.

Re the figure, see the dis­cus­sion here.

(Sorry to be slow to re­ply to this; I got busy and didn’t check my LW in­box for more than a month.)

• Agree with John Baez, Jeremy Salwen and oth­ers. Stan­dard tools are enough to solve this prob­lem. You don’t need prob­a­bil­ities over prob­a­bil­ities, just prob­a­bil­ities over states of the world, and prob­a­bil­ities over what might hap­pen in each state of the world.

• Has anyone used meta-probabilities, or something similar, to analyze the Pascal’s Mugger problem?

• We can do it now! :)

What sort of prob­lem is one where meta-prob­a­bil­ities are use­ful? One where you get differ­ent chance pay­outs de­pend­ing on differ­ent mod­els of the prob­lem (e.g. one brown box vs. the good green box), and so you want to tell those mod­els apart.

Or if we want meta-meta prob­a­bil­ities, then we could have differ­ent classes of mod­els that you can tell apart (boxes or spheres?), and then differ­ent mod­els that you have to tell apart (good box or bad box?), and then differ­ent out­comes that hap­pen prob­a­bil­is­ti­cally (coins or no coins?).

But the key idea is that we gain some­thing by differ­en­ti­at­ing these differ­ent known ways that the prob­lem could be.

So in the case of someone who says “give me \$5 and I’ll get you into heaven when you die,” what are the layers? Well, they could be a charlatan or not. If they’re not a charlatan, then we can assume for the sake of argument that you’ll get into heaven with certainty, so no meta-probability there. But if they are a charlatan, then there’s some probability you’d get into heaven anyhow, so the probability of “they are a charlatan” is equivalent to a meta-probability for getting into heaven.

Okay, so: what experiment can you do that will let you change your mind about Pascal’s Mugger? Or to put it another way, how can someone convince you even a little that they are not a charlatan? What is the analogy between this and the boxes in the original post?