# Probability, knowledge, and meta-probability

This article is the first in a sequence that will consider situations where probability estimates are not, by themselves, adequate to make rational decisions. This one introduces a “meta-probability” approach, borrowed from E. T. Jaynes, and uses it to analyze a gambling problem. This situation is one in which reasonably straightforward decision-theoretic methods suffice. Later articles introduce increasingly problematic cases.

## A surprising decision anomaly

Let’s say I’ve recruited you as a subject in my thought experiment. I show you three cubical plastic boxes, about eight inches on a side. There are two green ones (identical as far as you can see) and a brown one. I explain that they are gambling machines: each has a faceplate with a slot that accepts a dollar coin, and an output slot that will return either two or zero dollars.

I unscrew the faceplates to show you the mechanisms inside. They are quite simple. When you put a coin in, a wheel spins. It has a hundred holes around the rim. Each can be blocked, or not, with a teeny rubber plug. When the wheel slows to a halt, a sensor checks the nearest hole, and dispenses either zero or two coins.

The brown box has 45 holes open, so it has probability p=0.45 of returning two coins. One green box has 90 holes open (p=0.9) and the other has none (p=0). I let you experiment with the boxes until you are satisfied these probabilities are accurate (or very nearly so).

Then, I screw the faceplates back on, and put all the boxes in a black cloth sack with an elastic closure. I squidge the sack around, to mix up the boxes inside, and you reach in and pull one out at random.

I give you a hundred one-dollar coins. You can put as many into the box as you like. You can keep as many coins as you don’t gamble, plus whatever comes out of the box.

If you pulled out the brown box, there’s a 45% chance of getting $2 back, and the expected value of putting a dollar in is $0.90. Rationally, you should keep the hundred coins I gave you, and not gamble.

If you pulled out a green box, there’s a 50% chance that it’s the one that pays two dollars 90% of the time, and a 50% chance that it’s the one that never pays out. So, overall, there’s a 45% chance of getting $2 back.

Still, rationally, you should put some coins in the box. If it pays out at least once, you should gamble all the coins I gave you, because you know that you got the 90% box, and you’ll nearly double your money.

If you get nothing out after a few tries, you’ve probably got the never-pay box, and you should hold onto the rest of your money. (Exercise for readers: how many no-payouts in a row should you accept before quitting?)
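As a starting point for that exercise, here is a minimal sketch of how your belief about a green box shifts as no-payouts pile up. It assumes only the setup above (a 50/50 prior between the 90% box and the never-pay box); the function names are illustrative. It gives the belief, not the stopping rule, because the value of the information you gain also matters.

```python
from fractions import Fraction

P_GOOD_PRIOR = Fraction(1, 2)   # chance the green box is the 90% one
P_PAY_GOOD = Fraction(9, 10)    # payout probability of the 90% box


def posterior_good(failures: int) -> Fraction:
    """P(box is the 90% one | `failures` no-payouts in a row, no payout yet)."""
    # The 90% box fails each time with probability 0.1;
    # the never-pay box fails every time with probability 1.
    weight_good = P_GOOD_PRIOR * (1 - P_PAY_GOOD) ** failures
    weight_bad = 1 - P_GOOD_PRIOR
    return weight_good / (weight_good + weight_bad)


def next_payout_probability(failures: int) -> Fraction:
    """Chance that the next coin pays out, given the failures seen so far."""
    return posterior_good(failures) * P_PAY_GOOD


for k in range(4):
    print(k, posterior_good(k), float(next_payout_probability(k)))
```

Each no-payout cuts the odds of holding the 90% box by a factor of ten.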

What’s interesting is that, when you have to decide whether or not to gamble your first coin, the probability is exactly the same in the two cases (p=0.45 of a $2 payout). However, the rational course of action is different. What’s up with that?

Here, a single probability value fails to capture everything you **know** about an uncertain event. And, it’s a case in which that failure matters.

Such limitations have been recognized almost since the beginning of probability theory. Dozens of solutions have been proposed. In the rest of this article, I’ll explore one. In subsequent articles, I’ll look at the problem more generally.

## Meta-probability

To think about the green box, we have to reason about *the probabilities of probabilities*. We could call this **meta-probability**, although that’s not a standard term. Let’s develop a method for it.

Pull a penny out of your pocket. If you flip it, what’s the probability it will come up heads? 0.5. Are you sure? Pretty darn sure.

What’s the probability that my local junior high school sportsball team will win its next game? I haven’t a ghost of a clue. I don’t know anything even about professional sportsball, and certainly nothing about “my” team. In a match between two teams, I’d have to say the probability is 0.5.

My girlfriend asked me today: “Do you think Raley’s will have dolmades?” Raley’s is our local supermarket. “I don’t know,” I said. “I guess it’s about 50/50.” But unlike sportsball, I know something about supermarkets. A fancy Whole Foods is very likely to have dolmades; a 7-11 almost certainly won’t; Raley’s is somewhere in between.

How can we model these three cases? One way is by assigning probabilities to each possible probability between 0 and 1. In the case of a coin flip, 0.5 is much more probable than any other probability:

We can’t be *absolutely sure* the probability is 0.5. In fact, it’s almost certainly not *exactly* that, because coins aren’t perfectly symmetrical. And, there’s a very small probability that you’ve been given a tricky penny that comes up tails only 10% of the time. So I’ve illustrated this with a tight Gaussian centered around 0.5.

In the sportsball case, I have no clue what the odds are. They might be anything between 0 and 1:

In the Raley’s case, I have *some* knowledge, and extremely high and extremely low probabilities seem unlikely. So the curve looks something like this:

Each of these curves averages to a probability of 0.5, but they express different degrees of confidence in that probability.
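A rough numerical sketch of the three curves, on a grid of candidate probabilities. The shapes and widths (0.01 for the coin, 0.2 for Raley’s) are assumptions chosen only to illustrate the idea; nothing in the argument depends on them.

```python
import math

GRID = [i / 100 for i in range(101)]  # candidate probabilities 0.00 .. 1.00


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def bump(center, width):
    """Gaussian-shaped weights over GRID (an illustrative shape only)."""
    return normalize([math.exp(-0.5 * ((p - center) / width) ** 2) for p in GRID])


def mean(dist):
    return sum(p * w for p, w in zip(GRID, dist))


def spread(dist):
    """Standard deviation of the meta-distribution: a crude confidence measure."""
    m = mean(dist)
    return math.sqrt(sum(w * (p - m) ** 2 for p, w in zip(GRID, dist)))


curves = {
    "coin": bump(0.5, 0.01),                     # tight: very confident in 0.5
    "sportsball": normalize([1.0] * len(GRID)),  # flat: no idea
    "Raley's": bump(0.5, 0.2),                   # broad: some knowledge
}

for name, dist in curves.items():
    print(f"{name:10s} mean={mean(dist):.3f} spread={spread(dist):.3f}")
```

All three means come out at 0.5; only the spread differs.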

Now let’s consider the gambling machines in my thought experiment. The brown box has a curve like this:

Whereas, when you’ve chosen one of the two green boxes at random, the curve looks like this:

Both these curves give an average probability of 0.45. However, a rational decision theory has to distinguish between them. Your optimal strategy in the two cases is quite different.
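To make the difference concrete, here is a small sketch that represents each curve as a table of weights over payout probabilities. The dictionary representation and function names are my own illustrative choices. It compares the chance of a payout on the first coin with the chance on the second coin after a first payout.

```python
from fractions import Fraction

# Meta-probability as {payout probability: weight on that probability}.
BROWN = {Fraction(45, 100): Fraction(1)}
GREEN = {Fraction(0): Fraction(1, 2), Fraction(9, 10): Fraction(1, 2)}


def payout_probability(meta):
    """Chance the next coin pays out: the mean of the meta-distribution."""
    return sum(p * w for p, w in meta.items())


def update_on_payout(meta):
    """Bayes update of the meta-distribution after observing one payout."""
    total = payout_probability(meta)
    return {p: p * w / total for p, w in meta.items() if p * w > 0}


for name, meta in (("brown", BROWN), ("green", GREEN)):
    first = payout_probability(meta)
    second = payout_probability(update_on_payout(meta))
    print(name, float(first), float(second))
```

The first-coin probability is 0.45 for both boxes. After one payout it stays at 0.45 for the brown box and jumps to 0.9 for the green one.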

With this framework, we can consider another box—a blue one. It has a fixed payout probability somewhere between 0 and 0.9. I put a random number of plugs in the holes in the spinning disk—leaving between 0 and 90 holes open. I used a noise diode to choose; but you don’t get to see what the odds are. Here the probability-of-probability curve looks rather like this:

This isn’t quite right, because 0.23 and 0.24 are much more likely than 0.235—the plot should look like a comb—but for strategy choice the difference doesn’t matter.

What *is* your optimal strategy in this case?

As with the green box, you ought to spend some coins gathering information about what the odds are. If your estimate of the probability is less than 0.5, when you get confident enough in that estimate, you should stop. If you’re confident enough that it’s more than 0.5, you should continue gambling.
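Here is a sketch of the information-gathering half of that strategy, assuming every hole count from 0 to 90 is equally likely at the start. The observation sequence is hypothetical, and the code tracks your evolving estimate only; it does not decide when you are "confident enough" to stop.

```python
from fractions import Fraction

# Assumption: each number of open holes from 0 to 90 is equally likely a priori.
PRIOR = {Fraction(n, 100): Fraction(1, 91) for n in range(91)}


def update(meta, paid_out: bool):
    """Bayes update of the meta-distribution on one observed spin."""
    weighted = {p: w * (p if paid_out else 1 - p) for p, w in meta.items()}
    total = sum(weighted.values())
    return {p: w / total for p, w in weighted.items() if w > 0}


def payout_probability(meta):
    """Chance the next coin pays out: the mean of the meta-distribution."""
    return sum(p * w for p, w in meta.items())


def prob_favourable(meta):
    """Weight on boxes worth playing (payout probability above 0.5)."""
    return sum(w for p, w in meta.items() if p > Fraction(1, 2))


meta = PRIOR
for paid_out in [True, False, True, True]:  # hypothetical observations
    meta = update(meta, paid_out)
    print(paid_out, float(payout_probability(meta)), float(prob_favourable(meta)))
```

Before any coins go in, the payout probability is 0.45, the same as for the brown and green boxes.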

If you enjoy this sort of thing, you might like to work out what the exact optimal algorithm is.

In the next article in this sequence, we’ll look at some more complicated and interesting cases.

## Further reading

The “meta-probability” approach I’ve taken here is the A_p distribution of E. T. Jaynes. I find it highly intuitive, but it seems to have had almost no influence or application in practice. We’ll see later that it has some problems, which might explain this.

The green and blue boxes are related to “multi-armed bandit problems.” A “one-armed bandit” is a casino slot machine, which has defined odds of payout. A multi-armed bandit is a hypothetical generalization with several arms, each of which may have different, unknown odds. In general, you ought to pull each arm several times, to gain information. The question is: what is the optimal algorithm for deciding which arms to pull how many times, given the payments you have received so far?

If you read the Wikipedia article and follow some links, you’ll find the concepts you need to find the optimal green and blue box strategies. But it might be more fun to try on your own first! The green box is simple. The blue box is harder, but the same general approach applies.

Wikipedia also has an accidental list of formal approaches for problems where ordinary probability theory fails. This is far from complete, but a good starting point for a browser tab explosion.

## Acknowledgements

Thanks to Rin’dzin Pamo, St. Rev., Matt_Simpson, Kaj_Sotala, and Vaniver for helpful comments on drafts. Of course, they may disagree with my analyses, and aren’t responsible for my mistakes!


Ordinary probability theory and expected utility are sufficient to handle this puzzle. You just have to calculate the expected utility of each strategy before choosing a strategy. In this puzzle a strategy is more complicated than simply putting some number of coins in the machine: it requires deciding what to do after each coin either succeeds or fails to succeed in releasing two coins.

In other words, a strategy is a choice of what you’ll do at each point in the game tree—just like a strategy in chess.

We don’t expect to do well at chess if we decide on a course of action that ignores our opponent’s moves. Similarly, we shouldn’t expect to do well in this probabilistic game if we only consider strategies that ignore what the machine does. If we consider *all* strategies, compute their expected utility based on the information we have, and choose the one that maximizes this, we’ll do fine.

I’m saying essentially the same thing Jeremy Salwen said.

So, let me try again to explain why I think this is missing the point… I wrote “a single probability value fails to capture everything you know about an uncertain event.” Maybe “simple” would have been better than “single”?

The point is that you can’t solve this problem without somehow reasoning about probabilities of probabilities. You can solve it by reasoning about the expected value of different strategies. (I said so in the OP; I constructed the example to make this the obviously correct approach.) But those strategies contain reasoning about probabilities within them. So the “outer” probabilities (about strategies) are meta-probabilistic.

[Added:] Evidently, my OP was unclear and failed to communicate, since several people missed the same point in the same way. I’ll think about how to revise it to make it clearer.

The exposition of meta-probability is well done, and shows an interesting way of examining and evaluating scenarios. However, I would take issue with the first section of this article in which you establish single probability (expected utility) calculations as insufficient for the problem, and present meta-probability as the solution.

In particular, you say

I do not believe that this is a failure of applying a single probability to the situation, but merely calculating the probability wrongly, by ignoring future effects of your choice. I think this is most clearly illustrated by scaling the problem down to the case where you are handed a green box, and only two coins. In this simplified problem, we can clearly examine all possible strategies.

Strategy 1 would be to hold on to your two dollar coins. There is a 100% chance of a $2.00 payout

Strategy 2 would be to insert both of your coins into the box. There is a 50.5% chance of a $0.00 payout, 40.5% chance of a $4.00 payout and a 9% chance of a $2.00 payout.

Strategy 3 would be to insert one coin, and then insert the second only if the first pays out. There is a 55% chance of $1.00 payout, a 4.5% chance of a $2.00 payout, and a 40.5% chance of a $4.00 payout.

Strategy 4 would be to insert one coin, and then insert the second only if the first doesn’t pay out. There is a 50.5% chance of a $0.00 payout, a 4.5% chance of a $2.00 payout, and a 45% chance of a $3.00 payout.
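The four outcome distributions above can be checked by brute-force enumeration. This is a minimal sketch, assuming a green box drawn 50/50 and utility linear in dollars; the function and dictionary names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

P_GOOD = Fraction(1, 2)   # chance the green box is the 90% one
P_PAY = Fraction(9, 10)   # payout probability of that box


def outcome_distribution(insert_first, insert_second):
    """Final-dollar distribution for a two-coin strategy.

    insert_first: whether the first coin is inserted at all.
    insert_second: maps the first coin's result (True = paid out) to
    whether the second coin is inserted.
    """
    dist = defaultdict(Fraction)
    if not insert_first:
        dist[2] += 1
        return dict(dist)
    for box_pay, box_weight in ((P_PAY, P_GOOD), (Fraction(0), 1 - P_GOOD)):
        for first_paid in (True, False):
            p1 = box_pay if first_paid else 1 - box_pay
            money = 1 + (2 if first_paid else 0)  # coin kept plus any winnings
            if not insert_second[first_paid]:
                dist[money] += box_weight * p1
                continue
            for second_paid in (True, False):
                p2 = box_pay if second_paid else 1 - box_pay
                final = money - 1 + (2 if second_paid else 0)
                dist[final] += box_weight * p1 * p2
    return {m: p for m, p in dist.items() if p > 0}


def expected_value(dist):
    return sum(m * p for m, p in dist.items())


strategies = {
    "1: keep both": (False, {True: False, False: False}),
    "2: insert both": (True, {True: True, False: True}),
    "3: second only after a payout": (True, {True: True, False: False}),
    "4: second only after no payout": (True, {True: False, False: True}),
}

for name, (first, second) in strategies.items():
    dist = outcome_distribution(first, second)
    print(name, {m: float(p) for m, p in sorted(dist.items())},
          float(expected_value(dist)))
```

Under these assumptions the expected values are $2.00, $1.80, $2.26 and $1.44, so only Strategy 3 beats keeping both coins.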

When put in these terms, it seems quite obvious that your choice to open the box would depend on more than the expected payoff from only the first box, because quite clearly your choice to open the first box pays off (or doesn’t pay off) when opening (or not opening) the other boxes as well. This seems like an error in calculating the payoff matrix rather than a flaw with the technique of single probability values itself. It ignores the fact that opening the first box not only pays you off immediately, but also pays you off in the future by giving you information about the other boxes.

This problem easily succumbs to standard expected value calculations if all actions are considered. The steps remain the same as always:

1. Assign a utility to each dollar amount outcome
2. Calculate the expected utility of all possible strategies
3. Choose the strategy with the highest expected utility
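For more than two coins, enumerating every strategy gets unwieldy, but the same three steps can be carried out by backward induction over the game tree. The sketch below is one way to do that for the green box, under assumptions not spelled out in the post: utility is linear in dollars, and each of the original coins can be inserted at most once (winnings are kept, not re-gambled). The function names are illustrative.

```python
from fractions import Fraction
from functools import lru_cache

P_GOOD = Fraction(1, 2)   # chance the green box is the 90% one
P_PAY = Fraction(9, 10)   # payout probability of that box


def posterior_good(failures):
    """P(90% box | `failures` no-payouts and no payout so far)."""
    weight_good = P_GOOD * (1 - P_PAY) ** failures
    return weight_good / (weight_good + (1 - P_GOOD))


@lru_cache(maxsize=None)
def value(coins, failures):
    """Best expected final dollars from `coins` uninserted coins, after
    `failures` no-payouts and no payout yet."""
    if coins == 0:
        return Fraction(0)
    stop = Fraction(coins)  # keep every remaining coin
    p_pay = posterior_good(failures) * P_PAY
    # A payout reveals the 90% box; each remaining coin is then worth
    # 2 * 0.9 dollars in expectation, so all of them get inserted.
    after_payout = 2 + (coins - 1) * 2 * P_PAY
    gamble = p_pay * after_payout + (1 - p_pay) * value(coins - 1, failures + 1)
    return max(stop, gamble)


def failures_tolerated(coins):
    """Consecutive no-payouts the best strategy accepts before quitting."""
    k = 0
    while coins - k > 0 and value(coins - k, k) > coins - k:
        k += 1
    return k


print(float(value(2, 0)))       # the two-coin game discussed above
print(failures_tolerated(100))  # the hundred-coin game from the post
```

With two coins this reproduces the expected value of Strategy 3; with a different utility function, the `stop` and `after_payout` lines are what would need to change.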

In the case of two coins, we were able to trivially calculate the outcomes of all possible strategies, but in larger instances of the problem, it might be advisable to use shortcuts in the calculations. However, it remains true that the best choice will still be the one you *would* have gotten if you had done out the full expected value calculation.

I think the confusion arises because a lot of the time problems are presented in a way that screens them off from the rest of the world. For example, you are given a box, and it either has $10.00 or $100.00. Once you open the box, the only effect it has on you is the amount of money you got. After you get the money, the box does not matter to the rest of the world. Problems are presented this way so that it is easy to factor out the decisions and calculations you have to make from every other decision you have to make. However, decisions are not necessarily this way (in fact in real life, very few decisions are). In the choice of inserting the first coin or not, this is simply not the case, despite having superficial similarities to standard “box” problems.

Although you clearly understand that the payoffs from the boxes are entangled, you only apply this knowledge in your informal approach to the problem. The failure to consider the full effects of your actions in opening the first box may be psychologically encouraged by the technique of “single probability calculations”, but it is certainly not a failure of the technique itself to capture such situations.

The substantive point here isn’t about EU calculations per se. Running a full analysis of everything that might happen and doing an EU calculation on that basis is fine, and I don’t think the OP disputes this.

The subtlety is about what numerical data can formally represent your full state of knowledge. The claim is that a mere probability of getting the $2 payout does not. It’s the case that on the first use of a box, the probability of the payout given its colour is 0.45 regardless of the colour.

However, if you merely hold onto that probability, then if you put in a coin and so learn something about the boxes you can’t update that probability to figure out what the probability of payout for the second attempt is. You need to go back and also remember whether the box is green or brown. The point of Jaynes and the A_p distribution is that it actually does screen off all other information. If you keep track of it you never need to worry about remembering the colour of the box, or the setup of the experiment. Just this “meta-distribution”.

However, a single probability for each outcome given each strategy *is* all the information needed. The problem is not with using single probabilities to represent knowledge about the world, it’s the straw math that was used to represent the technique. To me, this reasoning is equivalent to the following:

“You work at a store where management is highly disorganized. Although they precisely track the number of days you have worked since the last payday, they never remember when they last paid you, and thus every day of the work week has a 1/5 chance of being a payday. For simplicity’s sake, let’s assume you earn $100 a day.

You wake up on Monday and do the following calculation: If you go in to work, you have a 1/5 chance of being paid. Thus the expected payoff of working today is $20, which is too low for it to be worth it. So you skip work. On Tuesday, you make the same calculation, and decide that it’s not worth it to work again, and so you continue forever.

I visit you and immediately point out that you’re being irrational. After all, a salary of $100 a day clearly is worth it to you, yet you are not working. I look at your calculations, and immediately find the problem: You’re using a single probability to represent your expected payoff from working! I tell you that using a meta-probability distribution fixes this problem, and so you excitedly scrap your previous calculations and set about using a meta-probability distribution instead. We decide that a Gaussian sharply peaked at 0.2 best represents our meta-probability distribution, and I send you on your way.”

Of course, in this case, the meta-probability distribution doesn’t change anything. You still continue skipping work, because I have devised the hypothetical situation to illustrate my point (*evil laugh*). The point is that in this problem the meta-probability distribution solves nothing, because the problem is not with a lack of meta-probability, but rather a lack of considering future consequences.

In both the OP’s example and mine, the problem is that the math was done incorrectly, not that you need meta-probabilities. As you said, meta-probabilities are a method of screening off additional labels on your probability distributions *for a particular class of problems* where you are taking repeated samples that are entangled in a very particular sort of way. As I said above, I appreciate the exposition of meta-probabilities as a tool, and your comment as well has helped me better understand their instrumental nature, but I take issue with what sort of tool they are presented as.

If you do the calculations directly with the probabilities, your calculation will succeed if you do the math right, and fail if you do the math wrong. Meta-probabilities are a particular way of representing a certain calculation that succeed and fail on their own right. If you use them to represent the correct direct probabilities, you will get the right answer, but they are only an aid in the calculation, they *never* fix any problem with direct probability calculations. The fixing of the calculation and the use of probabilities are orthogonal issues.

To make a blunt analogy, this is like someone trying to plug an Ethernet cable into a phone jack, and then saying “when Ethernet fails, wifi works”, conveniently plugging in the wifi adapter correctly.

The key of the dispute in my eyes is not whether wifi can work for certain situations, but whether there’s anything actually wrong with Ethernet in the first place.

So, my observation is that without meta-distributions (or A_p), or conditioning on a pile of past information (and thus tracking /more/ than just a probability distribution over current outcomes), you don’t have the room in your knowledge to be able to even talk about sensitivity to new information coherently. Once you can talk about a complete state of knowledge, you can begin to talk about the utility of long term strategies.

For example, in your example, one would have the same *probability* of being paid today if 20% of employers actually pay you every day, whilst 80% of employers never pay you. But in such an environment, it would not make sense to work a second day in 80% of cases. The optimal strategy depends on what you *know*, and to represent that in general requires more than a straight probability.

There *are* different problems coming from the distinction between choosing a long term policy to follow, and choosing a one shot action. But we can’t even approach this question in general unless we can talk sensibly about a sufficient set of information to keep track of. There are two distinct problems, one prior to the other.

Jaynes does discuss a problem which is closer to your concerns (that of estimating neutron multiplication in a 1-d experiment, 18.15, p. 579). He’s comparing two approaches, which for my purposes differ in their prior A_p distribution.

Jeremy, I think the apparent disagreement here is due to unclarity about what the point of my argument was. The point was *not* that this situation can’t be analyzed with decision theory; it certainly can, and I did so. The point is that different decisions have to be made in two situations where *the probabilities* are the same.

Your discussion seems to equate “probability” with “utility”, and the whole point of the example is that, in this case, they are not the same.

I guess my position is thus:

While there are sets of probabilities which by themselves are not adequate to capture the information about a decision, there always is a set of probabilities which *is* adequate to capture the information about a decision.

In that sense I do not see your article as an argument against using probabilities to represent decision information, but rather a reminder to use the correct set of probabilities.

My understanding of Chapman’s broader point (which may differ wildly from his understanding) is that determining which set of probabilities is correct for a situation can be rather hard, and so it deserves careful and serious study from people who want to think about the world in terms of probabilities.

It may be helpful to read some related posts (linked by lukeprog in a comment on this post): Estimate stability, and Model Stability in Intervention Assessment, which comments on Why We Can’t Take Expected Value Estimates Literally (Even When They’re Unbiased). The first of those motivates the A_p (meta-probability) approach, the second uses it, and the third explains intuitively why it’s important in practice.

Thanks, Jonathan, yes, that’s how I understand it.

Jaynes’ discussion motivates A_p as an efficiency hack that allows you to save memory by forgetting some details. That’s cool, although not the point I’m trying to make here.

A single probability cannot sum up our knowledge.

Before we talk about plans, as you went on to, we must talk about the world as it stands. We know there is a 50% chance of a 0% machine and a 50% chance of a 90% machine. Saying 45% does not encode this information. No other number does either.

Scalar probabilities of binary outcomes are such a useful hammer that we need to stop and remember sometimes that not all uncertainties are nails.

Jeremy, thank you for this. To be clear, I wasn’t suggesting that meta-probability is *the* solution. It’s *a* solution. I chose it because I plan to use this framework in later articles, where it will (I hope) be particularly illuminating.

I don’t think it’s correct to equate probability with expected utility, as you seem to do here. The probability of a payout *is* the same in the two situations. The point of this example is that the probability of a particular event does not determine the optimal strategy. Because utility is dependent on your strategy, that also differs.

Yes, absolutely! I chose a particularly simple problem, in which the correct decision-theoretic analysis is obvious, in order to show that probability does not always determine optimal strategy. In this case, the optimal strategies are clear (except for the exact stopping condition), and clearly different, even though the probabilities are the same.

I’m using this as an introductory wedge example. I’ve opened a Pandora’s Box: probability *by itself* is not a fully adequate account of rationality. Many odd things will leap and creep out of that box so long as we leave it open.

Hmmm. I was equating them as part of the standard technique of calculating the probability of outcomes from your actions, and then from there multiplying by the utilities of the outcomes and summing to find the expected utility of a given action.

I think it’s just a question of what you think the error is in the original calculation. I find the error to be the conflation of “payout” (as in immediate reward from inserting the coin) with “payout” (as in the expected reward from your action including short term and long-term rewards). It seems to me that you are saying that you can’t look at the immediate probability of payout, which I agree with. But you seem to ignore the obvious solution of considering the probability of *total* payout, including considerations about your strategy. In that case, you really do have a single probability representing the likelihood of a single outcome, and you do get the correct answer. So I don’t see where the issue with using a single probability comes from. It seems to me an issue with using the wrong single probability.

And especially troubling is that you seem to agree that using direct probabilities to calculate the single probability of each outcome and then weighing them by desirability will give you the correct answer, but then you say

which may be true, but I don’t think is demonstrated at all by this example.

Thank you for further explaining your thinking.

Yes, I see your point (although I don’t altogether agree). But, again, what I’m doing here is setting up analytical apparatus that will be helpful for more difficult cases later.

In the meantime, the LW posts I pointed to here may motivate more strongly the claim that probability alone is an insufficient guide to action.

I think a much better approach is to assign models to the problem (e.g. “it’s a box that has 100 holes, 45 open and 55 plugged, the machine picks one hole, you get 2 coins if the hole is open and nothing if it’s plugged.”), and then have a probability distribution over models. This is better because it keeps probabilities assigned to facts about the world.

It’s true that probabilities-of-probabilities are just an abstraction of this (when used correctly), but I’ve found that people get confused really fast if you ask them to think in terms of probabilities-of-probabilities. (See every confused discussion of “what’s the standard deviation of the standard deviation?”)

Isn’t Chapman’s approach and your approach completely identical?

As per OP’s graphs, each point on the X axis represents a model, and the height of the blue line is the probability assigned to that model.

Or did you just mean that your way is a better way to phrase it for not confusing everyone?

Right. It’s good for not confusing new people, and sometimes also good for not confusing yourself.

Oh ok.

I misinterpreted because you said “better” (implying a difference), and “abstraction” is not necessarily the same as “identical”.

Suppose we’re using Laplace’s Rule of Succession on a coin. On the zeroth round before we have seen any evidence, we assign probability 0.5 to the first coinflip coming up heads. We also assign marginal probability 0.5 to the second flip coming up heads, the third flip coming up heads, and so on. What distinguishes the Laplace epistemic state from the ‘certainty of a fair coin’ epistemic state is that they represent different probability distributions over *sequences* of coinflips.

Since some probability distributions over events are correlated, we must represent our states of knowledge by assigning probabilities to sequences or sets of events, and our states of knowledge cannot be represented by stating marginal probabilities for all events independently.
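A small sketch of that distinction, using the standard form of Laplace’s Rule of Succession (after h heads in n flips, the next flip is heads with probability (h + 1) / (n + 2)). The function names are illustrative.

```python
from fractions import Fraction


def laplace_sequence_probability(flips):
    """Probability of a heads/tails sequence under Laplace's Rule of Succession."""
    prob, heads = Fraction(1), 0
    for n, is_heads in enumerate(flips):
        p_heads = Fraction(heads + 1, n + 2)  # rule of succession
        prob *= p_heads if is_heads else 1 - p_heads
        heads += is_heads
    return prob


def fair_sequence_probability(flips):
    """Probability of the same sequence if the coin is known to be fair."""
    return Fraction(1, 2) ** len(flips)


def marginal_second_heads(seq_prob):
    return seq_prob((True, True)) + seq_prob((False, True))


def second_heads_given_first_heads(seq_prob):
    return seq_prob((True, True)) / seq_prob((True,))


for name, seq_prob in (("Laplace", laplace_sequence_probability),
                       ("fair coin", fair_sequence_probability)):
    print(name, marginal_second_heads(seq_prob),
          second_heads_given_first_heads(seq_prob))
```

Both states assign marginal probability 1/2 to heads on the second flip, but after seeing heads on the first flip the Laplace state moves to 2/3 while the fair-coin state stays at 1/2.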

We could also try to summarize some features of such epistemic states by talking about the instability of estimates—the degree to which they are easily updated by knowledge of other events—though of course this will be a derived feature of the probability distribution, rather than an ontologically extra feature of probability.

I reject that this is a good reason for probability theorists to panic.

On the meta level I remark that panic represents a failure of reductionist effort; that is, it would be possible to reduce things to simple probabilities by putting in an effort, but there is a temptation to not put in this effort and instead complicate our view of probability. After seeing this reduction work a few dozen times, however, one begins to acquire (by Laplace’s Rule of Succession) some degree of confidence that it can be carried out on the next occasion as well, even if the manner of doing so is not immediately obvious, and a hasty assertion of a fake reduction would not be helpful.

Yes, this is Jaynes’ A_p approach.

I’m not sure I follow this. There is no prior distribution for the per-coin payout probabilities that can accurately reflect all our knowledge.

Yes, it’s clear from comments that my OP was somewhat misleading as to its purpose.

Overall, the sequence intends to discuss cases of uncertainty in which probability theory is the wrong tool for the job, and what to do instead. However, this *particular* article intended only to introduce the idea that one’s confidence in a probability estimate is independent from that estimate, and to develop the A_p (meta-probability) approach to expressing that confidence.

Are we talking about the Laplace vs. fair coins? Are you claiming there’s no prior distribution over *sequences* which reflects our knowledge? If so I think you are wrong as a matter of math.

No. Well, not so long as we’re allowed to take our own actions into account!

I want to emphasize, since many commenters seem to have mistaken me on this, that there’s an *obvious, correct* solution to this problem (which I made explicit in the OP). I deliberately made the problem as simple as possible in order to present the A_p framework clearly.

Not sure what you are asking here, sorry...

Heh! Yes, traditional causal models have structure beyond what is present in the corresponding probability distribution over those models, though this has to do with computing counterfactuals rather than meta-probability or estimate instability. Work continues at MIRI decision theory workshops on the search for ways to turn some of this back into probability, but yes, in my world causal models are things we assign probabilities to, over and beyond probabilities we assign to joint collections of events. They are still models of reality to which a probability is assigned, though. (See Judea Pearl’s “Why I Am Only A Half-Bayesian”.)

I don’t really understand what “being Bayesian about causal models” means. What makes the most sense (e.g. what people typically do) is:

(a) “be Bayesian about statistical models”, and

(b) Use additional assumptions to interpret the output of (a) causally.

(a) makes sense because I understand how evidence helps me select among sets of statistical alternatives.

(b) also makes sense, but then no one will accept your answer without actually verifying the causal model by experiment—because your assumptions linking the statistical model to a causal one may not be true. And this game of verifying these assumptions doesn’t seem like a Bayesian kind of game at all.

I don’t know what it means to use Bayes theorem to select among causal models directly.

It means that you figure out which causal models look more or less like what you observed.

More generally: There’s a language of causal models which, we think, allows us to describe the actual universe, and many other universes besides. Some of these models are simpler than others. Any given sequence of experiences has some probability of being encountered in a given causal universe.

Thanks for writing this up! I’ve been wanting to write something on the A_p distribution since April, but hadn’t gotten around to it. I look forward to your forthcoming posts.

There aren’t many citations of Jaynes on the A_p distribution, but model uncertainty gets discussed a lot, and is modeling the same kind of thing in a Bayesian way.

On the subject of applied rationality being a lot more than probability estimates, see also When Not to Use Probabilities, Explicit and tacit rationality, and… well, The Sequences.

On the Ap distribution and model uncertainty more generally, see also Model Stability in Intervention Assessment, Model Combination and Adjustment, Why We Can’t Take Expected Value Estimates Literally, and The Optimizer’s Curse and How to Beat It.

Luke, thank you for these pointers! I’ve read some of them, and have the rest open in tabs to read soon.

That’s pretty trivial.

The expected payout of putting a coin into a brown box is 0.90.

The expected payout of putting a coin into a green box is 0.90

plus valuable information about what kind of a green box it is. It is a *different payout*.

The term “metaprobability” strikes me as adding confusion. The two layers are *not* the same thing applied to itself, but are in fact *different questions*. “What fraction of the time does this box pay out?” is a different question from “Is this box going to pay out on the next coin?”.

Often it takes a lot of questions to fully describe a situation. Using the term “probability” for all of them hides the distinction.

But it is—you’re answering the question “what is the probability that this box will pay out next time”, and “what is the probability that my probability assignment was correct?”

What does it mean for a probability assignment to be correct, as opposed to well-calibrated? Reality is or is not.

I mostly meant well calibrated, but...

There is something-like-correctness in that, given the evidence available to you, there is a correct way to update your prior. That is strictly not a fact about your posterior, but I think it’s a legitimate thing to talk about in terms of ‘correctness’.

There’s more than one event. If you assign a single probability to winning the first, third, and seventh times and failing the second, fourth, fifth, and sixth times given that you put in seven coins, etc. that captures everything you need to know and does not involve meta-probabilities.

More succinctly, the probability of winning on the second try given that you win on the first try is different depending on the color of the machine.
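
As a rough illustration of the point above, the sketch below treats each box as a mixture over its possible payout rates and computes the probability of winning on the second coin given a win on the first. The 50/50 split between the two green boxes follows the post's setup; the names `BROWN`, `GREEN`, and `p_second_win_given_first` are made up for this example.

```python
# Each box is a list of (weight, payout_probability) pairs.
# Brown is a single known mechanism; green is an equal mix of the
# p=0.9 and p=0 boxes described in the post.
BROWN = [(1.0, 0.45)]
GREEN = [(0.5, 0.9), (0.5, 0.0)]

def p_first_win(box):
    """Probability that the first coin pays out."""
    return sum(w * p for w, p in box)

def p_second_win_given_first(box):
    """P(win on coin 2 | win on coin 1), computed from the joint probability."""
    p_both = sum(w * p * p for w, p in box)
    return p_both / p_first_win(box)

for name, box in [("brown", BROWN), ("green", GREEN)]:
    print(name, round(p_first_win(box), 3), round(p_second_win_given_first(box), 3))
```

Both boxes give 0.45 for the first coin, but the conditional probability is 0.45 for brown and 0.9 for green, which is the difference a single number cannot carry.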

Right: a game where you repeatedly put coins in a machine and decide whether or not to put in another based on what occurred is not a single ‘event’, so you can’t sum up your information about it in just one probability.

Why on earth should we expect the long-term expected value of all future consequences of a choice to be equal to the immediate payoffs? They are two different things. Learning is the most obvious example of when these can be expected to be different. In this case learning information and in other cases learning skills.

The statement “probability estimates are not, by themselves, adequate to make rational decisions” could apparently have been replaced with the statement “my definition of the phrase ‘probability estimates’ is less inclusive than yours”—what you call a “meta-probability” I would have just called a “probability”. In a world where both epistemic and aleatory uncertainty exist, your expectation of events in that world is going to look like a probability distribution over a space of probability distributions over outputs; this is still a probability distribution, just a much more expensive one to do approximate calculations with.

Yes, meta-probabilities are probabilities, although somewhat odd ones; they obey the normal rules of probability. Jaynes discusses this in his Chapter 18; his discussion there is worth a read.

The statement “probability estimates are not, by themselves, adequate to make rational decisions” was meant to describe the entire sequence, not this article.

I’ve revised the first paragraph of the article, since it seems to have misled many readers. I hope the point is clearer now!

I’m looking forward to the rest of your sequence, thanks!

I was recently reading through a month-old blog post where one lousy comment was arguing against a strawman of Bayesian reasoning wherein you deal with probabilities by “mushing them all into a single number”. I immediately recollected that the latest thing I saw on LessWrong was a fantastic summary of how you can treat mixed uncertainty as a probability-distribution-of-probability-distributions. I considered posting a belated link in reply, until I discovered that the lousy comment was written by David Chapman and the fantastic summary was written by David_Chapman.

I’m not sure if later you’re going to go off the rails or change my mind or what, but so far this looks like one of the greatest attempts at “steelmanning” that I’ve ever seen on the internet.

Thanks, that’s *really* funny! “On the other hand” is my general approach to life, so I’m happy to argue with myself.

And yes, I’m steelmanning. I think this approach is an excellent one in some cases; it will break down in others. I’ll present a first one in the next article. It’s another box you can put coins in that (I’ll claim) can’t usefully be modeled in this way.

Here’s the quote from Jaynes, by the way:

Thanks for posting this! :D I’m curious to see where you go next.

It seems odd to me that the mode for the left mixture is to the right of 0. I would have put it at 0, and made that mixture twice as tall so the area underneath would still be the same.

Yup, it’s definitely wrong! I was hoping no one would notice. I thought it would be a distraction to explain why the two are different (if that’s not obvious), and also I didn’t want to figure out exactly what the right math was to feed to my plotting package for this case. (Is the correct form of the curve for the p=0 case obvious to you? It wasn’t obvious to me, but this isn’t my area of expertise...)

I would have left it unexplained in the post, and then explained it in the comments when the first person asked about it. In my experience, casually remarked semi-obvious true facts like that (“why are these two not equally tall?” “Because the area underneath is what matters”) are useful at convincing people of technical ability.

I probably would have gone with the point mass approximation, i.e. a big circle at (0,.5), a line down to (0,0), a line over to (.9,0), and then a line up to a big circle at (.9,.5), then also a line from (.9,0) to (1,0). Using the Gaussian mixtures, though, I’d probably give them the same variance and just give the left one twice the weight of the right one, center them at 0 and .9, and then display only between 0 and 1. Using the pure functional form, that would look something like 2exp(-x^2/v)+exp(-(x-.9)^2/v).
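
To see why doubling the left weight keeps the two areas equal, here is a small numerical check of the functional form above. The width `V = 0.002` is a hypothetical choice (the comment leaves v unspecified), and the midpoint-rule helper `area` is just one convenient way to integrate.

```python
import math

V = 0.002  # hypothetical width parameter v; small enough that the bumps barely overlap

def left_bump(x):
    """Double-weight bump centered at 0; only its right half lies in [0, 1]."""
    return 2 * math.exp(-x**2 / V)

def right_bump(x):
    """Single-weight bump centered at 0.9."""
    return math.exp(-(x - 0.9)**2 / V)

def area(f, lo=0.0, hi=1.0, n=100_000):
    """Midpoint-rule estimate of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

left_area = area(left_bump)
right_area = area(right_bump)
print(left_area / right_area)  # close to 1: each bump holds about half the mass
```

The ratio is not exactly 1 because a sliver of the right bump is also cut off at x = 1; a wider v would make that truncation more noticeable.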

Now, this is assuming we have some sort of Gaussian prior. We could also have a beta prior, which is conjugate to the binomial distribution, which is nice because that fits our testbed. Gaussian might be appropriate because we’ve actually opened the system up and we think the measurement system it uses has Gaussian noise.

I’m not sure I agree with the claim that the variance is the same; you could probably assert that the chance the left one will pay out is 0 to arbitrarily high precision, and it seems likely the variance would depend on the number of plugs filled. That said, this doesn’t have much impact, and saying “we’ll approximate away the meta-meta-probability to simplify this example” seems like it goes against your general point, and is thus inadvisable.

Of course it doesn’t. Who ever said it does? Decisions are made on the basis of expected value, not probability. And your analysis of the first bet ignores the value of the information gained from it in executing your options for further play thereafter.

I think you’re just fundamentally confusing the probability of a win on the first coin with the expected long run frequency of wins for the different boxes. Entirely different things.

This statement indicates a lack of understanding of Jaynes, or at least an adherence to his foundations. Probability is *assigned* by an agent based on information; there is no value that the probability *is* besides what the agent assigns.

Jaynes specifically analyzes coin flipping, correctly asserting that the probability of the outcome of a coin flip will depend on your knowledge of the relation of the initial states of the coin, the force applied to it, and their relation to the outcome. He even describes a method of controlling the outcome, and I believe shared his own data on executing that method, showing how the *frequency* of heads/tails could be made to deviate appreciably from 0.5.

Having said that, I’ve always found Jaynes’s “inner robot” interesting, and have the feeling the idea has real potential.

Yes, that’s the point here!

By “the first bet” I take it that you mean “your first opportunity to put a coin in a green box” (rather than meaning “brown box”).

My analysis of that was “you should put some coins in the box”, exactly because of the information gain.

This post was based closely on Chapter 18 of Jaynes’ book, where he writes:

Do you think he’s saying something different from me here?

I don’t like your use of the word “probability”. Sometimes, you use it to describe subjective probabilities, but sometimes you use it to describe the frequency properties of putting a coin in a given box.

When you say,

“The brown box has 45 holes open, so it has probability p=0.45 of returning two coins.”

you are really saying that knowing that I have the brown box in front of me, and I put a coin in it, I would assign a 0.45 probability of that coin yielding 2 coins. And, as far as I know, the coin tosses are all independent: no amount of coin tossing would ever tell me anything about the next coin toss. Simply put, a box, along with the way we toss coins in it, has rather definite *frequency properties*.

Then you talk about

“assigning probabilities to each possible probability between 0 and 1”.

What you really wanted to say is *assigning a probability distribution over the possible frequency properties*.

I know it sounds pedantic, but I cringe every time someone talks about “probabilities” being some properties of a real object out there in the territory (like amplitudes in QM). Probability is in the mind. Using the word any other way is confusing.

So perhaps this is for the next post, but are these ‘metaprobabilities’ just regular hyperparameters?

I was wondering this too. I haven’t looked at this A_p distribution yet (nor have I read all the comments here), but having distributions over distributions is, like, the core of Bayesian methods in machine learning. You don’t just keep a single estimate of the probability; you keep a distribution over possible probabilities, exactly like David is saying. I don’t even know how updating your probability distribution in light of new evidence (aka a “Bayesian update”) would work without this.

Am I missing something about David’s post? I did go through it rather quickly.

I’m sure you know more about this than I do! Based on a quick Wiki check, I suspect that formally the A_p are one type of hyperprior, but not all hyperpriors are A_p (a/k/a metaprobabilities).

Hyperparameters are used in Bayesian sensitivity analysis, a/k/a “Robust Bayesian Analysis”, which I recently accidentally reinvented here. I might write more about that later in this sequence.

When you use an underscore in a name, make sure to escape it first, like so:

(This is necessary because underscores are yet another way to make things italic, and only applies to comments, as posts use different formatting.)

Thanks! Fixed.

Yeah, from what I’ve seen, something mathematically equivalent to A_p distributions is commonly used, but that’s not what they’re called.

Like, I think you might call the case in this problem “a Bernoulli random variable with an unknown parameter”. (The Bernoulli random variable being 1 if it gives you $2, 0 if it gives you $0). And then the hyperprior would be the probability distribution of that parameter, I guess? I haven’t really heard that word before.

ET Jaynes, of course, would never talk like this because the idea of a random quantity existing in the real world is a mind projection fallacy. Thus, no “random variables”. So he uses the A_p distribution as a way of thinking about the same math without the idea of randomness. Jaynes’s A_p in this case corresponds exactly to the more traditional “the parameter of the Bernoulli random variable is p”.

(btw I have a purely mathematical question about the A_p distribution chapter, which I posted to the open thread: http://lesswrong.com/lw/ii6/open_thread_september_28_2013/9pbn if you know the answer I’d really appreciate it if you told me)

I guess this is a joke. From wikipedia: “Originally considered by Allied scientists in World War II, it proved so intractable that, according to Peter Whittle, it was proposed the problem be dropped over Germany so that German scientists could also waste their time on it.[10]” (note that your wikipedia-link is broken)

Thank you very much—link fixed!

That’s a really funny quote!

Multi-armed bandit problems were intractable during WWII probably mainly because computers weren’t available yet. In many cases, the best approach is brute force simulation. That’s the way I would approach the “blue box” problem (because I’m lazy).

But exact approaches have also been found: “Burnetas AN and Katehakis MN (1996) also provided an explicit solution for the important case in which the distributions of outcomes follow arbitrary (i.e., nonparametric) discrete, univariate distributions.” The blue box problem is within that class.

Yeah, but that was 60 years ago, and the single-armed bandit problem is easier than the multi-armed bandit.

See Judea Pearl’s Probabilistic Reasoning in Intelligent Systems, section 7.3, for a discussion of “metaprobabilities” in the context of graphical models.

Although it’s true that you could compute the correct decision by directly putting a distribution on all possible futures, the computational complexity of this strategy grows combinatorially as the scenario gets longer. This isn’t a minor point; generalizing the brute force method gets you AIXI. That is why you need something like the A_p distribution or Pearl’s “contingencies” to store evidence and reason efficiently.

I don’t see how this differs from how anyone else ever handles this problem. I hope you explain the difference in this example, before going on to other examples.

Can you point me at some other similar treatments of the same problem? Thanks!

I ask you for a different treatment, so you ask me for a similar treatment?

No, I don’t see the point. Doesn’t my request make sense, regardless of whether we agree on what is similar or different?

FWIW, I understood David to be requesting some specific examples of how members of the set “everyone else ever” handle this problem, which on your account is the same as how Jaynes handles it, in order to more clearly see the similarity you reference.

Thanks, yes! I.e. who is this “everyone else,” and where do they treat it the same way Jaynes does? I’m not aware of any examples, but I have only a basic knowledge of probability theory.

It’s certainly possible that this approach is common, but Jaynes wasn’t ignorant, and he seemed to think it was a new and unusual and maybe controversial idea, so I kind of doubt it.

Also, I should say that I have no dog in this fight at all; I’m not advocating “Jaynes is the greatest thing since sliced bread”, for example. (Although that does seem to be the opinion of some LW writers.)

I really liked the article. So allow me to miss the forest for a moment; I want to chop down this tree:

Let’s solve the green box problem:

Try zero coins: EV: 100 coins.

Try one coin, give up if no payout: 45% of 180.2 + 55% of 99= c. 135.5 (I hope.)

(I think this is right, but welcome corrections; 90%x50%x178, +.2 for first coin winning (EV of that 2 not 1.8), + keeper coins. I definitely got this wrong the first time I wrote it out, so I’m less confident I got it right this time. Edit before posting: Not just once.)

Try two coins, give up if no payout:

45% of 180.2 (pays off first time) 4.5% of 178.2 (second time)

50.5% of 98. Total: c.138.6

I used to be quite good at things like this. I also used to watch Hill Street Blues. I make the third round very close:

45% of 180.2 4.5% of 178.2 .45% of 176.2

50.05% of 97

Or c. 138.45.

So, I pick two as the answer.
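
The arithmetic above can be checked with a short exact calculation. This is only a sketch under the same assumptions the comment appears to make: the green box is the good one with probability 0.5, you stop after k losses if no payout has appeared, and once a payout appears you feed in all of the remaining original coins (winnings are not re-gambled). The function name `ev_stop_after` is made up for this example.

```python
P_GOOD = 0.5   # chance the green box is the p=0.9 one
P_WIN = 0.9    # payout probability of the good box
COINS = 100    # coins you start with

def ev_stop_after(k):
    """Expected final coins if you quit after k straight losses from the start."""
    total = 0.0
    for j in range(1, k + 1):
        # The first payout arrives on coin j; only the good box can do this.
        p_first_win_at_j = P_GOOD * (1 - P_WIN) ** (j - 1) * P_WIN
        # That coin returns 2; each of the remaining COINS - j coins
        # returns 2 * P_WIN on average once you know the box is good.
        payoff = 2 + 2 * P_WIN * (COINS - j)
        total += p_first_win_at_j * payoff
    # No payout in k coins: either the bad box, or the good box losing k times.
    p_no_win = (1 - P_GOOD) + P_GOOD * (1 - P_WIN) ** k
    return total + p_no_win * (COINS - k)

for k in range(5):
    print(k, round(ev_stop_after(k), 2))
```

Under these assumptions the values for k = 1, 2, 3 come out at roughly 135.5, 138.6, and 138.5, so stopping after two losses is still the best of the strategies considered.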

Quibble with the sportsball graph:

You have little confidence, for sure, but chance of winning doesn’t follow that graph, and there’s just no reason it should. If the Piggers are playing the Oatmeals, and you know nothing about them, I’d guess at the junior high level the curve would be fairly flat, but not that flat. If they are professional sportsballers of the Elite Sportsballers League, the curve is going to have a higher peak at 50; the Hyperboles are not going to be 100% to lose or win to the Breakfast Cerealers in higher level play. At the junior high level, there will be some c. 100%ers, but I think the flatline is unlikely, and I think the impression that it should be a flat line is mistaken.

Once again, I liked the article. It was engaging and interesting. (And I hope I got the problem right.)

Glad you liked it!

I also get “stop after two losses,” although my numbers come out slightly differently. However, I suck at this sort of problem, so it’s quite likely I’ve got it wrong.

My temptation would be to solve it numerically (by brute force), i.e. code up a simulation and run it a million times and get the answer by seeing which strategy does best. Often that’s the right approach. However, sometimes you can’t simulate, and an analytical (exact, *a priori*) answer is better.

I think you are right about the sportsball case! I’ve updated my meta-meta-probability curve accordingly :-)

Can you think of a better example, in which the curve ought to be dead flat?

Jaynes uses “the probability that there was once life on Mars” in his discussion of this. I’m not sure that’s such a great example either.

The wikipedia article on the Beta distribution has a good discussion of possible priors to use. The Jeffreys prior is probably the one I’d use for Sportsball, but the Bayes-Laplace prior is generally acceptable as a representation of ignorance.

The example I like to give is the uncertain digital coin: I generate some double *p* between 0 and 1 using a random number generator, and then write a function “flip” which generates another double, and compares it to *p*. This is analogous to your blue box, and being confident in the RNG means you have a tight meta-meta-probability curve, which justifies the uniform prior.

Yeah, that seems like a good candidate for the Haldane prior to me.
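
A minimal sketch of the uncertain digital coin described in this exchange, assuming Python's standard `random` module as the RNG. The helper name `make_coin` and the choice of 1000 flips are illustrative only; the posterior shown uses the uniform (Bayes-Laplace) prior mentioned above, not the Haldane prior.

```python
import random

def make_coin(rng):
    """Draw a hidden bias p uniformly from [0, 1) and return it with a flip function."""
    p = rng.random()
    def flip():
        # Heads when a fresh uniform draw falls below the hidden p.
        return rng.random() < p
    return p, flip

rng = random.Random(0)   # seeded only so the sketch is repeatable
p, flip = make_coin(rng)
flips = [flip() for _ in range(1000)]
heads = sum(flips)

# With a uniform Beta(1, 1) prior, the posterior is Beta(1 + heads, 1 + tails),
# whose mean is (heads + 1) / (n + 2).
posterior_mean = (heads + 1) / (len(flips) + 2)
print(p, posterior_mean)
```

After enough flips the posterior mean should sit close to the hidden `p`, which is the sense in which the flat prior gets washed out by data.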

178.2 should be 178.4 (180.2 − 1.8) and 176.2 should be 176.6 (178.4 − 1.8)

This doesn’t change the result, though:

After 2 failed tries, even if you do have the good box, the most your net gain relative to standing pat can be is 98 additional coins.

But, the odds ratio of good box to bad box after 2 failed coins is 1:100 or less than 1% probability of good box.

So your expected gain from entering the third coin is upper bounded by (98 x 0.01) - (1 x 0.99) which is less than 0.
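
The odds argument above is easy to reproduce. The sketch below assumes the post's 50/50 prior over the two green boxes and uses a hypothetical helper, `p_good_after_losses`; the bound is the one stated in the comment, evaluated with the exact posterior instead of the rounded 1%.

```python
def p_good_after_losses(n, prior_good=0.5, p_win=0.9):
    """Posterior probability of the good green box after n straight losses."""
    good = prior_good * (1 - p_win) ** n   # good box loses n times in a row
    bad = (1 - prior_good) * 1.0           # bad box always loses
    return good / (good + bad)

p_good = p_good_after_losses(2)
print(p_good)   # 1/101, just under 1%

# Upper bound from the comment: at most 98 coins gained if the box is good,
# one coin lost if it is bad.
bound = 98 * p_good - 1 * (1 - p_good)
print(bound)    # negative, so the third coin is not worth inserting
```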

The answer I got also was to give up after putting in two coins and losing both times (assuming risk neutrality), if you get a green box.
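
For comparison, the brute-force simulation approach suggested earlier in the thread might look like the sketch below. It assumes the same stopping rule (quit after k straight losses unless a payout has been seen, then play every remaining original coin); `play`, `simulate`, the trial count, and the seed are all arbitrary choices for illustration.

```python
import random

def play(rng, k, coins=100):
    """One game: quit after k straight losses unless a payout has been seen."""
    p = 0.9 if rng.random() < 0.5 else 0.0   # which green box was drawn
    kept, seen_win = coins, False
    for i in range(coins):
        if not seen_win and i >= k:
            break            # k coins in and no payout yet: stop gambling
        kept -= 1            # insert one coin
        if rng.random() < p:
            kept += 2        # the box pays out two coins
            seen_win = True
    return kept

def simulate(k, trials=20_000, seed=0):
    """Average final coin count over many simulated games."""
    rng = random.Random(seed)
    return sum(play(rng, k) for _ in range(trials)) / trials

for k in (0, 1, 2, 3):
    print(k, simulate(k))
```

With this many trials the estimates are only good to a fraction of a coin. The figures quoted earlier in the thread for stopping after two versus three losses (about 138.6 versus 138.45) differ by less than that, so separating those two strategies by simulation alone would need far more trials.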

Your link to Ap is broken :( Overall, this was really interesting and understandable. Thank you.

Glad you liked the post! Thanks for pointing out the link problem. I’ve fixed it, for now. It links to a PDF of a file that’s found in many places on the internet, but any one of them might be taken down at any time.

Then why use it instead of learning the standard terms and using those? This might sound pedantic, but it matters because this kind of thing leads to proliferation of unnecessary jargon and sometimes reinventing the wheel.

Are we talking about conditional probability? Joint probability?

Also, a minor nitpick about your next-to-last figure: given what’s said about the boxes, it’s not two bell curves centered at 0 and 0.9. It should be a point mass (vertical line) at 0 and a bell curve centered at 0.9.

The standard term is A_p, which seemed unnecessarily obscure.

Re the figure, see the discussion here.

(Sorry to be slow to reply to this; I got busy and didn’t check my LW inbox for more than a month.)

Agree with John Baez, Jeremy Salwen and others. Standard tools are enough to solve this problem. You don’t need probabilities over probabilities, just probabilities over states of the world, and probabilities over what might happen in each state of the world.

Has anyone used meta-probabilities, or something similar, to analyze the Pascal Mugger problem?

We can do it now! :)

What sort of problem is one where meta-probabilities are useful? One where you get different chance payouts depending on different models of the problem (e.g. one brown box vs. the good green box), and so you want to tell those models apart.

Or if we want meta-meta probabilities, then we could have different classes of models that you can tell apart (boxes or spheres?), and then different models that you have to tell apart (good box or bad box?), and then different outcomes that happen probabilistically (coins or no coins?).

But the key idea is that we gain something by differentiating these different known ways that the problem could be.

So in the case of someone who says “give me 5$ and I’ll get you into heaven when you die,” what are the layers? Well, they could be a charlatan or not. If they’re not a charlatan, then we can assume for the sake of argument that you’ll get into heaven with certainty, so no meta-probability there. But if they are a charlatan, then there’s some probability you’d get into heaven anyhow, so the probability of “they are a charlatan” is equivalent to a meta-probability for getting into heaven.

Okay, so: What experiment can you do that will let you change your mind about Pascal’s Mugger? Or to put it another way, how can someone convince you even a little that they are not a charlatan? What is the analogy between this and the boxes in the original post?