# What is Bayesianism?

You’ve probably seen the word ‘Bayesian’ used a lot on this site, but may be a bit uncertain of what exactly we mean by that. You may have read the intuitive explanation, but that only seems to explain a certain math formula. There’s a wiki entry about “Bayesian”, but that doesn’t help much. And the LW usage seems different from just the “Bayesian and frequentist statistics” thing, too. As far as I can tell, there’s no article explicitly defining what’s meant by Bayesianism. The core ideas are sprinkled across a large number of posts, and ‘Bayesian’ has its own tag, but there’s not a single post that explicitly comes out to make the connections and say “this is Bayesianism”. So let me try to offer my definition, which boils Bayesianism down to three core tenets.

We’ll start with a brief example, illustrating Bayes’ theorem. Suppose you are a doctor, and a patient comes to you, complaining about a headache. Further suppose that there are two reasons why people get headaches: they might have a brain tumor, or they might have a cold. A brain tumor always causes a headache, but exceedingly few people have a brain tumor. In contrast, a headache is rarely a symptom of a cold, but most people manage to catch a cold every single year. Given no other information, do you think it more likely that the headache is caused by a tumor, or by a cold?

If you thought a cold was more likely, well, that was the answer I was after. Even if a brain tumor caused a headache every time, and a cold caused a headache only one per cent of the time (say), having a cold is so much more common that it’s going to cause a lot more headaches than brain tumors do. Bayes’ theorem, basically, says that if cause A might be the reason for symptom X, then we have to take into account both the probability that A caused X (found, roughly, by multiplying the frequency of A with the chance that A causes X) and the probability that anything else caused X. (For a thorough mathematical treatment of Bayes’ theorem, see Eliezer’s Intuitive Explanation.)
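To make the arithmetic concrete, here is a minimal Python sketch of the headache example. The 100% and 1% figures come from the text above; the base rates (`P_TUMOR`, `P_COLD`) are made-up numbers chosen purely for illustration, and the sketch assumes, as the example does, that tumors and colds are the only two causes of headaches.

```python
# Headache example: which cause is more probable, given a headache?
# The likelihoods are from the text; the base rates are illustrative assumptions.
P_TUMOR = 0.0001               # assumed: 1 person in 10,000 has a brain tumor
P_COLD = 0.2                   # assumed: 1 person in 5 currently has a cold
P_HEADACHE_GIVEN_TUMOR = 1.0   # "a brain tumor always causes a headache"
P_HEADACHE_GIVEN_COLD = 0.01   # "only one per cent of the time"


def posterior_over_causes(priors, likelihoods):
    """Return P(cause | symptom) for each cause, assuming the listed causes
    are mutually exclusive and are the only things that produce the symptom."""
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(joint.values())   # P(symptom), summed over all causes
    return {c: joint[c] / total for c in joint}


posterior = posterior_over_causes(
    {"tumor": P_TUMOR, "cold": P_COLD},
    {"tumor": P_HEADACHE_GIVEN_TUMOR, "cold": P_HEADACHE_GIVEN_COLD},
)
for cause, p in posterior.items():
    print(f"P({cause} | headache) = {p:.3f}")
```

With these assumed base rates the cold comes out twenty times more probable than the tumor, even though a tumor is a hundred times more likely to produce a headache when it is present.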

There should be nothing surprising about that, of course. Suppose you’re outside, and you see a person running. They might be running for the sake of exercise, or they might be running because they’re in a hurry somewhere, or they might even be running because it’s cold and they want to stay warm. To figure out which one is the case, you’ll try to consider which of the explanations is true most often, and fits the circumstances best.

Core tenet 1: Any given observation has many different possible causes.

Acknowledging this, however, leads to a somewhat less intuitive realization. For any given observation, how you should interpret it always depends on previous information. Simply seeing that the person was running wasn’t enough to tell you that they were in a hurry, or that they were getting some exercise. Or suppose you had to choose between two competing scientific theories about the motion of planets. A theory about the laws of physics governing the motion of planets, devised by Sir Isaac Newton, or a theory simply stating that the Flying Spaghetti Monster pushes the planets forwards with His Noodly Appendage. If both of these theories made the same predictions, you’d have to depend on your prior knowledge (your prior, for short) to judge which one was more likely. And even if they didn’t make the same predictions, you’d need some prior knowledge that told you which of the predictions were better, or that the predictions matter in the first place (as opposed to, say, theoretical elegance).

Or take the debate we had on 9/11 conspiracy theories. Some people thought that unexplained and otherwise suspicious things in the official account had to mean that it was a government conspiracy. Others considered their prior for “the government is ready to conduct massively risky operations that kill thousands of its own citizens as a publicity stunt”, judged that to be overwhelmingly unlikely, and thought it far more probable that something else caused the suspicious things.

Again, this might seem obvious. But there are many well-known instances in which people forget to apply this information. Take supernatural phenomena: yes, if there were spirits or gods influencing our world, some of the things people experience would certainly be the kinds of things that supernatural beings cause. But then there are also countless mundane explanations, from coincidences to mental disorders to an overactive imagination, that could cause them to be perceived. Most of the time, postulating a supernatural explanation shouldn’t even occur to you, because the mundane causes already have lots of evidence in their favor and supernatural causes have none.

Core tenet 2: How we interpret any event, and the new information we get from anything, depends on information we already had.

Sub-tenet 1: If you experience something that you think could only be caused by cause A, ask yourself “if this cause didn’t exist, would I regardless expect to experience this with equal probability?” If the answer is “yes”, then it probably wasn’t cause A.

This realization, in turn, leads us to

Core tenet 3: We can use the concept of probability to measure our subjective belief in something. Furthermore, we can apply the mathematical laws regarding probability to choosing between different beliefs. If we want our beliefs to be correct, we must do so.

The fact that anything can be caused by an infinite number of things explains why Bayesians are so strict about the theories they’ll endorse. It isn’t enough that a theory explains a phenomenon; if it can explain too many things, it isn’t a good theory. Remember that if you’d expect to experience something even when your supposed cause was untrue, then that’s no evidence for your cause. Likewise, if a theory can explain anything you see, if it allows any possible event, then nothing you see can be evidence for the theory.
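The point about evidence can be checked with a few lines of Python. This is only a sketch: the function name `update` and all of the probabilities are invented for illustration. It applies Bayes’ theorem to a single theory and shows that an observation you would expect equally often whether or not the theory is true leaves your belief exactly where it started.

```python
def update(prior, p_obs_if_true, p_obs_if_false):
    """Posterior probability of a theory after one observation (Bayes' theorem)."""
    weight_true = p_obs_if_true * prior
    weight_false = p_obs_if_false * (1.0 - prior)
    return weight_true / (weight_true + weight_false)


prior = 0.3  # illustrative starting belief in the theory

# The observation is equally expected whether or not the theory is true:
unmoved = update(prior, p_obs_if_true=0.8, p_obs_if_false=0.8)

# The observation is much more expected if the theory is true:
raised = update(prior, p_obs_if_true=0.8, p_obs_if_false=0.2)

print(f"no-evidence case: {unmoved:.3f}")
print(f"real-evidence case: {raised:.3f}")
```

Only the second observation moves the belief, because only there does the theory stick its neck out by making the observation more likely than its negation does.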

At its heart, Bayesianism isn’t anything more complex than this: a mindset that takes three core tenets fully into account. Add a sprinkle of idealism: a perfect Bayesian is someone who processes all information perfectly, and always arrives at the best conclusions that can be drawn from the data. When we talk about Bayesianism, that’s the ideal we aim for.

Fully internalized, that mindset does tend to color your thought in its own, peculiar way. Once you realize that all the beliefs you have today are based—in a mechanistic, lawful fashion—on the beliefs you had yesterday, which were based on the beliefs you had last year, which were based on the beliefs you had as a child, which were based on the assumptions about the world that were embedded in your brain while you were growing in your mother’s womb… it does make you question your beliefs more. Wonder about whether all of those previous beliefs really corresponded maximally to reality.

And that’s basically what this site is for: to help us become good Bayesians.

• Is there a simple explanation of the conflict between Bayesianism and frequentism? I have sort of a feel for it from reading background materials, but a specific example where they yield different predictions would be awesome. Has such an example already been posted?

• Eliezer’s views as expressed in Blueberry’s links touch on a key identifying characteristic of frequentism: the tendency to think of probabilities as inherent properties of objects. More concretely, a pure frequentist (a being as rare as a pure Bayesian) treats probabilities as proper only to outcomes of a repeatable random experiment. (The definition of such a thing is pretty tricky, of course.)

What does that mean for frequentist statistical inference? Well, it’s forbidden to assign probabilities to anything that is deterministic in your model of reality. So you have estimators, which are functions of the random data and thus random themselves, and you assess how good they are for your purpose by looking at their sampling distributions. You have confidence interval procedures, the endpoints of which are random variables, and you assess the sampling probability that the interval contains the true value of the parameter (and the width of the interval, to avoid pathological intervals that have nothing to do with the data). You have statistical hypothesis testing, which categorizes a simple hypothesis as “rejected” or “not rejected” based on a procedure assessed in terms of the sampling probability of an error in the categorization. You have, basically, anything you can come up with, provided you justify it in terms of its sampling properties over infinitely repeated random experiments.
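The idea of judging a procedure by its sampling properties can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not anything from the comment itself: it fixes a “true” mean, assumes normally distributed data with known standard deviation, repeats the experiment many times, and counts how often the textbook 95% interval contains the true value. All constants and function names are hypothetical.

```python
import random
import statistics

TRUE_MEAN = 10.0   # fixed, non-random parameter (assumed for the simulation)
SIGMA = 2.0        # known standard deviation (assumed)
N = 30             # observations per experiment
Z_95 = 1.96        # standard normal quantile for a two-sided 95% interval


def confidence_interval(sample):
    """Interval for the mean with known sigma: mean +/- z * sigma / sqrt(n)."""
    centre = statistics.fmean(sample)
    half_width = Z_95 * SIGMA / len(sample) ** 0.5
    return centre - half_width, centre + half_width


def coverage(trials, rng):
    """Fraction of repeated experiments whose interval contains TRUE_MEAN."""
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
        low, high = confidence_interval(sample)
        hits += low <= TRUE_MEAN <= high
    return hits / trials


rate = coverage(trials=4000, rng=random.Random(0))
print(f"coverage over repeated experiments: {rate:.3f}")
```

Note what is random here: the interval endpoints vary from experiment to experiment while the parameter stays put, so the roughly 95% figure describes the procedure, not any single interval.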

• Here is a more general definition of “pure frequentism” (which includes frequentists such as Reichenbach):

Consider an assertion of probability of the form “This X has probability p of being a Y.” A frequentist holds that this assertion is meaningful only if the following conditions are met:

1. The speaker has already specified a determinate set X of things that actually have or will exist, and this set contains “this X”.

2. The speaker has already specified a determinate set Y containing all things that have been or will be Ys.

The assertion is true if the proportion of elements of X that are also in Y is precisely p.
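For finite sets, this truth condition is just a ratio of counts. The following sketch uses invented sets (integers and multiples of 4) and a hypothetical helper name, purely to show that the same Y yields a different “probability” once the reference class X changes.

```python
from fractions import Fraction


def frequentist_probability(reference_class, property_class):
    """Proportion of the reference class X whose members are also in Y."""
    reference_class = set(reference_class)
    in_both = reference_class & set(property_class)
    return Fraction(len(in_both), len(reference_class))


# Hypothetical example: Y is the set of multiples of 4 up to 100.
Y = {n for n in range(1, 101) if n % 4 == 0}

# "This number" taken as a member of X = {1, ..., 20}:
p_wide = frequentist_probability(range(1, 21), Y)
print(p_wide)     # 1/4

# The same number taken as a member of X = {1, ..., 10}:
p_narrow = frequentist_probability(range(1, 11), Y)
print(p_narrow)   # 1/5
```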

A few remarks:

1. The assertion would mean something different if the speaker had specified different sets X and Y, even though X and Y aren’t mentioned explicitly in the assertion.

2. If no such sets had been specified in the preceding discourse, the assertion by itself would be meaningless.

3. However, the speaker has complete freedom in what to take as the set X containing “this X”, so long as X does contain “this X”. In particular, the other elements don’t have to be exactly like “this X”, or be generated by exactly the same repeatable procedure, or anything like that. There are practical constraints on X, though. For example, X should be an interesting set.

4. [ETA:] An important distinction between Bayesianism and Frequentism is this: Note that, according to the above, the correct probability has nothing to do with the state of knowledge of the speaker. Once the sets X and Y are determined, there is an objective fact of the matter regarding the proportion of things in X that are also in Y. The speaker is objectively right or wrong in asserting that this proportion is p, and that rightness or wrongness had nothing to do with what the speaker knew. It had only to do with the objective frequency of elements of Y among the elements of X.

• I’m sorry to see such wrongheaded views of frequentism here. Frequentists also assign probabilities to events where the probabilistic introduction is entirely based on limited information rather than a literal randomly generated phenomenon. If people purporting to understand frequentist/Bayesian issues ever actually read Fisher or Neyman, they’d have a radically different idea. Readers of this blog should take it upon themselves to check out some of the vast oversimplifications… And I’m sorry, but Reichenbach’s frequentism has very little to do with frequentist statistics. Reichenbach, a philosopher, had an idea that propositions had frequentist probabilities. So scientific hypotheses, which would not be assigned probabilities by frequentist statisticians, could have frequentist probabilities for Reichenbach, even though he didn’t think we knew enough yet to judge them. He thought that at some point we’d be able to judge, for a hypothesis of a given type, how frequently hypotheses like it would be true. I think it’s a problematic idea, but my point was just to illustrate that some large items are being misrepresented here, and people sold a wrongheaded view. Just in case anyone cares. Sorry to interrupt the conversation (errorstatistics.com)

• What does that mean for frequentist statistical inference? Well, it’s forbidden to assign probabilities to anything that is deterministic in your model of reality.

Wait—Bayesians can assign probabilities to things that are deterministic? What does that mean?

What would a Bayesian do instead of a T-test?

• Wait—Bayesians can assign probabilities to things that are deterministic? What does that mean?

Absolutely!

The Bayesian philosophy is that probabilities are about states of knowledge. Probability is reasoning with incomplete information, not about whether an event is “deterministic”, as probabilities do still make sense in a completely deterministic universe. In a poker game, there are almost surely no quantum events influencing how the deck is shuffled. Classical mechanics, which is deterministic, suffices to predict the ordering of cards. Even so, we have neither sufficient initial conditions (on all the particles in the dealer’s body and brain, and any incoming signals), nor computational power to calculate the ordering of the cards. In this case, we can still use probability theory to figure out probabilities of various hand combinations that we can use to guide our betting. Incorporating knowledge of what cards I’ve been dealt, and what (if any) are public, is straightforward. Incorporating players’ actions and reactions is much harder, and not really well enough defined that there is a mathematically correct answer, but clearly we should use that knowledge in determining what types of hands we think it likely for our opponents to have. If we count as the dealer shuffles, and see he only shuffled three or four times, in principle we can (given a reasonable mathematical model of shuffling, such as the one Diaconis constructed to give the result that 7 shuffles are needed to randomize a deck) use the correlations left in there to give us even more clues about opponents’ likely hands.
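As a small illustration of the “straightforward” part, here is a sketch of how knowing your own cards changes the probability you assign to an opponent’s hand, even though the deck order is already fixed. The scenario (you hold an ace and a king, and ask whether one particular opponent holds two aces) and the function name are hypothetical; the counting itself is ordinary combinatorics.

```python
from fractions import Fraction
from math import comb


def p_pocket_aces(aces_unseen, cards_unseen):
    """Probability that two cards drawn from the unseen cards are both aces."""
    return Fraction(comb(aces_unseen, 2), comb(cards_unseen, 2))


# Before looking at my own cards: 4 aces among 52 unseen cards.
before_looking = p_pocket_aces(aces_unseen=4, cards_unseen=52)

# After seeing that I hold an ace and a king: 3 aces among 50 unseen cards.
after_seeing_my_ace_king = p_pocket_aces(aces_unseen=3, cards_unseen=50)

print(before_looking, after_seeing_my_ace_king)   # 1/221 3/1225
```

Nothing about the deck changed between the two calculations; only the information available to the person doing the calculating did.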

What would a Bayesian do instead of a T-test?

In most cases we’d step back, and ask what you were trying to do, such that a T-test seemed like a good idea.

For those unaware, a t-test calculates a p-value for the null hypothesis: roughly, how probable data at least as extreme as what was observed would be if that model were true. If the data are even moderately compatible with the null, Frequentists say “we can’t reject it”. If the data are terribly unlikely under it, Frequentists say that it can be rejected, that is, that it’s worth looking at another model.

From a Bayesian perspective, this is somewhat backwards: we don’t really care how likely the data is given this model, P(D|M); after all, we actually got the data. We effectively want to know how useful the model is, now that we know this data. Some simple consistency requirements and scaling constraints mean that this usefulness has to act just like a probability. So let’s just call it the probability of the model, given the data: P(M|D). A small bit of algebra gives us that P(M|D) = P(D|M) * P(M) / P(D), where P(D) is the sum over all models i of P(D|M_i) * P(M_i), and P(M_i) is some “prior probability” of each model: how useful we think that model would be, even without any data collected (but, importantly, with some background knowledge).
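A minimal sketch of this formula over a small, discrete set of models might look as follows. The three coin-bias models, the uniform prior, and the data (8 heads, 2 tails) are all invented for illustration; only the formula itself comes from the comment.

```python
def posterior_over_models(priors, likelihoods):
    """P(M_i | D) = P(D | M_i) * P(M_i) / sum_j P(D | M_j) * P(M_j)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    p_data = sum(joint)              # P(D), summed over the model set
    return [j / p_data for j in joint]


# Three hypothetical models of a coin: P(heads) is 0.3, 0.5 or 0.7.
biases = [0.3, 0.5, 0.7]
priors = [1 / 3, 1 / 3, 1 / 3]       # assumed: no model favoured beforehand

heads, tails = 8, 2                  # made-up data
# P(D | M) for one particular sequence with this many heads and tails;
# the binomial coefficient is the same for every model and cancels.
likelihoods = [b ** heads * (1 - b) ** tails for b in biases]

posteriors = posterior_over_models(priors, likelihoods)
for b, p in zip(biases, posteriors):
    print(f"P(bias={b} | data) = {p:.3f}")
```

The output is a ranking of the models relative to each other, which is the sense in which the comment says confidence is never absolute.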

In this framework, we don’t have absolute objective levels of confidence in our theories. All that is absolute and objective is how the data should change our confidence in various theories. We can’t just reject a theory if the data don’t match well, unless we have a better alternative theory to which we can switch. In many cases these models can be continuously indexed, such that the index corresponds to a parameter in a unified model. This then becomes parameter estimation: we get a range of theories with probability densities instead of probabilities, or equivalently, one theory with a probability density on a parameter, and getting new data mechanically turns a crank to give us a new probability density on this parameter.
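One standard case where the crank is especially easy to turn is a coin with unknown bias and a Beta density as the prior: each batch of flips just adds to the two Beta parameters. This particular model is my choice of example, not something the comment specifies, and the data batches below are made up.

```python
def update_beta(alpha, beta, heads, tails):
    """Beta(alpha, beta) prior + binomial data -> Beta(alpha + heads, beta + tails)."""
    return alpha + heads, beta + tails


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) density over the coin's bias."""
    return alpha / (alpha + beta)


belief = (1, 1)   # Beta(1, 1): a flat density over the bias (assumed prior)
for heads, tails in [(8, 2), (3, 7)]:   # two made-up batches of flips
    belief = update_beta(*belief, heads, tails)
    print(f"after {heads}H/{tails}T: Beta{belief}, mean {beta_mean(*belief):.3f}")
```

The posterior after the first batch becomes the prior for the second, so the order in which the batches arrive does not change the final density.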

There are a couple unsatisfying bits here:
First, it really would be nice to say “this theory is ridiculous because it doesn’t explain the data” without any reference to any other theory. But if we know it’s the only theory in town, we don’t have a choice. If it’s not the only theory in town, then how useful it is can really only coherently be measured relative to how useful other theories are.
Second, we need to give “prior probabilities” to our various theories, and the math doesn’t give any direct justifications for what these should be. However, as long as these aren’t crazy, the incoming data will continuously update these so that the ones that seem more useful will get weighted as more useful, and the ones that aren’t will get weighted as less useful. This of course means we need reasonable spaces of theories to work over, and we’ll only pick a good model if we have a good model in this space of theories. If you eventually realize that “hey, all these models are crappy”, there is no good way of expanding the set of models you’re willing to consider, though a common way is to just “start over” with an expanded model space, and reallocated prior probabilities. You can’t just pretend that the first analysis was over some subset of this analysis, because the rescaling due to the P(D) term depends on the set of models you have. (Though you can handwave that you weren’t actually calculating P(M_i|D), but P(M_i|D, {M}), the probability of each model given the data, assuming that it was one of these models).
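The dependence on the model set can be seen in a toy calculation. In this sketch the likelihood values and model names are invented, and the prior is simply taken to be uniform over whichever models are under consideration; the point is only that adding a model changes every posterior through the P(D) term.

```python
# Hypothetical values of P(D | M) for three models and some fixed data D.
LIKELIHOOD = {"M1": 0.02, "M2": 0.01, "M3": 0.20}


def posterior(model_names):
    """Posterior over the given models, with a uniform prior over exactly that set."""
    prior = 1.0 / len(model_names)
    joint = {m: prior * LIKELIHOOD[m] for m in model_names}
    p_data = sum(joint.values())     # depends on which models are included
    return {m: j / p_data for m, j in joint.items()}


small = posterior(["M1", "M2"])
expanded = posterior(["M1", "M2", "M3"])

print(small)      # M1 looks good when M2 is its only rival
print(expanded)   # both shrink once the better model M3 is admitted
```

M1 drops from about two thirds to under a tenth once M3 is admitted, yet the ratio between M1 and M2 is the same in both runs, which is what makes the P(M_i|D, {M}) reading workable.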

A sometimes useful shortcut, rather than working directly with the probabilities and hence needing the rescaling, is to work with the likelihoods (or, more tractably, their logs). The difference of the log likelihoods of two different theories for some data is a reasonable measure of how much that data should affect their relative ranking. But any given likelihood by itself hasn’t much meaning; only comparing it to the rest in a set tells you anything useful.
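Here is a sketch of that shortcut for two hypothetical coin models (bias 0.7 versus a fair coin) and made-up flip counts. Working in logs means independent batches of data simply add, and the difference between two models needs no normalisation over a model set.

```python
import math


def log_likelihood(bias, heads, tails):
    """log P(D | bias) for one particular sequence of flips; the binomial
    coefficient is omitted because it cancels when models are compared."""
    return heads * math.log(bias) + tails * math.log(1.0 - bias)


# How much do 8 heads and 2 tails favour "bias 0.7" over "fair coin"?
diff = log_likelihood(0.7, 8, 2) - log_likelihood(0.5, 8, 2)
print(f"log-likelihood difference: {diff:.3f}")
print(f"likelihood ratio: {math.exp(diff):.2f}")

# Independent batches add in log space.
combined = log_likelihood(0.7, 8, 2) + log_likelihood(0.7, 3, 7)
print(math.isclose(combined, log_likelihood(0.7, 11, 9)))
```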

• Very nice! I’d only replace “useful” with “plausible”. (Sure, it’s hard to define plausibility, but usefulness is not really the right concept.)

• And besides, as a software developer with plenty of Bayesian theory behind me, I appreciate the simplicity of the article for the clarity it provides me. Thanks for “aiming low” ;-)

• Great, great post. I like that it’s more qualitative and philosophical than quantitative, which really makes it clear how to think like a Bayesian. Though I know the math is important, having this kind of intuitive, qualitative understanding is very useful for real life, when we don’t have exact statistics for so many things.

• Thanks Kaj,

As I stated in my last post, reading LW often gives me the feeling that I have read something very important, yet I often don’t immediately know why what I just read should be important until I have some later context in which to place the prior content.

Your post just gave me the context in which to make better sense of all of the prior content on Bayes here on LW.

It doesn’t hurt that I have finally dipped my toes in the Bayesian Waters of Academia in an official capacity with a Probability and Stats class (which seems to be a prerequisite for so many other classes). The combined information from school and the content here have helped me to get a leg up on the other students in the usage of Bayesian Probability at school.

I am just lacking one bit in order to fully integrate Bayes into my life: How to use it to test my beliefs against reality. I am sure that this will come with experience.

• I recently started working through this Applied Bayesian Statistics course material, which has done wonders for my understanding of Bayesianism vs. the bag-of-tricks statistics I learned in engineering school.

• So I finally picked up a copy of Probability Theory: The Logic of Science, by E.T. Jaynes. It’s pretty intimidating and technical, but I was surprised how much prose there is, which makes it surprisingly palatable. We should recommend this more here on Less Wrong.

• Just remember that Jaynes was not a mathematician and many of his claims about pure mathematics (as opposed to computations and their applications) in the book are wrong. Especially, infinity is not mysterious.

• Especially, infinity is not mysterious.

It should be obvious that infinity (like all things) is not inherently mysterious, and equally obvious that it’s mysterious (if not unknown) to most people.

• “Infinity is mysterious” was intended as a paraphrase of Jaynes’ chapter on “paradoxes” of probability theory, and I intended “mysterious” precisely in the sense of inherently mysterious. As far as I know, Jaynes didn’t use the word “mysterious” himself. But he certainly claims that rules of reasoning about infinity (which he conveniently ignores) are not to be trusted and that they lead to paradoxes.

• Bayesianism is more than just subjective probability; it is a complete decision theory.

A decent summary is provided by Sven Ove Hansson:

1. The Bayesian subject has a coherent set of probabilistic beliefs.

2. The Bayesian subject has a complete set of probabilistic beliefs.

3. When exposed to new evidence, the Bayesian subject changes his (her) beliefs in accordance with his (her) conditional probabilities.

4. Finally, Bayesianism states that the rational agent chooses the option with the highest expected utility.
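Point 4 can be sketched in a few lines. The umbrella decision, the probability of rain, and the utility numbers below are all invented for illustration; the only thing taken from the list is the rule “pick the option with the highest expected utility”.

```python
def expected_utility(outcomes):
    """Sum of probability * utility over an option's possible outcomes."""
    return sum(p * u for p, u in outcomes)


P_RAIN = 0.3   # assumed subjective probability of rain

# Each option maps to (probability, utility) pairs for rain / no rain.
options = {
    "take umbrella": [(P_RAIN, 8), (1 - P_RAIN, 6)],
    "leave umbrella": [(P_RAIN, 0), (1 - P_RAIN, 10)],
}

best = max(options, key=lambda name: expected_utility(options[name]))
for name, outcomes in options.items():
    print(f"{name}: {expected_utility(outcomes):.1f}")
print("choose:", best)
```

Points 1 to 3 only constrain where `P_RAIN` comes from and how it changes; the utilities and the `max` are what point 4 adds.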

• What Bayescraft covers is a matter of tendentious definitions. I personally do not consider decision theory a necessary part of it, though it is certainly part of what we’re trying to capture at LessWrong.

• I agree. The line between belief and decision is the line between 3 and 4 in that list and it is such a clean line that the von Neumann-Morgenstern axioms can be (and usually are) presented about a frequentist world.

• “A might be the reason for symptom X, then we have to take into account both the probability that X caused A”

I think you have accidentally swapped some variables there

• Thanks, fixed.

• It seems there are a few meta-positions you have to hold before taking Bayesianism as talked about here; you need the concept of Winning first. Bayes is not sufficient for sanity, if you have, say, an anti-Occamian or anti-Laplacian prior.

What this site is for is to help us be good rationalists; to win. Bayesianism is the best candidate methodology for dealing with uncertainty. We even have theorems that show that in its domain it’s uniquely good. My understanding of what we mean by Bayesianism is updating in the light of new evidence, and updating correctly within the constraints of sanity (cf Dutch books).
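For readers who haven’t met Dutch books, here is a minimal sketch of the simplest one. It makes the usual assumption that the agent will buy or sell a ticket paying 1 unit at a price equal to its stated probability; the function name and the prices are hypothetical.

```python
from fractions import Fraction


def sure_loss(price_a, price_not_a):
    """Guaranteed loss for an agent who prices a 1-unit ticket on A at price_a
    and a 1-unit ticket on not-A at price_not_a. Exactly one ticket pays out.
    If the prices sum to more than 1, the bookie sells the agent both tickets;
    if they sum to less than 1, the bookie buys both from the agent."""
    return abs(price_a + price_not_a - 1)


# Incoherent beliefs: P(A) = 3/5 and P(not A) = 3/5.
print(sure_loss(Fraction(3, 5), Fraction(3, 5)))   # 1/5, whatever happens

# Coherent beliefs: P(A) + P(not A) = 1 leaves nothing to exploit this way.
print(sure_loss(Fraction(3, 5), Fraction(2, 5)))   # 0
```

The loss does not depend on whether A actually happens, which is why avoiding it is treated as a constraint on sanity rather than a matter of luck.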

• We can discuss both epistemic and instrumental rationality.

• You are right that Bayesianism isn’t sufficient for sanity, but why should it prevent a post explaining what Bayesianism is? It’s possible to be a Bayesian with wrong priors. It’s also good to know what Bayesianism is, especially when the term is so heavily used. My understanding is that the OP is doing a good job keeping concepts of winning and Bayesianism separated. The contrary would conflate Bayesianism with rationality.

• The penultimate paragraph about our beliefs isn’t about Bayesianism so much as heuristics and biases. Unless you were a Bayesian from birth, for at least part of your life your beliefs evolved in a crazy fashion not entirely governed by Bayes’ theorem. It is for this reason that we should be suspicious of the beliefs based on assumptions we’ve never scrutinized.

• Thanks!

And interestingly, I find myself looking at my upvotes here and there and wondering what the appropriate “conversion rate” is for purposes of feeling good over a successful post. I’ve now gotten 31 upvotes there, but only 13 here. Obviously getting upvotes over there is easier than over here, so I shouldn’t value this as much as if I’d got 13 + 31 = 46 upvotes here. On the other hand, I should probably allow myself a small bonus for writing a cross-domain post that is good enough to get upvotes on both sites. Hum. Man, this is tough.

• By any standard you had a successful Hacker News post—it was on the front page for most of the morning, which is good. The number of votes is not meaningful at all on Hacker News so there’s no conversion rate. Also, I strongly suspect that many of the initial early votes on HN came from primary LW users following my link and then upvoting, possibly even people that didn’t upvote it on LW.

• The ‘Intuitive Explanation’ link has changed to http://yudkowsky.net/rational/bayes

Firstly, I should say I’m still very undecided on the matter. I’ve heard a lot of convincing evidence for both sides of the story, and I know many intelligent people whose opinion I respect on both sides of the fence. I do, however, think that it is often dismissed too easily.

Many of the criticisms of the 9/11 cover-up theories still implicitly use arguments of ridicule like “oh yeah, sure, it was all entirely plotted by top US officials who collaborated in this mass conspiracy”. As woozle said, the main argument is that there are major holes in the official story, and this is a much harder claim to refute.

A common response to this is “well, of course there are holes, it’s a complex official story; if you look hard enough you’re bound to find inconsistencies”. Is that really satisfactory? Perhaps if you were investigating a bank robbery or tax fraud, but with an event of this significance and scale I think any inconsistencies and even a remote possibility of foul play should be taken far more seriously.

Secondly, people seem to have an ill-informed, far too high respect for government. These people make manipulative and often very damaging decisions every day. A major argument in the thread below is that we should assign an extremely low prior to a government conspiracy, which should essentially cause us to disregard this possibility. But anyone who has done any real research on 9/11 should have stumbled across Operation Northwoods (sorry, I don’t know how to link in these threads: http://en.wikipedia.org/wiki/Operation_Northwoods). This is an uncovered secret government plan to stage a terrorist attack against America and blame it on Cuba in order to gain public support to invade Cuba. Ring any bells? There is no controversy regarding the existence of this plan, which was eventually cancelled. We know the government is capable of thinking this way, so why should we have such a low prior for this possibility?

Frankly, I’m a bit sick of the whole “it’s in the past” attitude. We now know that the invasion of Iraq was totally illegal, that the American government, and my Australian government, were entirely aware of the fact that there were no weapons of mass destruction, but what is our response? Oh well, they fooled us good, hey. I can’t believe how easily they were let off the hook for deceiving a nation to start a war and cause thousands of civilian casualties. I know this is off topic, but just consider the very possibility that there was any level of involvement, or at least prior knowledge of the attacks, at any level of government. Surely these allegations should not be dismissed as easily as they are, given that, from what I have heard, there are undeniably some real problems with the official story.