One-Magisterium Bayes

[Epistemic Status: Very partisan /​ opinionated. Kinda long, kinda rambling.]

In my conversations with members of the rationalist community, and in my reading of articles and blog posts written both inside and outside it, I’ve noticed a recent trend towards skepticism of Bayesian principles and philosophy (see Nostalgebraist’s recent post for an example). I have regarded this with both surprise and a little dismay, because I think progress within a community tends to be indicated by moving forward to new subjects and problems rather than returning to old ones that have already been extensively argued for and discussed. So the intent of this post is to summarize a few of the claims I’ve seen put forward and to point out where I believe they have gone wrong.

It’s also a somewhat odd direction for discussion to be going in, because the academic statistics community has largely moved on from debates between Bayesian and Frequentist theory, and has come to accept both the Bayesian and the Frequentist / Fisherian viewpoints as valid. When E.T. Jaynes wrote his famous book, the debate was mostly still raging, and many questions had yet to be answered. In the 21st century, statisticians have mostly come to accept a world in which both approaches exist and have their merits.

Because I will be defending the Bayesian side here, there is a risk that this post will come off as dogmatic. We are a community devoted to free thought after all, and any argument towards a form of orthodoxy might be perceived as an attempt to stifle dissenting viewpoints. That is not my intent here, and in fact I plan on arguing against Bayesian dogmatism as well. My goal is to argue that having a base framework in which we can feel relatively high confidence is useful to the goals of the community, and that if we feel high enough confidence in it, then spending extra effort trying to prove it false might be wasting brainpower that could be used on more interesting or useful tasks. There could always be a point we reach where most of us strongly feel that unless we abandon Bayesianism, we can’t make any further progress. I highly doubt that we have reached such a point or that we ever will.

This is also a personal exercise to test my understanding of Bayesian theory and my ability to communicate it. My hope is that if my ideas here are well presented, it should be much easier for both myself and others to find flaws with it and allow me to update.

I will start with an outline of philosophical Bayesianism, also called “Strong Bayesianism”, or, as I prefer to call it, “One Magisterium Bayes.” The reason for wanting to refer to it as a single magisterium will hopefully become clear. The Sequences did argue for this point of view. However, I think the strength of the Sequences had more to do with explaining why you should update your beliefs in the face of new evidence than with why Bayes’ theorem is the correct way to do this. In contrast, I think the argument for using Bayesian principles as the correct set of reasoning principles was made more strongly by E.T. Jaynes. Unfortunately, I feel like his exposition of the subject tends to get ignored relative to the material presented in the Sequences. Not that the information in the Sequences isn’t highly relevant and important, just that Jaynes’ arguments are much more technical, and their strength can be overlooked for this reason.

The way to start an exposition on one-magisterium rationality is by contrast to multi-magisteria modes of thought. I would go as far as to argue that the multi-magisterium view, or what I sometimes prefer to call tool-boxism, is by far the most dominant way of thinking today. Tool-boxism can be summarized by “There is no one correct way to arrive at the truth. Every model we have today about how to arrive at the correct answer is just that – a model. And there are many, many models. The only way to get better at finding the correct answer is through experience and wisdom, with a lot of insight and luck, just as one would master a trade such as woodworking. There’s nothing that can replace or supersede the magic of human creativity. [Sometimes it will be added:] Also, don’t forget that the models you have about the world are heavily, if not completely, determined by your culture and upbringing, and there’s no reason to favor your culture over anyone else’s.”

As I hope to argue in this post, tool-boxism has many downsides that should push us further towards accepting the one-magisterium view. It also very dramatically differs in how it suggests we should approach the problem of intelligence and cognition, with many corollaries in both rationalism and artificial intelligence. Some of these corollaries are the following:

  • If there is no unified theory of intelligence, we are led towards the view that recursive self-improvement is not possible, since an increase in one type of intelligence does not necessarily lead to an improvement in a different type of intelligence.

  • A diversification of notions of correct reasoning across different domains heavily limits what can be done to reach agreement on different topics. In the end we are often forced to agree to disagree, which, while preserving social cohesion in different contexts, can be quite unsatisfying from a philosophical standpoint.

  • Related to the previous corollary, it may lead to beliefs that are sacred, untouchable, or based on intuition, feeling, or difficult to articulate concepts. This produces a complex web of topics that have to be avoided or tread carefully around, or a heavy emphasis on difficult to articulate reasons for preferring one view over the other.

  • Developing AI around a tool-box /​ multi-magisteria approach, where systems are made up of a wide array of various components, limits generalizability and leads to brittleness.

One very specific trend I’ve noticed lately in articles that aim to discredit the AGI intelligence explosion hypothesis is that they tend to take the tool-box approach when discussing intelligence, and use that to argue that recursive self-improvement is likely impossible. So rationalists should be highly interested in this kind of reasoning. One of Eliezer’s primary motivations for writing the Sequences was to make the case for a unified approach to reasoning, because it lends credence to the view in which intelligence can be replicated by machines, and in which intelligence is potentially unbounded. It was also a subtle and tough enough subject that it required hundreds of blog posts to argue for it. Because of the subtle nature of the arguments, I’m not particularly surprised by this drift, but I am concerned about it. I would prefer if we didn’t drift.

I’m trying not to sound No-True-Scotsman-y here, but I wonder what it is that could make one a rationalist if they take the tool-box perspective. After all, even if you have a multi-magisterium world-view, there still always is an underlying guiding principle directing the use of the proper tools. Often times, this guiding principle is based on intuition, which is a remarkably hard thing to pin down and describe well. I personally interpret the word ‘rationalism’ as meaning in the weakest and most general sense that there is an explanation for everything – so intelligence isn’t irreducibly based on hand-wavy concepts such as ingenuity and creativity. Rationalists believe that those things have explanations, and once we have those explanations, then there is no further use for tool-boxism.

I’ll repeat the distinction between tool-boxism and one-magisterium Bayes, because I believe it’s that important: Tool-boxism implies that there is no underlying theory that describes the mechanisms of intelligence. And this assumption basically implies that intelligence is either composed of irreducible components (where one component does not necessarily help you understand a different component) or some kind of essential property that cannot be replicated by algorithms or computation.

Why is tool-boxism the dominant paradigm then? Probably because it is the most pragmatically useful position to take in most circumstances when we don’t actually possess an underlying theory. But the fact that we sometimes don’t have an underlying theory or that the theory we do have isn’t developed to the point where it is empirically beating the tool box approach is sometimes taken as evidence that there isn’t a unifying theory. This is, in my opinion, the incorrect conclusion to draw from these observations.

Nevertheless, it seems like a startlingly common conclusion to draw. I think the great mystery is why this is so. I don’t have very convincing answers to that question, but I suspect it has something to do with how heavily our priors are biased against a unified theory of reasoning. It may also be due to the subtlety and complexity of the arguments for a unified theory. For that reason, I highly recommend reviewing those arguments (and few people other than E.T. Jaynes and Yudkowsky have made them). So with that said, let’s review a few of those arguments, starting with one of the myths surrounding Bayes theorem I’d like to debunk:

Bayes Theorem is a trivial consequence of the Kolmogorov Axioms, and is therefore not powerful.

This claim is usually presented as part of a larger claim that “Bayesian” probability is just a small part of regular probability theory, and therefore does not give us any more useful information than you’d get from just studying probability theory. As a consequence, if you insist that you’re a “Strong” Bayesian, you’re insisting on using only that small subset of probability theory and associated tools we call Bayesian.

And the part of the statement that says the theorem is a trivial consequence of the Kolmogorov axioms is technically true. It’s the implication typically drawn from this that is false. The reason it’s false has to do with Bayes theorem being a non-trivial consequence of a simpler set of axioms /​ desiderata. This consequence is usually formalized by Cox’s theorem, which is usually glossed over or not quite appreciated for how far-reaching it actually is.

Recall that the qualitative desiderata for a set of reasoning rules were:

  1. Degrees of plausibility are represented by real numbers.

  2. Qualitative correspondence with common sense.

  3. Consistency.

You can read the first two chapters of Jaynes’ book, Probability Theory: The Logic of Science, if you want more detail on what those desiderata mean. But the important thing to note is that they are merely desiderata, not axioms. This means we are not assuming those things are already true; we just want to devise a system that satisfies those properties. The beauty of Cox’s theorem is that it specifies exactly one set of rules that satisfies these properties, and both Bayes’ Theorem and the Kolmogorov Axioms are consequences of those rules.
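For reference, the rules that come out of this argument are usually written as the product rule and the sum rule. The notation below (propositions A and B, background information C) is my own choice of symbols rather than anything fixed by the argument, and this is only a sketch of the final result, not of the derivation itself. Bayes’ theorem then follows from the symmetry of the product rule:

```latex
\begin{align}
p(AB \mid C) &= p(A \mid BC)\, p(B \mid C) = p(B \mid AC)\, p(A \mid C)
  && \text{(product rule)} \\
p(A \mid C) + p(\bar{A} \mid C) &= 1
  && \text{(sum rule)} \\
p(A \mid BC) &= \frac{p(B \mid AC)\, p(A \mid C)}{p(B \mid C)}
  && \text{(Bayes' theorem, from the two forms of the product rule)}
\end{align}
```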

The other nice thing about this is that degrees of plausibility can be assigned to any proposition, or any statement that you could possibly assign a truth value to. It does not limit plausibility to “events” that take place in some kind of space of possible events like whether a coin flip comes up heads or tails. What’s typically considered the alternative to Bayesian reasoning is Classical probability, sometimes called Frequentist probability, which only deals with events drawn from a sample space, and is not able to provide methods for probabilistic inference of a set of hypotheses.

For axioms, Cox’s theorem merely requires you to accept Boolean algebra and calculus to be true, and then you can derive probability theory as extended logic from that. So this should be mindblowing, right? One Magisterium Bayes? QED? Well, apparently this set of arguments is not convincing to everyone, and it’s not because people find Boolean logic and calculus hard to accept.

Rather, there are two major and several somewhat minor difficulties encountered within the Bayesian paradigm. The two major ones are as follows:

  • The problem of hypothesis generation.

  • The problem of assigning priors.

The list of minor problems is as follows, although like any list of minor issues, it is definitely not exhaustive:

  • Should you treat “evidence” for a hypothesis, or “data”, as having probability 1?

  • Bayesian methods are often computationally intractable.

  • How to update when you discover a “new” hypothesis.

  • Divergence in posterior beliefs for different individuals upon the acquisition of new data.

Most Bayesians typically never deny the existence of the first two problems. What some anti-Bayesians conclude from them, though, is that Bayesianism must be fatally flawed due to those problems, and that there is some other way of reasoning that would avoid or provide solutions to those problems. I’m skeptical about this, and the reason I’m skeptical is because if you really had a method for say, hypothesis generation, this would actually imply logical omniscience, and would basically allow us to create full AGI, RIGHT NOW. If you really had the ability to produce a finite list containing the correct hypothesis for any problem, the existence of the other hypotheses in this list is practically a moot point – you have some way of generating the CORRECT hypothesis in a finite, computable algorithm. And that would make you a God.

As far as I know, being able to do this would imply that P = NP is true, and as far as I know, most computer scientists do not think it’s likely to be true (And even if it were true, we might not get a constructive proof from it). But I would ask: Is this really a strike against Bayesianism? Is the inability of Bayesian theory to provide a method for providing the correct hypothesis evidence that we can’t use it to analyze and update our own beliefs?

I would add that there are plenty of ways to generate hypotheses by other methods. For example, you can try to make the hypothesis space gargantuan, and encode different hypotheses in a vector of parameters, and then use different optimization or search procedures like evolutionary algorithms or gradient descent to find the most likely set of parameters. Not all of these methods are considered “Bayesian” in the sense that you are summarizing a posterior distribution over the parameters (although stochastic gradient descent might be). It seems like a full theory of intelligence might include methods for generating possible hypotheses. I think this is probably true, but I don’t know of any arguments that it would contradict Bayesian theory.
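To make the distinction a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption chosen for illustration: the “hypothesis space” is just the bias of a coin, the data (9 heads, 3 tails) is made up, and the step size and grid resolution are arbitrary. Gradient ascent returns a single most-likely parameter, whereas the grid version summarizes a whole posterior under a uniform prior.

```python
import math

HEADS, TAILS = 9, 3  # hypothetical data, chosen only for illustration


def log_likelihood(theta, heads=HEADS, tails=TAILS):
    """Log-likelihood of a coin with bias `theta` given the observed counts."""
    return heads * math.log(theta) + tails * math.log(1 - theta)


def gradient_ascent(heads=HEADS, tails=TAILS, theta=0.5, step=0.001, iters=5000):
    """Search for the single most likely bias (a point estimate, not a posterior)."""
    for _ in range(iters):
        grad = heads / theta - tails / (1 - theta)
        # Clamp so theta stays strictly inside (0, 1).
        theta = min(max(theta + step * grad, 1e-6), 1 - 1e-6)
    return theta


def grid_posterior_mean(heads=HEADS, tails=TAILS, points=999):
    """Summarize the posterior over theta on a grid, assuming a uniform prior."""
    grid = [(i + 1) / (points + 1) for i in range(points)]
    weights = [math.exp(log_likelihood(t, heads, tails)) for t in grid]
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total


most_likely = gradient_ascent()         # close to 9/12
posterior_mean = grid_posterior_mean()  # close to 10/14
print(most_likely, posterior_mean)
```

The two numbers differ because they answer different questions: the first is the best single hypothesis the search found, the second is an average over every hypothesis weighted by its plausibility.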

The reason assigning prior probabilities is such a huge concern is that it forces Bayesians to hold “subjective” probabilities, where in most cases, if you’re not an expert in the domain of interest, you don’t really have a good argument for why you should hold one prior over another. Frequentists often contrast this with their methods which do not require priors, and thus hold some measure of objectivity.

E.T. Jaynes never considered this to be a flaw in Bayesian probability, per se. Rather, he considered hypothesis generation, as well as assigning priors, to be outside the scope of “plausible inference”, which is what he considered to be the domain of Bayesian probability. He himself argued for using the principle of maximum entropy for creating a prior distribution, and there are also more modern techniques such as Empirical Bayes.
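As a rough illustration of what a maximum entropy prior looks like in practice, here is a small Python sketch. It assumes the only thing we know about a six-sided die is its long-run average roll (the target mean of 4.5 is an illustrative choice), and it uses the standard result that the maximum entropy distribution under a mean constraint has the exponential form p(k) ∝ exp(λk). The function name and the bisection search for λ are my own choices, not anything prescribed by Jaynes.

```python
import math


def maxent_die(target_mean, faces=(1, 2, 3, 4, 5, 6), tol=1e-12):
    """Maximum entropy distribution over `faces` with a given mean.

    Uses the exponential-family form p(k) proportional to exp(lam * k)
    and finds lam by bisection, since the mean is increasing in lam.
    """

    def mean_for(lam):
        weights = [math.exp(lam * k) for k in faces]
        return sum(k * w for k, w in zip(faces, weights)) / sum(weights)

    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid

    lam = (lo + hi) / 2
    weights = [math.exp(lam * k) for k in faces]
    total = sum(weights)
    return [w / total for w in weights]


probs = maxent_die(4.5)    # skewed towards the high faces
uniform = maxent_die(3.5)  # no information beyond the fair mean: uniform
print([round(p, 4) for p in probs])
```

With a mean of 3.5 the constraint carries no information beyond what a fair die already implies, so the result is the uniform prior, which is the behavior one would hope for.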

In general, Frequentists often have the advantage that their methods are often simpler and easier to compute, while also having strong guarantees about the results, as long as certain constraints are satisfied. Bayesians have the advantage that their methods are “ideal” in the sense that you’ll get the same answer each time you run an analysis. And this is the most common form of the examples that Bayesians use when they profess the superiority of their approach. They typically show how Frequentist methods can give both “significant” and “non-significant” labels to their results depending on how you perform the analysis, whereas the Bayesian way just gives you the probability of the hypothesis, plain and simple.
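A standard version of that kind of example can be sketched in a few lines of Python. The setup is an assumption for illustration: a coin is tossed 12 times and comes up heads 9 times, and we test whether it is fair against the alternative that it favors heads. If the experimenter planned to toss exactly 12 times, the p-value comes from a binomial distribution. If the plan was to toss until the third tail, it comes from a negative binomial distribution. The data are identical, but the labels differ at the usual 0.05 threshold.

```python
from fractions import Fraction
from math import comb


def binomial_p_value(heads, n):
    """P(at least `heads` heads in `n` tosses of a fair coin)."""
    return Fraction(sum(comb(n, k) for k in range(heads, n + 1)), 2 ** n)


def negative_binomial_p_value(heads, tails):
    """P(at least `heads` heads before the `tails`-th tail, fair coin).

    Equivalent to seeing at most tails - 1 tails in the first
    heads + tails - 1 tosses.
    """
    m = heads + tails - 1
    return Fraction(sum(comb(m, k) for k in range(tails)), 2 ** m)


def likelihood_ratio(theta, heads=9, tails=3):
    """Ratio of the two sampling models' likelihoods at coin bias `theta`."""
    shared = theta ** heads * (1 - theta) ** tails
    fixed_n = comb(heads + tails, heads) * shared
    fixed_tails = comb(heads + tails - 1, tails - 1) * shared
    return fixed_n / fixed_tails


p_fixed_n = binomial_p_value(9, 12)               # about 0.073
p_fixed_tails = negative_binomial_p_value(9, 3)   # about 0.033
print(float(p_fixed_n), float(p_fixed_tails))
```

The likelihood ratio between the two sampling models is the same constant for every value of the coin’s bias, so any prior yields the same posterior under either stopping plan. That is the sense in which the Bayesian answer does not depend on how the analysis was set up.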

I think that in general, one could say that Frequentist methods are a lot more “tool-boxy” and Bayesian methods are more “generally applicable” (if computational tractability weren’t an issue). That gets me to the second myth I’d like to debunk:

Being a “Strong Bayesian” means avoiding all techniques not labeled with the stamp of approval from the Bayes Council.

Does this mean that Frequentist methods, because they are tool-box approaches, are wrong or somehow bad to use, as some argue that Strong Bayesians claim? Not at all. There’s no reason not to use a specific tool, if it seems like the best way to get what you want, as long as you understand exactly what the results you’re getting mean. Sometimes I just want a prediction, and I don’t care how I get it. I know that labeling a specific algorithm “Bayesian” doesn’t confer any magical properties on it. Any Bayesian may want to know the frequentist properties of their model. It’s easy to forget that different communities of researchers flying the flag of their tribe developed some methods and then labeled them according to their tribal affiliation. That’s ok. The point is, if you really want to have a Strong Bayesian view, then you also have to assign probabilities to various properties of each tool in the toolbox.

Chances are, if you’re a statistics/​data science practitioner with a few years of experience applying different techniques to different problems and different data sets, and you have some general intuitions about which techniques apply better to which domains, you’re probably doing this in a Bayesian way. That means, you hold some prior beliefs about whether Bayesian Logistic Regression or Random Forests is more likely to get what you want on this particular problem, you try one, and possibly update your beliefs once you get a result, according to what your models predicted.

Being a Bayesian often requires you to work with “black boxes”, or tools that you know give you a specific result, even though you don’t have a full explanation of how they arrive at that result or how they fit into the grand scheme of things. A Bayesian fundamentalist may refuse to work with any statistical tool like that, not realizing that in their everyday lives they often use tools, objects, or devices that aren’t fully transparent to them. But you can, and in fact do, have models about how those tools can be used and the results you’d get if you used them. The way you handle these models, even if they are held in intuition, probably looks pretty Bayesian upon deeper inspection.

I would suggest that instead of using the term “Fully Bayesian” we use the phrase “Infinitely Bayesian” to refer to using a Bayesian method for literally everything, because it more accurately shows that it would be impossible to actually model every single atom of knowledge probabilistically. It also makes it easier to see that even the Strongest Bayesian you know probably isn’t advocating this.

Let me return to the “minor problems” I mentioned earlier, because they are pretty interesting. Some epistemologists have a problem with Bayesian updating because it requires you to assume that the “evidence” you receive at any given point is completely true with probability 1. I don’t really understand why it requires this. I’m easily able to handle the case where I’m uncertain about my data. Take the situation where my friend is rolling a six-sided die, and I want to know the probability of it coming up 6. I assume all sides are equally likely, so my prior probability for 6 is 1/6. Let’s say that he rolls it where I can’t see it, and then tells me the die came up even. What do I update p(6) to?

Let’s say that I take my data as saying “the die came up even.” Then p(6 | even) = p(even | 6) * p(6) / p(even) = 1 * (1/6) / (1/2) = 1/3. Ok, so I should update p(6) to 1/3 now right? Well, that’s only if I take the evidence of “the die came up even” as being completely true with probability one. But what actually happened is that my friend TOLD ME the die came up even. He could have been lying, maybe he forgot what “even” meant, maybe his glasses were really smudged, or maybe aliens took over his brain at that exact moment and made him say that. So let’s say I give a 90% chance to him telling the truth, or equivalently, a 90% chance that my data is true. What do I update p(6) to now?

It’s pretty simple. I just expand p(6) over “even” as p(6) = p(6 | even) p(even) + p(6 | odd) p(odd). Before he said anything, p(even) = p(odd) and this formula evaluated to (1/3)(1/2) + (0)(1/2) = 1/6, my prior. After he told me the die came up even, I update p(even) to 0.9, and this formula becomes (1/3)(9/10) + (0)(1/10) = 9/30. A little less than 1/3. Makes sense.
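For anyone who wants to check the arithmetic, here is a short Python sketch of the die example using exact fractions. The function names are hypothetical, and “Jeffrey update” is a label I am borrowing for this kind of update on uncertain evidence. The 90% trust in my friend is the same assumption made above.

```python
from fractions import Fraction


def bayes_update(prior, likelihood, evidence):
    """Ordinary Bayes: p(H | E) = p(E | H) * p(H) / p(E)."""
    return likelihood * prior / evidence


def jeffrey_update(p_h_given_e, p_h_given_not_e, p_e):
    """Update on uncertain evidence by expanding over E and not-E."""
    return p_h_given_e * p_e + p_h_given_not_e * (1 - p_e)


p_six = Fraction(1, 6)   # prior for rolling a 6
p_even = Fraction(1, 2)  # prior for an even roll

# If the die certainly came up even: p(even | 6) = 1.
p_six_given_even = bayes_update(p_six, Fraction(1), p_even)
p_six_given_odd = Fraction(0)  # a 6 is never odd

# Before the friend speaks, the expansion just recovers the prior.
before = jeffrey_update(p_six_given_even, p_six_given_odd, p_even)

# After the friend says "even", trusted at 90%.
after = jeffrey_update(p_six_given_even, p_six_given_odd, Fraction(9, 10))

print(p_six_given_even, before, after)  # prints "1/3 1/6 3/10"
```

Setting the trust level to 1 recovers the ordinary Bayesian update, so treating data as certain is just the limiting case of this more general calculation.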

In general, I am able to model anything probabilistically in the Bayesian framework, including my data. So I’m not sure where the objection comes from. It’s true that from a modeling perspective, and a computational one, I have to stop somewhere, and just accept for the sake of pragmatism that probabilities very close to 1 should be treated as if they were 1, and not model those. Not doing that, and just going on forever, would mean being Infinitely Bayesian. But I don’t see why this counts as a problem for Bayesianism. Again, I’m not trying to be omniscient. I just want a framework for working with any part of reality, not all of reality at once. The former is what I consider “One Magisterium” to mean, not the latter.

The rest of the minor issues are also related to limitations that any finite intelligence is going to have no matter what. They should all, though, get easier as access to data increases, models get better, and computational ability gets better.

Finally, I’d like to return to an issue that I think is most relevant to the ideas I’ve been discussing here. In AI risk, it is commonly argued that a sufficiently intelligent agent will be able to modify itself to become more intelligent. This premise assumes that an agent will have some theory of intelligence that allows it to understand which updates to itself are more likely to be improvements. Because of that, many who argue against “AI Alarmism” will argue against the premise that there is a unified theory of intelligence. In “Superintelligence: The Idea that Eats Smart People”, I think most of the arguments can be reduced to basically saying as much.

From what I can tell, most arguments against AI risk in general will take the form of anecdotes about how really really smart people like Albert Einstein were very bad at certain other tasks, and that this is proof that there is no theory of intelligence that can be used to create a self-improving AI. Well, more accurately, these arguments are worded as “There is no single axis on which to measure intelligence” but what they mean is the former, since even multiple axes of intelligence (such as measure of success on different tasks) would not actually imply that there isn’t one theory of reasoning. What multiple axes of measuring intelligence do imply is that within a given brain, the brain may have devoted more space to better modeling certain tasks than others, and that maybe the brain isn’t quite that elastic, and has a hard time picking up new tasks.

The other direction in which to argue against AI risk is to argue against the proposed theories of reasoning themselves, like Bayesianism. The alternative, it seems, is tool-boxism. I really want to avoid tool-boxism because it makes it difficult to be a rationalist. Even if Bayesianism turns out to be wrong, does this exclude other, possibly undiscovered theories of reasoning? I’ve never seen that touched upon by any of the AI risk deniers. As long as there is a theory of reasoning, then presumably a machine intelligence could come to understand that theory and all of its consequences, and use that to update itself.

I think the simplest summary of my post is this: A Bayesian need not be Bayesian in all things, for reasons of practicality. But a Bayesian can be Bayesian in any given thing, and this is what is meant by “One Magisterium”.

I didn’t get to cover every corollary of tool-boxism or every issue with Bayesian statistics, but this post is already really long, so I will end it here. Perhaps I can cover those issues more thoroughly in a future post.