Contra Yudkowsky’s Ideal Bayesian

This is my first post, so forgive me for it being a bit of a carelessly referenced, informal ramble. Feedback is appreciated.

As I understand it, Yudkowsky contends that there exists an ideal Bayesian, with respect to whom any epistemic algorithm is ‘good’ only insofar as it approximates them. Specifically, this ideal Bayesian follows a procedure such that their prior is defined by a ‘best’ summary of their available knowledge. This is the basis on which he claims that, for instance, Einstein’s reasoning must have had a strictly Bayesian component to it, otherwise he could not have been correct. He generally extends this assertion to argue that one should seek the Bayesian explication of why things work, treat it as the fundamental underlying reason, and toss away the non-Bayesian parts. For a number of reasons, it is not clear to me that such a claim holds:

  • No theorem he uses to argue for Bayesianism accounts, in full generality, for a situation as complex as Einstein’s, and in fact these theorems generally allow for non-Bayesian reasoning to be perfectly correct (and sometimes superior)

  • I know of no theorem which guarantees that an idealized Bayesian who summarizes all of their background information into the prior has the desirable properties one would hope for, let alone that they are superior to (say) an Objective Bayesian in the style of Berger and such; in fact, it seems that the latter can often satisfy properties which bring it closer to a common-sensical ideal

  • From a Lakatosian perspective, “looking for the Bayesian explanation” does not seem to have been a particularly productive direction to go in: the most powerful theorems used to prove that Bayesianism works deliver their payoff only by assuming that a decent-enough Frequentist approach works. Generally, it seems that discoveries in Statistics are made by people who look for techniques which work, derived by principles other than a strict adherence to coherent Bayesianism, and then one may go on to prove that they are (or aren’t) Bayesian—when some part isn’t, it is typically not harmful (and sometimes beneficial). When such a story holds true, the coherence/prior-as-subjective-info part is not the key to the kingdom, and you are free to pick your priors with respect to a more general criterion than information-aggregation.

  • Therefore, with the generality of Bayesianism’s lawfulness (or its usefulness-as-law) in question, “Rationalism means winning” is in conflict with “Looking for Bayes-Structure”. Instead, look for winning.

I will go step by step. Though, first, a note of caution.

Inevitably, some of the results I will be pointing to will be decried as pathologies, monsters, etc. When you feel this urge, you should remember that Yudkowsky’s account is said to hold for all epistemic problems in full generality, and in this respect the mathematical toy-problems you can write down are woefully simple and clean things. If things break down at this level of simplicity, you should be far more suspicious of the Einstein example, which, among other things, is necessarily infinite-dimensional (insofar as Bayesian Einstein places a nonzero probability over all physical laws, which a consistent Bayesian ought to—I suppose you could just place zeroes over all the physical laws which are false, but that seems like the ideal Bayesian has access to the answer sheet), very likely to be non-regular (at least for the predictions available to him, even in the fullest sense-aggregating Bayesian sense), and so on.

On another note, you may also feel the temptation to run from the infinities in such pathologies, but such a strategy changes nothing—all it does is convert statements of limiting behaviour into statements of approximate behaviour at large numbers, which is to say as the sample size n grows. This restatement doesn’t change much of anything.

Accounts of the optimality of Bayesian reasoning

There are several arguments which are said to be reasons for the optimality of Bayesian reasoning as opposed to any other form of reasoning. Here is a brief list, in what I imagine to be Yudkowsky’s order of importance:

  1. Cox’s Theorem and accounts of coherence (i.e. Dutch Books)

  2. adherence to the Likelihood principle

  3. the Complete Class theorems

2 is of a separate kind from 1 and 3, so I will briefly treat it separately. I think Yudkowsky’s account of the Bayesian’s independence from stopping rules is overstated, and that in general you (and the Bayesian) should indeed care about these, regardless of the rigmarole about such things existing only in the researcher’s mind. And, as per the example earlier, it seems that allowing non-Likelihood information into your estimates can net you some fruitful properties (which a Bayesian can probably include through some meta-modelling techniques, but in a way that makes the point about strict adherence to the Likelihood principle seem a little strange).
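To make the stopping-rule worry concrete, here is a minimal sketch (a toy example of my own, with a hypothetical researcher and made-up numbers): flips of a genuinely fair coin are collected until the posterior probability that the coin is heads-biased exceeds 95% under a uniform prior, giving up after some maximum. The stopping rule lives only “in the researcher’s mind”, and the posterior at the stopping time is computed exactly as Bayes demands, yet a fraction of runs noticeably above the nominal 5% ends with a confident report of a false hypothesis.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

def sample_to_a_foregone_conclusion(p_true=0.5, threshold=0.95, n_max=2000):
    """Flip a coin with P(heads) = p_true until the posterior probability that
    theta > 0.5 (under a uniform Beta(1, 1) prior) exceeds `threshold`,
    or until n_max flips. Returns True if the threshold was ever crossed."""
    heads = 0
    for n in range(1, n_max + 1):
        heads += 1 if rng.random() < p_true else 0
        # Posterior is Beta(1 + heads, 1 + tails); P(theta > 0.5 | data) = 1 - CDF(0.5).
        if 1 - beta.cdf(0.5, 1 + heads, 1 + n - heads) > threshold:
            return True
    return False

runs = 2000
crossed = sum(sample_to_a_foregone_conclusion() for _ in range(runs))
print(f"Fraction of fair coins confidently declared biased: {crossed / runs:.3f}")
```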

In fact, I see the Likelihood principle as an example of Bayesianism leading the statistician astray—for many years (i.e., in Jaynes’s time), before we had a better accounting of causal inference, it was the Bayesian position to be against randomization, double-blinding, and the experimental designs that Rationalists typically prefer today (cf. Le Cam):

Another claim is the very curious one that if one follows the neo Bayesian theory strictly one would not randomize experiments. The advocates of the neo-Bayesian creed admit that the theory is not so perfect that one should follow its dictates in this instance. This author would say that no theory should be followed, that a theory can only suggest certain paths. However, in this particular case the injunction against randomization is a typical product of a theory which ignores differences between experiments and experiences and refuses to admit that there is a difference between events which are made equiprobable by appropriate mechanisms and events which are equiprobable by virtue of ignorance. Furthermore, the theory would turn against itself if the neo-Bayesian statistician was asked to bet on the results of a poll carried out by picking the 100 persons most likely to give the right answers. In spite of this the neo-Bayesian theory places randomization on some kind of limbo, and thus attempts to distract from the classical preaching that double blind randomized experiments are the only ones really convincing.

Of course, we are now much more convinced by the causal explanations of double-blinded randomization than the classical-statistical ones, explanations which (as Pearl says) belong to a different type of reasoning than the statistical, in his terms. The point is that looking for Bayes-Structure seems to get you much farther from the correct answer than looking for things that simply work.

Coherence

Coherence is the property that an agent (always) updates their beliefs through probabilistic conditioning; one usually argues that coherence is desirable through Cox’s theorem or the Dutch Book results. Note also that coherence is a very brittle thing—you are either coherent or you are not, and being approximately Bayesian in most senses still violates the conditions which these results pose as desirable.

When dealing in theorems of coherence, one must tread carefully. Infamously, the original theorem as posed by Cox does not even apply to the case of rolling a fair, six-sided die (its assumptions fail on small parameter spaces, and it only delivers finite additivity), let alone to Einstein’s uncountably infinite problem-space. This nice paper demonstrates a variant of Cox’s theorem which works, but assumes something that smells very Frequentist (consistency under repeated events). In general, the extensions which do work in actual cases seem to require unusual assumptions, in such a way that makes me skeptical of any claim of universality.

This is all quibbling, however, in comparison to the much more fundamental problem: why care about coherence at all? Surely, of all the desiderata we might hold an ideal reasoner to, we would care first and foremost about conditions such as these:

  • the reasoner is in some sense moving towards the correct answer, usually in a limiting sense (consistency)

  • the probabilities reported by the reasoner are accurate to the degree to which they purport to be (calibration)

You can construct many procedures which are consistent and calibrated but which are not coherent—in fact, it is extremely difficult to be coherent at all, in a way that makes it at best unclear that Yudkowsky’s proposed ideal Bayesian matches the performance of the best Frequentists on offer (at best you are resistant to Dutchmen, but only some). His favourite analogy here is that of the Carnot engine—if that analogy is to hold, there should be no engines which are efficient but not Carnot, let alone better-than-Carnot ones. If the only thing that makes a Bayesian lawfully superior to a Frequentist is this metaphysical sense of logical coherence, while being matched or defeated on the other desirable conditions for good reasoning, it seems like calling the ‘best’ engine the one which makes the most noise.

Complete Class Theorems and Optimality

In my estimation, this is by far the best argument a Bayesian has to offer, which is why Frequentists freely pick Bayesian estimators to make use of this property. Taking a naive interpretation, this is the closest that Yudkowsky gets to correct—for any decision procedure, there is a Bayes procedure whose risk is nowhere larger. The problem, of course, is that this has absolutely nothing to do with the prior or the posterior’s coherence—coherence plays no role in the proof at all, and in most cases these theorems show that Generalized-Bayes (which is to say, with improper/incoherent priors)[1] solutions are admissible too. In fact, these ‘incoherent’ solutions are often preferable (and therefore ideal), in that you can make use of minimaxity or robustness or invariance or probability-matching or some other desirable property, none of which are guaranteed or even expected with an unspecified subjective prior.

In fact, in the stated hard nonparametric problems Yudkowsky is interested in, it is not known whether such theorems hold with much weight at all (the important part is that the complete class is minimal, or consists of admissible rules; otherwise there are lots of complete classes—again as per Le Cam, other families of decision rules are complete classes under the same conditions as the usual complete class theorems). Typically, you require that the parameter space is compact and that the loss is convex, or similar. And, once again, the condition of admissibility is not enough—famously, for many problems the constant estimators are admissible, and so are Bayes estimators with incredibly far-off priors, with no guarantee of fast or even eventual convergence to the truth.
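To see how weak admissibility alone is, here is the textbook argument for that last claim, in my own words and under assumptions I am adding for concreteness (squared-error loss, and a family such as the normal location model in which all the sampling distributions share the same null sets): the constant estimator, which ignores the data entirely, is admissible.

```latex
% Risk of the constant estimator \delta_c(X) \equiv c under squared-error loss:
R(\theta, \delta_c) = (c - \theta)^2, \qquad \text{so in particular } R(c, \delta_c) = 0.
% If some rule \delta' dominated \delta_c, then at \theta = c we would need
R(c, \delta') = \mathbb{E}_{c}\big[(\delta'(X) - c)^2\big] \le 0
\;\;\Longrightarrow\;\; \delta'(X) = c \quad P_c\text{-almost surely.}
% Since the P_\theta share null sets, \delta'(X) = c almost surely under every \theta,
% hence R(\theta, \delta') = (c - \theta)^2 = R(\theta, \delta_c) for all \theta:
% nothing strictly dominates \delta_c, i.e. the data-ignoring rule is admissible.
```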

So, what is enough? Here are some trivial necessary properties.

Conditions for Bayesianism to work at all

Consistency

An ideal reasoner supplied with an infinite stream of data ought to converge to the truth[2]. This seems so utterly common-sensical that I cannot imagine arguing against it, and every Frequentist method I know of that people recommend comes with a check that this holds—it is an utterly basic, necessary-but-not-sufficient property. Bayesians have it generally good here, though, again, things are not as simple when the domain is infinite-dimensional. For simple situations (both finite and well-behaved infinite domains, which is to say most basic inference problems), if the Bayesian assigns nonzero prior probability to the truth, he will eventually converge there. Obviously, if your prior on the truth is zero you will never get there, and if it is made arbitrarily low, convergence can be made arbitrarily slow.
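As a minimal illustration of both halves of that last sentence (a toy example of my own, not part of any consistency theorem): a Beta-Binomial model whose prior puts positive mass around the truth concentrates there, while a prior whose support excludes the truth can never recover it.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
theta_true = 0.3
flips = rng.random(5000) < theta_true        # Bernoulli(0.3) data

for n in (10, 100, 5000):
    h = int(flips[:n].sum())
    # A uniform Beta(1, 1) prior puts positive density everywhere, truth included:
    lo, hi = beta.ppf([0.025, 0.975], 1 + h, 1 + n - h)
    print(f"n = {n:5d}   95% credible interval for theta: ({lo:.3f}, {hi:.3f})")

# A prior supported only on [0.6, 1) puts zero mass on theta_true = 0.3; its
# posterior can only pile up at the edge of its support, never at the truth.
grid = np.linspace(0.6, 0.999, 400)
h = int(flips.sum())
log_post = h * np.log(grid) + (len(flips) - h) * np.log1p(-grid)
print("Posterior mode under the truncated prior:", round(grid[np.argmax(log_post)], 3))
```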

The guarantees of the relevant theorem (Doob’s), however, mean far less than they appear to in nonparametric situations:

However, this note of optimism relies heavily on finite-dimensional intuition and, more particularly, Lebesgue measure. There is absolutely no implication that analogous expectations are justified in non-parametric context. Indeed, Doob’s theorem becomes highly problematic in such models: the theorem stays true exactly as stated, it simply means something else than what finite-dimensional intuition suggests. Strictly speaking, only frequentists recognize consistency problems: Doob’s proof says nothing about specific points in the model, i.e. given a particular θ₀ underlying the sample, Doob’s theorem does not give conditions that can be checked to see whether the Bayesian procedure will be consistent at this particular θ₀: it is always possible that θ₀ belongs to the null-set for which inconsistency occurs. That such null-sets may be large, is clear from example 2.1.18 and that, indeed, this may lead to grave problems in non-parametric situations, becomes apparent when we consider the counterexamples given by Freedman (1963, 1965) [97, 98] and Diaconis and Freedman (1986) [71, 72]. Non-parametric examples of inconsistency in Bayesian regression are found in Cox (1993) [61] and Diaconis and Freedman (1998) [74]. Basically what is shown is that the null-set on which inconsistency occurs in Doob’s theorem can be rather large in non-parametric situations. Some authors are tempted to present the above as definitive proof of the fact that Bayesian statistics are unfit for non-parametric estimation problems. More precise is the statement that not every choice of prior is suitable, raising the question that will entertain us for the rest of this chapter and next: under which conditions on model and prior, can we expect frequentist forms of consistency to hold?

I agree completely with the author here—the situation is not totally dismal, and, for most of these problems, carefully picked priors will work out fine. But it is absolutely without proof that the procedure of embedding as much subjective prior information as possible and praying it all works out will exercise the level of care needed for this to occur.

In fact, it seems to me much more sensible to imagine that our ideal reasoner’s priors are restricted to those which lead to sensible results in this way, but fixing which priors are allowed in advance for the sake of good properties makes you as Frequentist as the rest of them. The conditions which must be met for such a prior not to lead to pathology are hard to justify from a purely subjectivist viewpoint, though they are often flexible enough that you can decently approximate most such subjective priors within them (e.g. tail-free processes).

Calibration and Coverage

Evidently, we also want our ideal reasoner to accurately report probabilities—that events reported with probability p actually occur a fraction p of the time. Really, we would want our ideal reasoner to always be calibrated—but, for Yudkowsky’s brand of Bayesian, it is unclear that this can be met.

Firstly, every Bayesian expects to be well-calibrated, even the horrifically wrong ones. Observing that your probabilities do not seem correct does not particularly move the Bayesian:

(ii) Subject to feedback, calibration in the long run is otiose. It gives no ground for validating one coherent opinion over another as each coherent forecaster is (almost) sure of his own long-run calibration. (iii) Calibration in the short run is an inducement to hedge forecasts. A calibration score, in the short run, is improper. It gives the forecaster reason to feign violation of total evidence by enticing him to use the more predictable frequencies in a larger finite reference class than that directly relevant.

There are some situations where the Bayesian prediction is optimal (again, related to the complete class theorems and subject to the same regularity conditions as those), and the sequence of posterior predictions is known to be a martingale (i.e., stable) when your model is exactly correct. The calibration of Bayesian probability reporting, however, is a whole other matter. There are plenty of frequentist methods which guarantee calibration, even in finite samples, under minimal conditions—nonparametric methods, and in particular conformal inference, are the way to go. Frankly, I would consider such a calibrated reasoner to be much closer to ideal than an overconfident-yet-only-dubiously-correct Bayesian one.
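As a minimal sketch of what I mean (my own toy example; the split conformal recipe itself is standard): wrap any point predictor, however badly specified, in a split conformal procedure, and the resulting intervals attain at least their nominal marginal coverage in finite samples, assuming only exchangeability of the data.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    # Some unknown-to-the-analyst regression function plus noise.
    x = rng.uniform(-3, 3, n)
    return x, np.sin(x) + 0.3 * rng.standard_normal(n)

x_train, y_train = make_data(500)
x_cal, y_cal = make_data(500)        # held-out calibration split
x_test, y_test = make_data(2000)

# Any point predictor will do; here, a deliberately crude cubic fit.
coeffs = np.polyfit(x_train, y_train, deg=3)
predict = lambda x: np.polyval(coeffs, x)

# Split conformal: take (roughly) the ceil((n+1)(1-alpha))/n empirical quantile
# of the calibration residuals and use it as a symmetric interval half-width.
alpha = 0.05
scores = np.abs(y_cal - predict(x_cal))
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

covered = np.abs(y_test - predict(x_test)) <= q
print(f"Empirical coverage of nominal 95% intervals: {covered.mean():.3f}")
```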

Another issue is, of course, that of coverage—in the same sense that we want our predictions to be true as often as we say they are, we obviously desire that our reported intervals actually contain the truth at the requisite rate. Consider the following situation:[3] You are a computer programmer, tasked with writing a statistical program for some physical problem (say, estimating the physical constant g, for an experiment in which things are dropped). It is of interest to engineers and physicists alike to find some lower and upper bounds for this constant, preferably ones that form as small an interval as possible.

Imagine, then, that you set g to its known value (about 9.8) in your simulation program, and feed the generated data to your interval method, which returns your 95% interval as

[0.29, 2.34]

That doesn’t seem good, but maybe you just got unlucky. So you generate a whole sequence of intervals and they look like

[6.22, 6.98]
[-1.23, 7.77]
[0.75, 4.55]
[10.91, 12.12]

and so on, such that your intervals are only actually correct about 1% of the time. I don’t know about you, but I would suspect that we have some kind of bug! This does not seem like desirable behaviour for intervals labelled 95%: if these were predictions, they would all be wrong.

The fact that these may or may not be coherently obtained is irrelevant to their empirical inadequacy—to check whether your Bayesian credence intervals are correctly credal, you simulate random parameters from the prior and then check the intervals obtained from the resulting random posteriors, which seems like a strange thing to explain to an engineer who just wants a sensible lower or upper bound. Therefore, I will also absolutely assign this responsibility to an ideal reasoner—that the intervals he reports actually contain the things he says they contain as often as he says they do.
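A minimal sketch of the difference between the two checks, using a toy model and numbers of my own invention (a conjugate normal measurement model with known noise, and a hypothetical prior confidently centred far from the truth): coverage evaluated at the fixed true value can be abysmal even while coverage averaged over parameters drawn from the prior (the only thing credal correctness guarantees) looks perfect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
G_TRUE = 9.8          # the value set in the simulation program
SIGMA = 2.0           # measurement noise, assumed known for simplicity
N = 25                # measurements per simulated experiment

def credible_interval(data, prior_mean, prior_sd):
    """Central 95% credible interval for the mean under a conjugate normal prior."""
    post_var = 1.0 / (1.0 / prior_sd**2 + N / SIGMA**2)
    post_mean = post_var * (prior_mean / prior_sd**2 + data.sum() / SIGMA**2)
    return stats.norm.interval(0.95, loc=post_mean, scale=np.sqrt(post_var))

def coverage_at_fixed_truth(prior_mean, prior_sd, reps=5000):
    """What the engineer cares about: how often the interval catches the true g."""
    hits = 0
    for _ in range(reps):
        data = rng.normal(G_TRUE, SIGMA, N)
        lo, hi = credible_interval(data, prior_mean, prior_sd)
        hits += lo <= G_TRUE <= hi
    return hits / reps

def coverage_averaged_over_prior(prior_mean, prior_sd, reps=5000):
    """What coherence guarantees: coverage when the parameter is drawn from the prior."""
    hits = 0
    for _ in range(reps):
        g = rng.normal(prior_mean, prior_sd)
        lo, hi = credible_interval(rng.normal(g, SIGMA, N), prior_mean, prior_sd)
        hits += lo <= g <= hi
    return hits / reps

# A confidently wrong prior, centred nowhere near 9.8:
print("coverage at the true g:      ", coverage_at_fixed_truth(prior_mean=3.0, prior_sd=0.5))
print("coverage averaged over prior:", coverage_averaged_over_prior(prior_mean=3.0, prior_sd=0.5))
```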

Some people have trouble with the “oftenness” in this statement, claiming that a sequence of imaginary repetitions of the same experiment is illusory and strange to care about—I agree. I can, however, simply ask that all intervals labelled 95%, everywhere, contain the truth about 95% of the time—it does not seem imaginary that such intervals will continue to be produced.

So, when do Bayesian intervals have good coverage? The picture here is, again, not hopeless, but somewhat complex. You may pick a probability-matching prior in the style of Berger, but I seriously doubt your prior information is right-Haar invariant. For the subjectivism which Yudkowsky is advocating, the Bayesian he wants can only match this requirement asymptotically. Generally, this happens whenever the conditions of the Bernstein–von Mises theorem hold, which is probably the strongest theorem here, in the sense that it typically assumes the existence of some decent test as well as the conditions for a well-behaved Maximum Likelihood estimator to converge (such that the Bayesian estimate can converge to it).
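For reference, the rough content of the theorem, stated informally and with the usual caveats about regularity conditions (correct model, interior parameter, prior density positive and continuous at the truth, existence of suitable tests):

```latex
% Bernstein--von Mises, informally: the posterior merges in total variation with
% a normal distribution centred at the maximum likelihood estimate,
\left\| \Pi\big(\cdot \mid X_1,\dots,X_n\big)
  - \mathcal{N}\!\Big(\hat{\theta}_{\mathrm{MLE}},\ \tfrac{1}{n} I(\theta_0)^{-1}\Big)
\right\|_{\mathrm{TV}} \;\xrightarrow{\;P_{\theta_0}\;}\; 0,
% which is why, under these conditions, credible sets inherit asymptotic
% frequentist coverage.
```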

Under misspecification, there is no hope—insofar as the Bayesian is approximating the Maximum-Likelihoodist, their answers will both converge to the model closest to the truth in the sense of Kullback-Leibler divergence (ML can be shown to literally minimize it, the Bayesian only asymptotically, though I am not so sure on the details), but the Bayesian’s credence regions will diverge away from the Frequentist’s, which retain the correct coverage. This isn’t a problem for the ideal Bayesian who always has the correct model somewhere in his mind, but this property is sufficiently pathological that I thought I should mention it.
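The limiting point in question is the standard “pseudo-true” parameter of misspecification asymptotics, the Kullback-Leibler projection of the true distribution onto the model:

```latex
% Both the MLE and (under suitable conditions) the posterior concentrate on
\theta^{*} \;=\; \operatorname*{arg\,min}_{\theta \in \Theta}
    D_{\mathrm{KL}}\!\big(P^{*} \,\|\, P_{\theta}\big)
  \;=\; \operatorname*{arg\,max}_{\theta \in \Theta}
    \mathbb{E}_{P^{*}}\big[\log p_{\theta}(X)\big],
% where P^* is the true data-generating distribution, which need not lie in the model.
```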

Conclusion

Unless you hold steadfastly to coherence, none of the theorems relating to the successes of Bayesianism require you to adhere to a subjectivist framework—in general, the “Bayesian with good frequentist properties” seems closer to optimal, both in practice and as a normative ideal. And for the sorts of complicated problems Yudkowsky wants to posit a Bayesian ideal for, it does not seem clear that a Bayesian is even good at all, let alone ideal.

Maybe I am missing something, but I know of no results which show that the subjectivist-Bayesian’s information-aggregation procedure is either necessary or sufficient for any of the niceties of Bayesianism, and it seems to me that not engaging in such a thing will often be better. Frankly, it seems like a bit of a rigmarole to assume that the subjectivist-ideal Bayesian’s prior will make the posterior have these properties without some further specification, but it is not impossible that you could show this somehow—Doob’s theorem is very nice, but it just seems unclear what a subjectivist makes of, say, Schwartz’s condition, a priori.

As a much more informal, personal aside, looking for Bayesian explications of phenomena does not seem like a universally good approach, or even commonly good. Lots of perfectly good methods can be shown to be good without any reference to their Bayesianness—the explanation for why they are successful involves calculations which do not invoke anything that must be Bayesian.

An example Yudkowsky invokes actually satisfies this: the OLS estimator for a Linear Regression model cannot be the Bayes estimator under any actual, coherent (proper) prior, since it is unbiased. You can, however, approximate it with a wide-enough uniform prior, which seems like a dubious thing to want. Is this really why we consider OLS to work? Why not, instead, theorems like Gauss-Markov, or proofs that it is the UMVUE in certain situations? This seems like an example where the Bayesian explanation can only be an approximation of a working Frequentist one, rather than the other way around.
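To make the “wide-enough prior” point concrete, here is a small sketch of my own (using a zero-centred Gaussian prior of growing width as a stand-in for the widening uniform prior): the posterior mean is ridge regression with penalty equal to the inverse prior variance, which only recovers OLS in the limit of an infinitely diffuse (and hence improper) prior.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.standard_normal(n)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Posterior mean under beta ~ N(0, tau^2 I) with unit noise variance:
# the ridge estimator with penalty lambda = 1 / tau^2.
for tau in (1.0, 10.0, 1000.0):
    lam = 1.0 / tau**2
    beta_post = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(f"tau = {tau:7.1f}   posterior mean = {np.round(beta_post, 4)}")

print(f"OLS                 = {np.round(beta_ols, 4)}")
```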

What about, say, the most-cited method in statistics, Cox’s (not that one) Proportional Hazards model, which is semiparametric and involves a likelihood approximation? There is in fact some Bayesian explanation for why it kind of makes sense, but it is again post-hoc and auxiliary (and again involves extremely bizarre priors—I believe the explanation here involves some very diffuse gamma process prior for the baseline hazard, with some extra regularity properties to show that the partial likelihood is a workable approximation… Why? How would you arrive at this without post-hoc reasoning?); to me, most proofs that a working method has a Bayesian explanation prove that no subjectivist would ever have considered it, and that someone purely looking for Bayes-structure would take eons to find the great solutions that a Frequentist finds much more easily.

It seems to me that finding a Bayesian explanation for a procedure is a lot like finding a constructivist proof—sometimes desirable, often a comically bad way to actually find solutions to problems as a working man. Find something that works first, then maybe prove that it is Bayesian if it is desirable to do so (most typically in the domains where the complete class theorems hold).

Frankly, you should want to show more than this, and just finding a generically Bayesian explanation is often of no help, other than where it might help you pin down a formalization of the statistical properties which make the algorithm work (which typically involves no prior). It seems to me that you can find one of those for just about any algorithm, no matter how absurdly wrong. So, all in all, I hope the theme is clear:

Prove all things, hold fast unto that which is good

- 1 Thessalonians 5:21

Notes

I would really recommend you read this if you are interested in the more technical aspects that I am skipping over here, as well as these notes on nonparametrics. A lot of what I’m stating can be found in these pages in one version or another. Less relevant to the specific issue of an idealized Bayesian, but still a nifty collection of positive and negative results about the effectiveness of Bayesians in general, is the following collection by Shalizi.

  1. ^
  2. ^

    For finitists, “an ideal reasoner supplied with an arbitrarily increasing quantity of data ought to get as close to the truth as possible” is not any less common-sensical, though more muddled.

  3. ^