Contra Yudkowsky’s Ideal Bayesian
This is my first post, so forgive me for it being a bit of a carelessly referenced, informal ramble. Feedback is appreciated.
As I understand it, Yudkowsky contends that there exists an ideal Bayesian, with respect to which any epistemic algorithm is ‘good’ only insofar as it approximates it. Specifically, this ideal Bayesian follows a procedure such that their prior is defined by a ‘best’ summary of their available knowledge. This is the basis on which he claims that, for instance, Einstein’s reasoning must have had a strictly Bayesian component to it, otherwise he could not have been correct. He generally extends this assertion to argue for seeking the Bayesian explication of why things work, treating it as the fundamental underlying reason, and for tossing away the non-Bayesian parts. For a number of reasons, it is not clear to me that such a statement holds:
No theorem he uses to argue for Bayesianism accounts, in full generality, for a situation as complex as Einstein’s, and in fact these theorems generally allow for non-Bayesian reasoning to be perfectly correct (and sometimes superior)
I know of no theorem which guarantees the desirable properties one would hope for from an idealized Bayesian who summarizes all of their background information into the prior, nor one which establishes such a reasoner as superior to (say) an Objective Bayesian in the style of Berger and company; in fact, it seems that the latter can often satisfy properties which bring it closer to a common-sensical ideal
From a Lakatosian perspective, “looking for the Bayesian explanation” does not seem to have been a particularly productive direction to pursue: the most powerful theorems used to prove that Bayesianism works are profitable only insofar as they assume that a decent-enough Frequentist approach works. Generally, it seems that discoveries in Statistics are made by people who look for techniques which work, derived from principles other than strict adherence to coherent Bayesianism; one may then go on to prove that they are (or aren’t) Bayesian—and when some part isn’t, it is typically not harmful (and sometimes beneficial). When such a story holds true, the coherence/prior-as-subjective-info part is not the key to the kingdom, and you are free to pick your priors with respect to a more general criterion than information-aggregation.
Therefore, with the generality of Bayesianism’s lawfulness (or its usefulness-as-law) in question, “Rationalism means winning” is in conflict with “Looking for Bayes-Structure”. Instead, look for winning.
I will go step by step. Though, first, a note of caution.
Inevitably, some of the results I will be pointing to will be decried as pathologies, monsters, etc. When you feel this urge, you should remember that Yudkowsky’s account is said to hold for all epistemic problems in full generality, and in this respect the mathematical toy-problems you can write down are woefully simple and clean things. If things break down at this level of simplicity, you should be far more suspicious of the Einstein example, which, among other things, is necessarily infinite-dimensional (insofar as Bayesian Einstein places a nonzero probability over all physical laws, which a consistent Bayesian ought to—I suppose you could just place zeroes over all the physical laws which are false, but that seems like giving the ideal Bayesian access to the answer sheet), very likely to be non-regular (at least for the predictions available to him, even in the fullest sense-aggregating Bayesian sense), and so on.
On another note, you may also feel the temptation to run from the infinities in such pathologies, but such a strategy changes nothing—all it does is convert statements of limiting behaviour into statements about approximate behaviour at large numbers, which is to say as the relevant quantities (sample size, dimension) grow large. This restatement doesn’t change much of anything.
Accounts of the optimality of Bayesian reasoning
There are several arguments which are said to be reasons for the optimality of Bayesian reasoning as opposed to any other form of reasoning. Here is a brief list, in what I imagine to be Yudkowsky’s order of importance:
Cox’s Theorem and accounts of coherence (i. e. Dutch Books)
adherence to the Likelihood principle
the Complete Class theorems
2 is of a different kind from 1 and 3, so I will briefly treat it separately. I think Yudkowsky’s account of the Bayesian’s independence from stopping rules is overstated, and that in general you (and the Bayesian) should indeed care about these, regardless of the rigmarole about such things being only in the researcher’s mind. And, as per the example earlier, it seems that allowing non-Likelihood information into your estimates can net you some fruitful properties (which a Bayesian can probably include through some meta-modelling techniques, but in a way that makes the point about strict adherence to the Likelihood principle seem a little strange).
In fact, I see this as an example of Bayesianism leading the statistician astray—for many years (e.g., in Jaynes’s time), before we had a better accounting of causal inference, it was the Bayesian position to be against randomization, double-blinding, and the experimental designs that Rationalists typically prefer today (cf. Le Cam):
Another claim is the very curious one that if one follows the neo Bayesian theory strictly one would not randomize experiments. The advocates of the neo-Bayesian creed admit that the theory is not so perfect that one should follow its dictates in this instance. This author would say that no theory should be followed, that a theory can only suggest certain paths. However, in this particular case the injunction against randomization is a typical product of a theory which ignores differences between experiments and experiences and refuses to admit that there is a difference between events which are made equiprobable by appropriate mechanisms and events which are equiprobable by virtue of ignorance. Furthermore, the theory would turn against itself if the neo-Bayesian statistician was asked to bet on the results of a poll carried out by picking the 100 persons most likely to give the right answers. In spite of this the neo-Bayesian theory places randomization on some kind of limbo, and thus attempts to distract from the classical preaching that double blind randomized experiments are the only ones really convincing.
Of course, we are now much more convinced by the causal explanations for double-blinded randomization than by the classical-statistical ones, which (says Pearl) belong to a different type of reasoning than the statistical. The point is that looking for Bayes-Structure seems to get you much farther from the correct answer than looking for things that simply work.
Coherence
Coherence is the property that an agent (always) updates their beliefs through probabilistic conditioning. Usually, one argues that coherence is desirable through Cox’s theorem or the Dutch Book results. This means that coherence is a very brittle thing—you can either be coherent or not, and being approximately Bayesian in most senses still violates the conditions which these results pose as desirable.
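For concreteness, here is the Dutch-book argument in its most minimal numerical form (a toy sketch; the credences and stakes are made up for illustration): an agent whose credences in an event and its complement sum to more than one will accept a pair of individually ‘fair’ bets that together guarantee a loss.

```python
# Minimal Dutch-book illustration: incoherent credences admit a guaranteed loss.
# Convention: the agent regards paying $p for a bet returning $1 if an event with
# credence p occurs as exactly fair.
credence_A = 0.6       # agent's credence in A
credence_not_A = 0.6   # agent's credence in not-A (incoherent: the two sum to 1.2)

cost = credence_A + credence_not_A   # price of buying both "fair" bets
payoff = 1.0                         # exactly one of A / not-A occurs, so exactly one bet pays $1
net = payoff - cost                  # the agent's net outcome, whatever happens

print(f"Agent's guaranteed net outcome: {net:+.2f}")   # -0.20: a sure loss
```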
When dealing in theorems of coherence, one must tread carefully. Infamously, the original theorem as posed by Cox does not even apply to the case of rolling a fair, six-sided die, let alone Einstein’s uncountably infinite problem-space, insofar as it poses only finite additivity and small parameter-spaces. This nice paper demonstrates a variant of Cox’s theorem which works, but it assumes something that smells very Frequentist (consistency under repeated events). In general, the extensions which do work in actual cases seem to require unusual assumptions, in a way that makes me skeptical of any claim of universality.
This is all quibbling, however, in comparison to the much more fundamental problem: why care about coherence, at all? Surely, of all the desiderata that an ideal reasoner might hold, we would care first and foremost about conditions such as that
the reasoner is in some sense going towards the correct answer, usually in a limiting sense (consistency)
the probabilities reported by the reasoner are accurate to the degree with which they purport to be (calibration)
You can construct many procedures which are consistent and calibrated but which are not coherent—in fact, it is extremely difficult to be coherent at all, in a way that makes it at best unclear that Yudkowsky’s proposed ideal Bayesian matches the performance of the best Frequentists on offer (at best you are resistant to Dutchmen, but only some). His favourite analogy here is that of the Carnot engine—if that analogy is to hold, clearly we cannot be able to show engines which are efficient but not Carnot, let alone better-than-Carnot. If the only thing that makes a Bayesian lawfully superior to a Frequentist is this metaphysical sense of logical coherence, while being matched or defeated in the other desirable conditions for good reasoning, it seems akin to calling the ‘best’ engine the one which makes the most noise.
Complete Class Theorems and Optimality
In my estimation, this is by far the best argument a Bayesian has to offer, which is why Frequentists freely pick Bayesian estimators to make use of this property. Taking a naive interpretation, this is the closest that Yudkowsky gets to correct—for any decision procedure, there is a Bayesian one with risk no larger. The problem, of course, is that this has absolutely nothing to do with the prior or the posterior’s coherence—it plays no role in the proof at all, and in most cases these theorems show that Generalized-Bayes (which is to say, with improper/incoherent priors)[1] solutions are admissible too. In fact, these ‘incoherent’ solutions are often preferable (and therefore ideal), in that you can make use of minimaxity or robustness or invariance or probability-matching or some other desirable property, none of which are guaranteed or even expected with an unspecified subjective prior.
In fact, for the hard nonparametric problems Yudkowsky is interested in, it is not known whether such theorems hold with much weight at all (the important part is that the complete class is minimal or admissible; otherwise, there are lots of complete classes—again as per Le Cam, decision rules minimizing various criteria form complete classes under the same conditions as the usual complete class theorems). Typically, you require that the parameter space is compact and that the loss is convex, or similar. And, once again, the condition of admissibility is not enough—famously, for many problems, the constant estimators are admissible, and so are Bayesian ones with incredibly far-off priors with no guarantee of fast or even eventual convergence to the truth.
So, what is enough? Here are some trivial necessary properties.
Conditions for Bayesianism to work at all
Consistency
An ideal reasoner supplied with an infinite stream of data ought to converge to the truth[2]. This seems so utterly common-sensical that I cannot imagine arguing against it, and every Frequentist method that people recommend, as far as I know, is checked for whether this holds—it is an utterly basic, necessary-but-not-sufficient property. Bayesians have it generally good here, though, again, things are not as simple when the domain is infinite-dimensional. For simple situations (both finite and small-infinite domains, which is to say most basic inference problems), if the Bayesian assigns a nonzero prior probability to the truth, he will eventually converge there. Obviously, if your prior on the truth is zero, you will also fail to be eventually correct.
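As a minimal illustration of the nice finite-dimensional case (a sketch only; the Beta-Bernoulli model and all the numbers are arbitrary, and none of this carries over to the nonparametric setting discussed next): a prior with positive density at the truth concentrates there as data accumulate, even if it starts out badly centred.

```python
# Sketch: posterior concentration in a simple conjugate (Beta-Bernoulli) model.
# Any prior with positive density at the true rate eventually concentrates on it.
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.3
a0, b0 = 5.0, 1.0                    # a deliberately badly-centred prior (mean ~0.83)

for n in (10, 100, 1000, 10000):
    x = rng.binomial(1, p_true, size=n)
    a, b = a0 + x.sum(), b0 + n - x.sum()            # conjugate posterior Beta(a, b)
    mean = a / (a + b)
    sd = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    print(f"n={n:6d}   posterior mean={mean:.3f}   posterior sd={sd:.3f}")
# The posterior mean drifts to 0.3 and the posterior sd shrinks toward zero.
```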
The precise conditions for this theorem to apply, however, typically do not hold in nonparametric situations:
However, this note of optimism relies heavily on finite-dimensional intuition and, more particularly, Lebesgue measure. There is absolutely no implication that analogous expectations are justified in non-parametric context. Indeed, Doob’s theorem becomes highly problematic in such models: the theorem stays true exactly as stated, it simply means something else than what finite-dimensional intuition suggests. Strictly speaking, only frequentists recognize consistency problems: Doob’s proof says nothing about specific points in the model, i.e. given a particular P_0 underlying the sample, Doob’s theorem does not give conditions that can be checked to see whether the Bayesian procedure will be consistent at this particular P_0: it is always possible that P_0 belongs to the null-set for which inconsistency occurs. That such null-sets may be large, is clear from example 2.1.18 and that, indeed, this may lead to grave problems in non-parametric situations, becomes apparent when we consider the counterexamples given by Freedman (1963, 1965) [97, 98] and Diaconis and Freedman (1986) [71, 72]. Non-parametric examples of inconsistency in Bayesian regression are found in Cox (1993) [61] and Diaconis and Freedman (1998) [74]. Basically what is shown is that the null-set on which inconsistency occurs in Doob’s theorem can be rather large in non-parametric situations. Some authors are tempted to present the above as definitive proof of the fact that Bayesian statistics are unfit for non-parametric estimation problems. More precise is the statement that not every choice of prior is suitable, raising the question that will entertain us for the rest of this chapter and next: under which conditions on model and prior, can we expect frequentist forms of consistency to hold?
I agree completely with the author here—the situation is not totally dismal, and, for most of these problems, carefully picked priors will work out fine. But it is entirely without proof that the procedure of embedding as much subjective prior information as possible, and praying it all works out, will lead to the level of care needed for this to occur.
In fact, it seems to me much more sensible to imagine that our ideal reasoner’s priors are restricted to those which lead to sensible results in this way, but fixing in advance which priors are allowed for the sake of good properties makes you as Frequentist as the rest of them. The conditions which must be met for such a prior not to lead to pathology are hard to justify from a purely subjectivist viewpoint, though they are often flexible enough that you can decently approximate most such subjective priors with them (e.g. tail-free processes).
Calibration and Coverage
Evidently, we also want our ideal reasoner to accurately report probabilities—that events reported with a given probability actually occur at that rate. Really, we would want our ideal reasoner to always be calibrated—but, for Yudkowsky’s brand of Bayesian, it is unclear that this can be met.
Firstly, every Bayesian expects to be well-calibrated, even the horrifically wrong ones. Observing that your probabilities do not seem correct does not particularly move the Bayesian:
(ii) Subject to feedback, calibration in the long run is otiose. It gives no ground for validating one coherent opinion over another as each coherent forecaster is (almost) sure of his own long-run calibration. (iii) Calibration in the short run is an inducement to hedge forecasts. A calibration score, in the short run, is improper. It gives the forecaster reason to feign violation of total evidence by enticing him to use the more predictable frequencies in a larger finite reference class than that directly relevant.
There are some situations where the Bayesian prediction is optimal (again, related to the complete class theorems and subject to the same regularity conditions), and Bayesian predictions are known to form a martingale (i.e., to be stable) when your model is exactly correct. However, the Bayesian’s probability reporting is a whole other matter. There are plenty of frequentist methods which guarantee calibration, even in finite samples, under minimal conditions—nonparametric, and in particular conformal inference, methods are the way to go. Frankly, I would consider such a calibrated reasoner to be much closer to ideal than an overconfident-yet-only-dubiously-correct Bayesian one.
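To make the contrast concrete, here is a minimal sketch of split conformal prediction (the data-generating process and the deliberately misspecified linear fit are placeholders I made up): under exchangeability alone, the resulting intervals have finite-sample marginal coverage of at least 1 - alpha, no matter how wrong the fitted model is.

```python
# Sketch of split-conformal prediction: finite-sample marginal coverage >= 1 - alpha
# under exchangeability, regardless of how bad the underlying point predictor is.
import numpy as np

rng = np.random.default_rng(1)
n, alpha = 2000, 0.05
x = rng.uniform(-3, 3, n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)          # placeholder data-generating process

fit_idx, cal_idx = np.arange(n // 2), np.arange(n // 2, n)
coef = np.polyfit(x[fit_idx], y[fit_idx], deg=1)       # deliberately misspecified (linear) fit

def predict(t):
    return np.polyval(coef, t)

# Calibration: take the appropriate order statistic of the absolute residuals.
scores = np.abs(y[cal_idx] - predict(x[cal_idx]))
k = int(np.ceil((1 - alpha) * (len(cal_idx) + 1)))
q = np.sort(scores)[k - 1]                             # interval half-width

# Check coverage on fresh data: interval is [predict(x) - q, predict(x) + q].
x_new = rng.uniform(-3, 3, 5000)
y_new = np.sin(x_new) + rng.normal(scale=0.3, size=5000)
covered = np.abs(y_new - predict(x_new)) <= q
print(f"Empirical coverage: {covered.mean():.3f}")     # ~0.95 despite the wrong model
```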
Another issue is, of course, coverage—in the same sense that we want our predictions to be true as often as we say they are, we obviously desire that our reported intervals actually contain the truth at the requisite rate. Consider the following situation:[3] You are a computer programmer, tasked with writing a statistical program for some physical problem (say, estimating a physical constant, for which an experiment is planned in which things are dropped). It is of interest to engineers and physicists alike to find lower and upper bounds for this constant, preferably ones forming as small an interval as you can manage.
Imagine, then, that you set the constant to a known value in your simulation program, and feed the generated data to your interval method, which returns your 95% interval as
| Lower | Upper |
| --- | --- |
| 0.29 | 2.34 |
That doesn’t seem good, but maybe you just got unlucky. So you generate a whole sequence of intervals and they look like
| Lower | Upper |
| --- | --- |
| 6.22 | 6.98 |
| -1.23 | 7.77 |
| 0.75 | 4.55 |
| 10.91 | 12.12 |
and so on, such that your intervals are only actually correct about 1% of the time. I don’t know about you, but I would suspect that we have some kind of bug! This does not seem like desirable behaviour for intervals labelled 95%: if these were predictions, they would all be wrong.
The fact that these may or may not be coherently obtained is irrelevant to their empirical inadequacy—to check whether your Bayesian credence intervals are correctly credal, you simulate random parameters from the prior and then check the intervals obtained from the resulting posteriors, which seems like a strange thing to explain to an engineer who just wants a sensible lower or upper bound. Therefore, I will absolutely also assign this responsibility to an ideal reasoner—that the intervals he reports actually contain the things he says they contain, as often as he says they do.
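Here is a minimal sketch of the distinction I am drawing (the normal-normal model, the prior, and the ‘true’ value are all made up for illustration): a credible interval built from a confident prior centred in the wrong place is perfectly ‘credal’ (it covers about 95% of parameters drawn from its own prior), yet its coverage at the one fixed value the engineer actually cares about can be essentially zero.

```python
# Sketch: frequentist coverage of 95% credible intervals at a fixed true value,
# when the prior is confident and centred far from that value.
# Model: X_1..X_n ~ N(theta, 1), prior theta ~ N(mu0, tau0^2) (conjugate).
import numpy as np

rng = np.random.default_rng(2)
theta_true, n, reps = 10.0, 5, 10000     # made-up fixed truth and sample size
mu0, tau0 = 0.0, 0.5                     # confident prior, centred nowhere near 10

post_var = 1.0 / (1.0 / tau0**2 + n)     # posterior variance (known data variance 1)
hits = 0
for _ in range(reps):
    xbar = rng.normal(theta_true, 1.0 / np.sqrt(n))
    post_mean = post_var * (mu0 / tau0**2 + n * xbar)
    half = 1.96 * np.sqrt(post_var)
    hits += (post_mean - half <= theta_true <= post_mean + half)

print(f"Coverage at the true value: {hits / reps:.3f}")   # close to 0, not 0.95
# Averaged over theta drawn from the prior itself, the same intervals cover ~95%:
# that is the sense in which they are "correctly credal", and it is of no comfort
# to anyone holding a fixed physical constant.
```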
Some people have trouble with this “oftenness” in this statement, claiming that a sequence of imaginary repetitions of the same experiment is illusory and strange to care about—I agree. I can, however, simply imagine that all intervals labelled 95% everywhere contain the truth about 95% of the time—it does not seem imaginary that these will continue to be produced.
So, when do Bayesian intervals have good coverage? The picture here is, again, not hopeless, but somewhat complex. You may pick a probability-matching prior in the style of Berger, but I seriously doubt your prior information is right-Haar invariant. For the subjectivism which Yudkowsky is advocating, the Bayesian he wants can only match this requirement asymptotically. Generally, this happens whenever the conditions of the Bernstein–von Mises theorem hold, which is probably the strongest theorem here, in the sense that it typically assumes the existence of some decent test, as well as the conditions for a well-behaved Maximum Likelihood estimator to converge (such that the Bayesian estimate can converge to it).
Under misspecification, there is no hope—insofar as the Bayesian is approximating the Maximum-Likelihoodist, their answers will both converge to the model closest to the truth in the sense of Kullback–Leibler divergence (ML can be shown to literally minimize it, the Bayesian only asymptotically, though I am not so sure on this), but the Bayesian’s credence regions will diverge away from the Frequentist’s, which have the correct coverage. This isn’t a problem for the ideal Bayesian who always has the correct model somewhere in his mind, but this behaviour is sufficiently pathological that I thought I should mention it.
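A minimal sketch of the misspecification point (the toy model is mine, not anything from the sources above): fit a unit-variance normal model to data whose variance is actually four; the posterior mean still tracks the KL-closest value (the true mean), but the credible intervals are half as wide as they should be, while a sandwich-style interval keeps its coverage.

```python
# Sketch: under misspecification the Bayesian point estimate still heads to the
# KL-closest model, but credible intervals can undercover badly.
# Assumed model: X_i ~ N(mu, 1), flat prior  =>  posterior mu | x ~ N(xbar, 1/n).
# Actual data: X_i ~ N(mu0, 2^2).
import numpy as np

rng = np.random.default_rng(3)
mu0, sigma_true, n, reps, z = 1.0, 2.0, 200, 5000, 1.96
cred_hits = sand_hits = 0
for _ in range(reps):
    x = rng.normal(mu0, sigma_true, size=n)
    xbar, s = x.mean(), x.std(ddof=1)
    cred_hits += abs(xbar - mu0) <= z / np.sqrt(n)        # credible interval (model variance)
    sand_hits += abs(xbar - mu0) <= z * s / np.sqrt(n)    # sandwich-style (empirical variance)

print(f"credible-interval coverage: {cred_hits / reps:.3f}")   # ~0.67, not 0.95
print(f"sandwich-interval coverage: {sand_hits / reps:.3f}")   # ~0.95
```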
Conclusion
Unless you hold steadfastly to coherence, none of the theorems relating to the successes of Bayesianism require you to adhere to a subjectivist framework—in general, the “Bayesian with good frequentist properties” seems closer to optimal, both in practice and as a normative ideal. However, for the sorts of complicated problems Yudkowsky wants to posit a Bayesian ideal for, it does not seem clear that a Bayesian is even ideal at all, let alone good.
Maybe I am missing something, but I know of no properties which describe the subjectivist-Bayesian’s information-aggregation procedure as either necessary or sufficient for any of the niceties of Bayesianism, and it seems to me that not engaging in such a thing will often be better. Frankly, it seems like a bit of a rigmarole to assume that the subjectivist-ideal Bayesian’s prior will make the posterior have these properties without some further specification, but it is not impossible that you could show this somehow—Doob’s theorem is very nice, but it just seems unclear what a subjectivist makes of, say, Schwartz’s condition, a priori.
As a much more informal, personal aside, looking for Bayesian explications of phenomena does not seem like a universally good approach, or even commonly good. Lots of perfectly good methods can be shown to be good without any reference to their Bayesianness—the explanation for why they are successful involves calculations which do not invoke anything that must be Bayesian.
An example Yudkowsky invokes actually illustrates this: the OLS estimator for a linear regression model cannot be Bayesian under any actual, coherent (proper) prior, since it is unbiased. You can, however, approximate it with a wide-enough uniform prior, which seems like a dubious thing to want. Is this really why we consider OLS to work? Why not, instead, theorems like Gauss–Markov, or proofs that it is the UMVUE in certain situations? This seems like an example where the Bayesian explanation can only be an approximation of a working Frequentist one, rather than the other way around.
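To illustrate the “who approximates whom” point numerically (a sketch with synthetic data; I use a wide zero-mean Gaussian prior rather than a wide uniform one, since it gives a closed form): the posterior mean is ridge regression with penalty sigma^2/tau^2, which tends to the OLS estimate as the prior widens but never equals it for any proper prior.

```python
# Sketch: posterior mean under beta ~ N(0, tau^2 I), Gaussian noise with variance
# sigma^2, is ridge regression with lambda = sigma^2 / tau^2.  As tau grows the
# estimate approaches OLS, but for every proper (finite-tau) prior it differs.
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 100, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([2.0, -1.0, 0.5])                    # made-up coefficients
y = X @ beta_true + rng.normal(scale=sigma, size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
for tau in (0.1, 1.0, 10.0, 1000.0):
    lam = sigma**2 / tau**2
    beta_bayes = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    gap = np.abs(beta_bayes - beta_ols).max()
    print(f"tau = {tau:7.1f}   max |posterior mean - OLS| = {gap:.2e}")
# The gap shrinks (roughly like 1/tau^2) but is nonzero for every proper prior.
```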
What about, say, the most-cited method in statistics, Cox’s (not that one) Proportional Hazards model, which is semiparametric and involves a likelihood approximation? There is in fact some Bayesian explanation for why it kind of makes sense, but it is again post-hoc and auxiliary (and involves extremely bizarre priors—I believe the explanation involves some very diffuse gamma-process prior for the baseline hazard, with extra regularity properties to show that the partial likelihood is a workable approximation… Why? How do you arrive at this without post-hoc reasoning?). To me, most proofs showing that a working method has a Bayesian explanation prove that no subjectivist would ever consider it—that someone purely looking for Bayes-structure would take eons to find these great solutions which a Frequentist finds much more easily.
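For readers unfamiliar with the method, here is the object it is actually fit with: the partial likelihood, in which every observed failure contributes the relative hazard of the failing subject within the current risk set, and the baseline hazard cancels entirely (a toy sketch: single covariate, made-up data, no ties, grid search instead of Newton–Raphson). This is the thing any Bayesian story has to reconstruct after the fact.

```python
# Sketch of Cox's partial likelihood for one covariate (no tied event times).
# Each observed event contributes exp(b*x_i) / sum_{j in risk set} exp(b*x_j);
# the baseline hazard cancels out of every term.
import numpy as np

times  = np.array([2.0, 3.0, 5.0, 7.0, 11.0])   # toy follow-up times
events = np.array([1,   0,   1,   1,   0])      # 1 = failure observed, 0 = censored
x      = np.array([0.5, -1.0, 1.2, 0.0, 0.3])   # toy covariate

def log_partial_likelihood(b):
    ll = 0.0
    for i in np.where(events == 1)[0]:
        at_risk = times >= times[i]             # everyone still under observation
        ll += b * x[i] - np.log(np.sum(np.exp(b * x[at_risk])))
    return ll

grid = np.linspace(-3, 3, 601)                  # crude maximisation over a grid
b_hat = grid[np.argmax([log_partial_likelihood(b) for b in grid])]
print(f"partial-likelihood estimate of b: {b_hat:.2f}")
```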
It seems to me that finding a Bayesian explanation for a procedure is a lot like finding a constructivist proof—sometimes desirable, often a comically bad way to actually find solutions to problems as a working man. Find something that works first, then maybe prove that it is Bayesian if it is desirable to do so (most typically in the domains where the complete class theorems hold).
Frankly, you should want to show more than this, and just finding a generically Bayesian explanation is often of no help, other than where it might help you pin down a formalization of the statistical properties which make the algorithm work (which typically involves no prior). It seems to me that you can find one of those for just about any algorithm, no matter how absurdly wrong. So, all in all, I hope the theme is clear:
Prove all things, hold fast unto that which is good
- 1 Thessalonians 5:21
Notes
I would really recommend you read this if you are interested in the more technical aspects that I am skipping over, here, as well as these notes on nonparametrics. A lot of what I’m stating can be found in these pages in one version or another. Less relevantly to the specific issue of an idealized Bayesian, but still a nifty collection of positive and negative results about the effectiveness of Bayesians in general is the following collection by Shalizi.
1. ^
A quick summary of some basic results can be found at https://stats.stackexchange.com/questions/319905/model-with-admissible-estimators-that-are-not-the-bayes-estimator-for-any-choi/319958#319958 - in general, Xi’an’s book is excellent here.
2. ^
For finitists, “an ideal reasoner supplied with an arbitrarily increasing quantity of data ought to get as close to the truth as possible” is not any less common-sensical, though more muddled.
3. ^
Taking clear inspiration from https://projecteuclid.org/journals/bayesian-analysis/volume-1/issue-3/Frequentist-Bayes-is-objective-comment-on-articles-by-Berger-and/10.1214/06-BA116H.pdf, which is a lovely paper and research programme in general.
To clarify: “coherence” here means that your credences obey the probability axioms?
Yep—this is the standard term for the property (e. g. in that Seidenfeld paper).
You might be interested in this post of mine which makes some related claims.
(Interested to read your post more thoroughly but for now have just skimmed it and not sure when I’ll find time to engage more.)
Neat post, thank you! Also, you seem to have posted this twice.
I always feel that Bayesianism/Frequentism debates are somewhat misguided. Both categories are so vague, consisting of many individual elements, often for weird historical reasons. There is also a lot of vibe-based reasoning involved, like what “smells very Frequentist” or vice versa. The global argument here appears more like a political one instead of being about math or epistemology.
It seems more useful to address specific individual disagreements and, in the process, develop a framework that deals with whatever problems Bayesianism and Frequentism have, taking all the best from both, instead of scoring points for these two frameworks and comparing which is better. What’s the point in arguing whether coherence or calibration is more important? Clearly we want both!
A standard example of such a specific disagreement is assigning probabilities to an unfair coin about which you know nothing else. Here, as far as I know, Frequentism is unable to perform, while Bayesianism has a coherent answer: 1⁄2. Bayesianism does look better in this regard, but this is beside the point. What is important is that now we know what answer the optimal framework should produce. Do you know similar specific examples where Frequentism appears superior? If we collect enough of them in both directions, we will be able to conceptualize a strictly superior framework.
I think you’ve misunderstood the point somewhat. On the question of ‘taking the best from both’, that is what Yudkowsky calls a “tool” view, whereas I’m trying to argue against his view of its status as a necessary law of correct reasoning (see my other comment for related points). Insofar as you acknowledge that Frequentists can produce good answers independently of there existing some Bayesian prior-likelihood combination they must be approximating, we agree.
Still, the problem with ‘taking the best from both’ from a philosophically Bayesian view is that it is incoherent—you can’t procedurally pick a method which isn’t derived through a Bayes update of your prior beliefs without incurring all the Dutch-book/axiom-breaking behaviours coherence is supposed to insure against.
Not only that, but insofar as you ‘want both’ (finite-sample) calibration and coherence, you are called to abandon one or the other—insofar as there are Bayesian methods that can get you the former, they are not derived from prior distributions that represent your knowledge of the world (if they even exist in general, anyway—not something I know of).
On your query about coins, 1⁄2 is minimax for the squared error, I believe. But, on a more fundamental level, at least to me most of the point of being Frequentist is to believe that there is no unique and nontrivial optimal framework for reasoning (which is not to say a hodgepodge of principle-less methods)—there are only good properties which a method can or can’t obtain.
Not necessarily. You can have a law-view interpretation of such a synthesis where we conceptualize Bayesianism as an imperfect approximation, a special case of the True Law, which should also capture all the good insights of Frequentism.
I’m not sure I see what exactly Walker is arguing here. Could you recreate the substance of the argument using a specific example, the roll of a 6 sided die, for instance?
We don’t know anything more about the die, and do not have any data from previous throws. Is Steven Walker confused about where we are getting the equiprobable prior from?
I really don’t see why! As far as I know, Bayes’ Theorem and the Law of Large Numbers coexist perfectly. Could you give me some maximally simple example where such a discrepancy happens?
Minimax for the squared error of what? How do you calculate it if you don’t have any access to information about previous tosses of the coin, nor know exactly how it is biased? Could you present your reasoning here step by step? Also, what is your claim here in the first place? That I misunderstand the Frequentist position on the question and they actually agree with Bayesians here?
Hmm… and how do you judge which methods have good properties, and which properties are good in the first place? Doesn’t reasoning about this itself require some initial intuition and accumulated data from previous experience? Therefore essentially satisfying Bayesian structure?
Yes, but such an interpretation falls outside of Yudkowsky’s view as I understand it (for example in that X thread in another comment on this post, and his comments on other statistics topics I’ve seen around—I could fish for the quotes, but I’m a bit held up at this precise moment), which is what I’m focusing on here.
On Walker, on that paragraph he is criticizing the specific (common) practice of comparing separate Bayesian models and picking the best (via ratios or errors or some such) when there is uncertainty about the truth instead of appropriately representing this uncertainty about your sampling model in the prior.
Rolling a die is a bit of a nifty example here, since it’s the case where you assign a separate probability to each label in the sample space, so that your likelihood is in fact fully general—which is where the idea for a Dirichlet prior comes from, as an attempt to generalize this notion of covering all possible models to less trivial problems. In the rest of the intro, Walker points to the practice of Bayesians fitting models with different likelihoods (e.g., Weibull vs. Lognormal—I think he is a survival guy), each with their own “inner” priors, comparing them against each other, and then picking the one which is “best”, as incoherent, since picking a prior just to compare posteriors on some frequentist property like error or coverage is not an accurate representation of your prior uncertainty (instead, he wants you to pick some nonparametric model).
On Bayesian finite-sample miscalibration: simply pick a prior which is sufficiently far off from the true value and your predictive intervals will be very bad for a long time (you may check by simulation on some conjugate model). This is a contrived example, of course, but this happens on some level all the time, since Bayesian methods make no promise of finite-sample calibration—your prediction intervals just reflect belief, not what future data might be (in practice, I’ve heard people complain about this in Bayesian machine-learning type situations). Of course, asymptotically and under some regularity conditions you will be calibrated, but one would rather be right before then. If you want finite-sample calibration, you have to look for a method which promises it. In terms of the coverage of credible intervals more generally, though, unless you want to be in the throes of asymptotics, you’d have to pick what is called a matching prior, which again seems in conflict with subjectivist information input.
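Here is the kind of conjugate-model simulation I mean (a sketch; the true rate and the prior are arbitrary): a perfectly coherent Beta-Bernoulli model with a confident, far-off prior reports one-step-ahead predictive probabilities that are badly wrong for a long stretch of data.

```python
# Sketch: coherent but miscalibrated in finite samples.  A confident Beta prior
# far from the truth; compare the posterior predictive probability of the next
# observation being 1 with the true rate as data accumulate.
import numpy as np

rng = np.random.default_rng(5)
p_true = 0.9
a0, b0 = 1.0, 50.0                    # confident prior that the rate is tiny
x = rng.binomial(1, p_true, size=200)

for t in (1, 10, 50, 100, 200):
    a, b = a0 + x[:t].sum(), b0 + t - x[:t].sum()
    pred = a / (a + b)                # posterior predictive P(next observation = 1)
    print(f"after n = {t:3d}: predictive probability = {pred:.2f}   (true rate {p_true})")
# The predictive probabilities crawl toward 0.9 only once the data overwhelm the
# prior; nothing incoherent has happened, the forecasts are simply miscalibrated.
```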
On minimax: I don’t know how to format math on this site on a phone, so I will be a bit terse here, but the proof is very simple (I thought). In statistics, when we call an estimator “minimax”, it means that it minimizes the maximum risk, which is the expectation of the loss over the distribution of the estimator. Since we have no data, all estimators are some constant c. The expectation of the loss with a constant estimator is just the loss with respect to the parameter (i.e. (c-p)^2). Clearly the maxima are taken when p is 0 or 1, so we minimize the maximum of c^2 and (1-c)^2, which has its minimum at c = 0.5. Which is to say, 1⁄2 has this nice Frequentist property, which is how one could justify it.
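The same computation done numerically, in case the terseness obscures it (a sketch: squared-error loss, no data, constant estimators only):

```python
# Numerical check: with no data, every estimator of p is a constant c, its risk
# under squared error is (c - p)^2, and the worst case over p in [0, 1] is
# minimised at c = 1/2.
import numpy as np

p_grid = np.linspace(0.0, 1.0, 1001)
c_grid = np.linspace(0.0, 1.0, 1001)
worst_risk = [np.max((c - p_grid) ** 2) for c in c_grid]
best = int(np.argmin(worst_risk))
print(f"minimax constant: {c_grid[best]:.2f}   worst-case risk: {worst_risk[best]:.3f}")
# Prints 0.50 and 0.250, matching the c^2 vs (1-c)^2 argument above.
```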
On your last comment, it seems like a bit of an open question to attribute the existence of practical intuition and reasoning about mathematical constructs like this to a Bayesian prior-updating process. Certainly I reason, and I change my mind, but personally I see no reason to imagine this was Bayesian in some way (or that those thoughts were expressed in credence-probabilities which I shifted by conditioning on a type of sense-data), nor that I would ideally be doing this instead. But, I suppose such a thing could be possible.
Maybe! But I would expect him to change his view to something like this in case you managed to persuade him that there is some crucial flaw in Bayesianism. While your goal seems to be to propagate the toolbox-view as the only valid approach. So you might as well engage with a stronger version of law-view right now.
So, suppose that instead of assigning equal probabilities to each label of a die, I consider this as just one of multiple possible models from a set of models with different priors. According to one of them:
P(1) = 1/2, P(2) = P(3) = P(4) = P(5) = P(6) = 1/10
According to another:
P(2) = 1/2, P(1) = P(3) = P(4) = P(5) = P(6) = 1/10
And so on and so forth.
And then I assign an equiprobable prior between these models and start collecting experimental data—see how well all of them perform. Do I understand correctly that Walker considers such an approach incoherent?
In which case, I respectfully disagree with him. While it’s true that this approach doesn’t represent our uncertainty about which label of an unknown die will be shown on a roll, it nevertheless represents the uncertainty about which Bayesian model best approximates the behavior of this particular die. And there is nothing incoherent in modelling the latter kind of uncertainty instead of the former.
And likewise for more complicated settings and models. Whenever we have uncertainty about which model is the best one, we can model this uncertainty and get a probabilistic answer to it via Bayesian methods. And then get a probabilistic answer according to this model, if we want to.
But why would the prior, capturing all your information about a setting, be sufficiently far off from the true value in the first place? This seems to happen mostly when you misuse the Bayesian method by picking some arbitrary prior for no particular reason. Which is a weird complaint. Surely we can also misuse frequentist methods in a similar fashion—p-hacking immediately comes to mind, or just ignoring a bunch of data points altogether. But what’s the point in talking about this? We are interested in situations when the art fails us, not when we fail the art, aren’t we?
Interesting! So is there agreement among frequentists that the probability of an unfair coin about which we know nothing else landing Tails is 1/2? Or is it more like: “Well, we have a bunch of tools and here one of them says 1⁄2, but we do not have a principled reason to prefer it to other tools regarding the question of what the probability is, so the question is still open”.
Is it? I thought everyone was in agreement that Bayes’ theorem naturally follows from the axioms of probability theory. In which case, the only reason why such reasoning doesn’t follow the Bayesian updating procedure is that, somehow, probability theory is not applicable to reasoning about mathematical constructs in particular—but why would that be true?
Oh wait, you don’t think that probability theory is applicable to reasoning in general? Surely I’m misunderstanding you here? Could you elaborate on your position here? I feel that this is the most important crux of disagreement.
Well, maybe, I don’t know. As it stands, it just seems best to argue against what he has said, at his word, than to assume otherwise, insofar as other people take this view at face value. Though, if such a thing does come about, I would of course have to write a different post. This could be some part of LessWrong culture that I am just ignorant of, though, so apologies.
It depends on what you mean by “see how well all of them perform”. In this situation, you can easily get a reasonably small set of models that might represent your total uncertainty, and then (crucially) obtain whatever estimates or uncertainties you desire by updating the posterior of the complete model (including these sub-models—i.e. P(p1,p2,…,p6|X) must be marginalized over M); that is perfectly coherent.
To a Bayesian, this is simply the uniquely-identified distribution function which represents your uncertainty about these parameters—no other function represents this, and any other probability represents some other belief. This of course includes the procedure of maximizing some data score (i.e. finding an empirical ‘best model’), which would be something like P(p1,p2,…,p6|X, M=m*_i), where m*_i = argmax_{m∈M} T(X,m) and T is some model-evaluation score (possibly just the posterior probability of the model).
This seems like a very artificial thing to report as your uncertainty about these parameters, and it essentially guarantees that your uncertainties will be underestimated—among other things, there is no guarantee that such a procedure follows the likelihood principle (for most measures of model correctness other than something in proportion to the posterior probability of each model, but if you have those at hand you might as well just marginalize over them), and by the uniqueness part of Cox’s proof it will break one of the presuppositions there (and therefore likely fall prey to some Dutch Book). Therefore, if you consider these the reasons to be Bayesian, to be a Bayesian and do this seems ill-fated.
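A toy numerical version of the contrast (two made-up “coin-like” candidate models rather than the full die, purely for brevity): marginalizing over the model posterior and conditioning on the single highest-posterior model give visibly different answers, with the latter discarding the weight on the runner-up.

```python
# Sketch: marginalizing over model uncertainty vs. conditioning on the single
# "best" model.  Two candidate models for a binary outcome, equal prior weight.
import numpy as np

models = {"M1": 0.5, "M2": 0.1}          # P(next outcome = 1) under each model
prior  = {"M1": 0.5, "M2": 0.5}
x = np.array([1, 0, 0, 0, 0, 1])         # a short made-up data sequence

def likelihood(p, data):
    return p ** data.sum() * (1 - p) ** (len(data) - data.sum())

post = {m: prior[m] * likelihood(p, x) for m, p in models.items()}
norm = sum(post.values())
post = {m: v / norm for m, v in post.items()}

marginal_pred = sum(post[m] * models[m] for m in models)   # coherent: average over M
best = max(post, key=post.get)
selected_pred = models[best]                               # shortcut: condition on argmax

print("model posterior:", {m: round(v, 3) for m, v in post.items()})
print(f"marginalized predictive:  {marginal_pred:.3f}")
print(f"best-model predictive:    {selected_pred:.3f}")
# The selected-model answer throws away the posterior weight on the other model,
# which is exactly the understatement of uncertainty described above.
```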
Of course, he and I and most reasonable people accept that this could be a useful practical device in many contexts, but it is indeed incoherent in the formal sense of the word.
Well, you told me to grab you the simplest possible example, not the most cogent or common or general one!
But, well, regardless: if you’re looking for this happening in practice, this complaint is sometimes levied at Bayesian versions of machine learning models, where it is especially hard to convincingly supply priors for humongous quantities of parameters. Here’s some random example I found of this happening in a much less complex situation. This is all beside the point, though, which is again that there is no guarantee that I know of for Bayesians to be finite-sample calibrated AT ALL. There are asymptotic arguments, and there are approximate arguments, but there is simply no property which says that even an ideal Bayesian (under the subjectivist viewpoint) will be approximately calibrated.
Note that I also have no reason to assume the prior must be sufficiently close to the true value—since a prior for a subjectivist is just an old posterior, this is tantamount to just assuming a Bayesian is not only consistent but well on his way to asymptotic-land, so the cart is put before the horse.
Few Frequentists believe that there is an absolute principle by which an estimate is the unique best one, no, but this doesn’t make the question ‘still open’—just only as open as any other statistical question (in that, until proven otherwise, you could always pose a method which is just as good as the original in one way or another while also being better in some other dimension). 1⁄2 seems hard to beat, though, unless you specify some asymmetrical loss function instead (which seems like an appropriate amount of openness to me, in that in those situations you obviously do want to use a different estimate).
Yes, of course, but it does not follow from this that any particular bit of thought can be represented as a posterior obtained from some prior. Clearly, whatever thought might be in my head could easily just not follow the axioms of probability, and frankly I think this is almost certain. Maybe there does exist some decent representation of these sorts of practical mental problems, but I would have to see it to believe it. Not only that, but I am doubtful of the value of supposing that whatever ideal this thought ought to aspire to is a Bayesian one (thus the content of the post—consider that a representation of this practical problem is another formally nonparametric one, in that the ideal list of mathematical properties must be comically large; if I am to assume some smaller space, I am implicitly claiming a probability of zero on rather a lot of mathematical laws which could be good but which I cannot conceive of immediately, which seems horrible as an ideal).
There is some incomplete text “If you ” here.
Thank you for spotting that.
You attribute a certain version of Bayesianism to Yudkowsky, but it seems perhaps to be original to Jaynes?
Is it? Certainly Yudkowsky takes a lot of inspiration from Jaynes, but I don’t remember him beating on this particular drum. Of course there were arguments as to why being Bayesian is correct from a philosophical perspective, just not so much that everything must be approximating it. Though, it’s been years since I read him, so I could be wrong.
Bayesian here. I’ll lay down my thoughts after reading this post in no particular order, I’m not trying to construct a coherent argument pro/against the argument of your post, not due to lack of interest but due to lack of time, though in general it’ll be evident I’m generally pro-Bayes:
I have the impression that Yudkowsky has lowered his level of dogmatism since then, as is common with aging. I’ve never read this explicitly discussed by him, but I’ll cite one holistic piece of my evidence: that I’ve learned more about the limitations of Bayes from LessWrong and MIRI than from discussions with frequentists (I mean people preferring or mostly using frequentist stuff, or at least feeling that Bayes is a weird thing, not that they would be ideologues and not use it). Going beyond Bayes seems like a central theme in Yudkowsky’s work, though it’s always framed as extending Bayes. So I’ll take a stab at guessing what Yudkowsky thinks right now about this, and it would be that AIXI is the simplest complete idealized model of a Bayesian agent, it’s of course a model and not reality, and general out-of-formal-math evidence aggregation points to Bayes being substantially an important property of intelligence, agency, and knowledge that’s going to stick around like Newton is still used after Einstein, and this amounts to saying that Bayes is a law, though he wouldn’t today describe it with the biblical vibes he had when writing the sequences.
I have empirically observed that Bayes is a good guide to finding good statistical models, and I venture that if you think otherwise you are not good enough at the craft. It took me years of usage and study to use it myself in that sense, rather than just using Bayesian stuff as pre-packaged tools, basically equivalent to other frequentist stuff if not for idiosyncratic differences in convenience in each given problem.
I generally have the impression that the mathematical arguments you mention focus a lot on the details and miss the big picture. I don’t mean that they are wrong; I trust that they are right. But the overall picture I’d draw out of them is that Bayes is the right intuition, it’s substantially correct, though it’s a simplified model of course and you can refine it in multiple directions.
Formally, frequentist estimators are just any function. Though of course you’ll require sensible properties of your estimators, there’s no real rule about what a good frequentist estimator should be. You can ask it to be unbiased, or to minimize MSE, under i.i.d. repetition. What if i.i.d. repetition is not a reasonable assumption? What if unbiasedness is in contradiction with minimizing MSE? Bayes gives you a much, much smaller subset of stuff to pick from in a given problem, though still large in absolute terms. That’s the strength of the method: that in practice you should not need anything outside that much smaller subset of potential solutions for your practical inference problems. They are also valid as frequentist solutions, but this does not mean that Bayes is equivalent to frequentist, because the latter does not select so specifically those solutions.
OLS is basically Bayesian. If you don’t like the improper prior, pick a proper very diffuse one. This should not matter in practice. If it happens to matter in some case, I bet the setup is artificial and contrived. OLS is not a general model of agency and intelligence, it’s amongst the simplest regression models, and it need not work under extreme hypothetical scenarios, it needs to work for simple stuff. If I ran OLS and got beta_1 = 1′000′000′000′000, I would immediately think “I fucked up”, unless I was already well aware of putting wackily unscaled values into the procedure, so a wide proper prior matches practical reasoning at an intuitive level. Which does not mean that Bayes is a good overall model of my practical reasoning at that point, which should point to “re-check the data and code”, but I take it as a good sign for Bayes that it points in the right direction within the allowance of such a simplified model.
Thank you for the many pointers to the literature on this, this is the kind of post one gets back to in the future (even if you consider it a rush job).
Hello! Thank you for the comment, these are good points.
I do not consider myself a Rationalist nor know much of anything about Yudkowsky’s more current positions on this subject, but I probably should have mentioned somewhere in the post that this article was partly motivated by this discussion on X, and his comment. I must admit I do not really grasp what he is gesturing towards with the point he makes there, but it seems like he still believes some version of the original point as stated.
This post is not about Bayesian inference as practiced by mortal statistical workers; I have other reasons to justify my Frequentism there, but I wrote this so as to eschew the “Tool vs. Law” distinction that seems to sometimes be drawn here. Of course, Bayesian methods in statistics are sometimes useful (it’s hard to justify a hierarchical model without reference to conditioning, and “H-likelihood” feels like the sort of post-hoc methodological loop-the-loop that I criticize Bayesians for), and I have used them myself here and there. I am very interested to hear what methods you derived through Bayesian thinking which are not equivalent to a Frequentist estimate, though!
I agree with you here, almost completely—it just doesn’t seem like what Yudkowsky is saying. To wit:
(Though I would personally add that, even though it’s probably the best unifying principle in statistics, there is no need to adhere to any such general principle when there are better alternatives.)
This is one of those practical questions which I tried to avoid here (maybe I should just write a separate Frequentism post eventually), but yes, I agree, and would characterize this as probably the biggest advantage of Bayesian methods in practice—that they are “plug-and-play”, that if you specify a minimally sensible model you have strong guarantees (in nice, parametric, problems) that your answers will be sensible too.
I imagine this is why they are most often seen in fields like astrophysics, where you don’t want to seek out the best methods for really complicated physical models—you just want something that works well without having to worry. Still, the comparative strength of Frequentism is being able to specify and more directly obtain exactly what you want, sometimes optimally. An easy example is exact finite-sample calibration: if I want my predictions to be calibrated (and there are many situations in which I do), the methods which will guarantee that I get this will involve conformal inference methods or the like. I don’t have to wrangle a prior which matches this or hope everything works out. Other examples are, say, robustness, or experiment design.
You comment on assumptions here, but in my opinion you have it backwards—if your Bayesian model handles non-i.i.d.-ness well, this is because the dependency shows up in the likelihood, which (say) the MLE still handles quite well (vaguely asymptotically efficient and so on). What if you want to be distribution-free, or want to check whether your answers are robust to your model being wrong in some directions? Maybe there will be better Bayesian answers here someday—statistics generally is a young field—but (in practice) I think the Frequentists just take the cake on this one.
This is again correct, of course, but I am specifically criticizing the essence of Yudkowsky’s point of “if it’s any good, it must be approximating a Bayesian answer”: who’s approximating whom? Here it seems much more sensible to say that we have a good answer (the OLS estimate), one that we have reasons to prefer in some scenarios (e.g. Gauss–Markov, general distribution-free niceness), which a Bayesian method, strictly speaking, can only approximate, and which seems at odds with a pure subjectivist point of view (because the prior is incoherent, though this is much more salient in the Cox model example). Indeed, in practice this is irrelevant.
A general way my mental model of how statistics works disagrees with what you write here is on whether the specific properties that are in different contexts required of estimators (calibration, unbiasedness, minimum variance, etc.) are the things we want. I think of them as proxies, and I think Goodhart’s law applies: when you try to get the best estimator in one of these senses, you “pull the cover” and break some other property that you would actually care about on reflection but are not aware of.
(Not answering many points in your comment to cut it short, I prioritized this one.)