On Frequentism and Bayesian Dogma

I’ve heard that you believe that frequentism is correct. But that’s obviously wrong, so what gives?

DanielFilan

I guess first of all I should ask, what do you mean by “frequentism”?

DanielFilan

I mean classical statistical frequentism. Though I’m being somewhat tongue-in-cheek: I don’t think it’s fully correct, just much more correct than orthodox Jaynesian Bayesianism.

Some scattered thoughts:

  • Bayes’ theorem derives from conditional probability, so it’s also included in frequentism.

  • Bayesian epistemology only applies to situations where your beliefs form a probability distribution, and is thus incomplete.

    • It doesn’t account for e.g. limited computation.

  • Frequentism solves these things by framing the problem in a different way. Rather than ‘how should I think?’, it’s “this algorithm seems like a sensible way to think, let’s figure out what epistemic guarantees it has”.

    • In particular, it makes it OK to believe things that are not expressible as probability distributions.

Adrià Garriga-alonso

I’m still sort of unsure what you mean by “classical statistical frequentism”. Like, I’m pretty sure I agree that the purported theorems of Fisher are in fact theorems. Do you mean something like “the way we should think about thinking is to ask ‘what cognitive algorithms perform well with high probability in the long run’”?

DanielFilan

(and regarding Bayesianism, I think that’s a separate question that’s more productively talked about once I understand why you think frequentism is good)

DanielFilan

Sure. Thank you for the clarification—I agree there are many fine theorems on both sides.

Statistics is the problem of learning from data; frequentism says, “Here’s an algorithm that takes the data and computes something (that’s an estimator). Let’s study the properties of it.”

“the way we should think about thinking is to ask ‘what cognitive algorithms perform well with high probability in the long run’”?

Yeah, I agree with this. Frequentist theory attempts to do exactly that (with ‘perform well’ usually meaning ‘have correct beliefs’), though I recognize it’s pretty hard in practice.

Adrià Garriga-alonso

Here’s an algorithm that takes the data and computes something (that’s an estimator). Let’s study the properties of it.

I first want to note that this is an exhortation and not a proposition. Regarding the implicit exhortation of “it’s good to understand estimators”, I guess I agree? I think my crux is something like “but it’s relevant that some worlds are a priori more likely than others, and you want to do better in those” (and of course now we’re in the territory where we need to argue about Bayesianism).

DanielFilan

I first want to note that this is an exhortation and not a proposition

Sure. I mean that the examples of knowledge frequentism creates (theorems and such) are derived from this exhortation. E.g. “consider the algorithm of taking the sample mean. What does that say about the true population mean?” is a very classical frequentist question with a useful answer.
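For concreteness, here’s a minimal sketch of that classical question in NumPy (the exponential population and the 1.96 normal-approximation constant are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)  # stand-in population with true mean 2.0

n = len(data)
mean = data.mean()
se = data.std(ddof=1) / np.sqrt(n)  # standard error of the sample mean

# CLT-based 95% confidence interval: the frequentist guarantee is that,
# over repeated samples, intervals built this way cover the true mean
# roughly 95% of the time.
lo, hi = mean - 1.96 * se, mean + 1.96 * se
print(f"sample mean = {mean:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```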

but it’s relevant that some worlds are a priori more likely than others

Sure, you can analyze whether your estimator does well in this way!

and of course now we’re in the territory where we need to argue about Bayesianism).

Why do you say that?

Adrià Garriga-alonso

I guess you’re referencing the part where frequentism is like “propositions are only true or false, you can’t believe in probabilities”.

Fine. But you can have the ‘probabilities’ be numbers in your estimator-algorithm. It is true that, at the end of the day, propositions are either true or false.

In fact, outputting the ‘Bayesian probability’ is an estimator with good properties for estimating the truth/falsehood of a proposition, under Brier loss or whatever. So that’s a draw for freq vs Bayes.
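To sketch that claim (with an arbitrary illustrative base rate of 0.7): the Brier score is a proper scoring rule, so forecasting the true conditional probability beats any other constant forecast on average.

```python
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.7  # illustrative 'Bayesian probability' that the proposition is true
outcomes = (rng.random(100_000) < p_true).astype(float)

def brier(forecast, y):
    return np.mean((forecast - y) ** 2)

# Reporting p_true itself achieves the lowest average Brier loss.
for q in [0.0, 0.5, 0.7, 1.0]:
    print(f"forecast {q}: Brier loss {brier(q, outcomes):.3f}")
```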

Adrià Garriga-alonso

I guess I think of ‘Frequentism’ as definitely believing in probabilities—just probabilities of drawing different samples, rather than a priori probabilities, or probabilities of ground truths given a sample. So I feel that the question is which type of probability is more important, or more relevant. (Like, I certainly agree that “understand what sort of algorithms do well in probabilistic settings” is the right way to think about cognition, and you don’t have to over-reify the features of those cognitive algorithms!)

DanielFilan

Another potential question could be “how valuable is it for cognitive algorithms to not be constrained by having their main internal representations be probability distributions over possible worlds”.

DanielFilan

Basically you presented this as a hot take, and I’m trying to figure out where you expect to disagree with people.

DanielFilan

Another possible question: how valuable is the work produced by frequentist statisticians?

DanielFilan

So I feel that the question is which type of probability is more important, or more relevant.

I’m not sure I agree that this is the important question. Or rather it is, but I would answer it pragmatically: what sort of approaches to epistemology does focusing on each type of probability produce, and what questions and answers does it lead you to? And I think here frequentism wins. This ties in neatly with:

how valuable is the work produced by frequentist statisticians?

Historically pretty valuable! It’s good to understand the guarantees of e.g. the ‘sample mean’ estimator. Bandit algorithms are also a glorious frequentist achievement, as argued by Steinhardt. The bootstrap method, a way to figure out your estimator’s uncertainty without assuming much about the data (no Bayesian prior distribution, for one), is also great.
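For concreteness, a minimal bootstrap sketch (the lognormal data and the median-as-estimator choice are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.lognormal(sigma=1.0, size=200)  # skewed, decidedly non-Gaussian data

# Bootstrap: resample the data with replacement many times and look at the
# spread of the estimator across resamples -- no prior, and almost no
# assumptions about the data-generating distribution.
estimates = np.array([
    np.median(rng.choice(data, size=len(data), replace=True))
    for _ in range(10_000)
])
lo, hi = np.percentile(estimates, [2.5, 97.5])
print(f"median = {np.median(data):.3f}, bootstrap 95% interval = ({lo:.3f}, {hi:.3f})")
```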

But I think the theoretical pickings are pretty slim at this point—cool stuff, but it’s unlikely that there’ll be something as fundamental as the sample mean.

The field to which statistics is now most relevant is machine learning, and here I think frequentists have won an absolute victory: all the neural networks are probabilistic, but Bayesian ML needs way more computation than mainstream ML for the same or worse results.

And IMO this is because of an overreliance on “the theory says the algorithm will work if done this way, therefore we’re going to do it this way” versus a willingness to experiment with various algorithms (i.e. estimators) without quite understanding why they work, and seeing which ones work.

Adrià Garriga-alonso

how valuable is it for cognitive algorithms to not be constrained by having their main internal representations be probability distributions over possible worlds

I think this is very valuable as exemplified by Bayesian vs mainstream ML.

Adrià Garriga-alonso

OK, I’m getting a better sense of where our disagreements may lie.

I agree that the historical record of frequentist statistics is pretty decent. I am somewhat more enthusiastic about more “Bayesian” approaches to bandits, e.g. Thompson sampling, than it sounds like you are, but this might just be tribalism—and if I think about the learning algorithm products of the AI alignment community that I’m excited about (logical induction, boundedly rational inductive agents), they look more frequentist than Bayesian.
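For reference, a minimal Beta-Bernoulli Thompson sampling sketch (the arm payout rates are made-up numbers): draw once from each arm’s posterior, pull the argmax, and exploration falls out of posterior uncertainty.

```python
import numpy as np

rng = np.random.default_rng(3)
true_rates = np.array([0.3, 0.5, 0.6])  # hypothetical arm payout probabilities
alpha = np.ones(3)  # Beta(1, 1) prior on each arm's rate
beta = np.ones(3)

for _ in range(5_000):
    samples = rng.beta(alpha, beta)      # one posterior draw per arm
    arm = int(samples.argmax())          # act greedily on the draw
    reward = float(rng.random() < true_rates[arm])
    alpha[arm] += reward                 # conjugate posterior update
    beta[arm] += 1.0 - reward

print("pulls per arm:", alpha + beta - 2)  # concentrates on the best arm
```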

I think my real gripe is that I see the massive impact of frequentism on the scientific method as promoting the use of p-values and confidence intervals, which, IMO, use conditional probabilities in the wrong direction (one way to tell: ask any normal scientist what a p-value or a confidence interval is, and there’s a high chance they’ll give an explanation of what the Bayesian equivalent would be).
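To make the ‘wrong direction’ point concrete with hypothetical numbers: a p-value is P(data at least this extreme | null), not P(null | data), and the two can diverge badly when few tested hypotheses are true.

```python
# Hypothetical numbers: 1 in 10 tested hypotheses describe real effects,
# tests run at alpha = 0.05 with power = 0.8.
prior_real = 0.10
alpha, power = 0.05, 0.80

p_significant = prior_real * power + (1 - prior_real) * alpha  # P(p < alpha)
p_real_given_sig = prior_real * power / p_significant          # Bayes' theorem
print(f"P(effect is real | significant result) = {p_real_given_sig:.2f}")  # ~0.64
```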

Now, I think it’s sort of fair enough to say “but that’s not what Ronald Fisher would do” or “people would and do misuse Bayesian methods too”, and all of these are right (as a side note, I’ve noticed that when people introduce Bayes’ theorem in arguments about religion they’re typically about to say something unhinged), but it’s notable to me that tons of people seem to want the Bayesian thing.

---

Regarding Bayesian vs standard machine learning: on the one hand, I share your impression that the Bayesian methods are terrible and don’t work, and that empiricism / tight feedback loops are important for making progress. On the other hand, as far as I can tell, the ML community is on track to build things that kill me and everyone I care about and also everyone else, and I kind of chalk this up to them not understanding enough about the generalization properties of their algorithms. So I actually don’t take this as the win for frequentism that it looks like.

DanielFilan

it’s notable to me that tons of people seem to want the Bayesian thing.

I agree that Bayesian statistics are more intuitive than p-values. It’s sad in my opinion that you need to assume prior probabilities about your hypotheses to get the Bayesian-style p(hypothesis | data), which is what we all love. But the math comes out that way.

Maybe log-likelihood ratios would also be better to report in papers (you can add them up!), but then people would add up log-likelihood ratios for slightly different hypotheses and convince themselves that the result is valid (it can be, but it’s unclear what assumptions you need for that), and it would be a huge mess. That’s not your strongest point though.
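A small sketch of the ‘you can add them up’ property for i.i.d. data under one fixed pair of simple hypotheses (the Gaussian hypotheses are illustrative; this fixed-hypotheses assumption is exactly what sloppy aggregation across papers would violate):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
data = rng.normal(loc=0.5, size=50)  # generated under H1

def llr(x):
    """Log-likelihood ratio of H1 (mean 0.5) vs H0 (mean 0) for i.i.d. data."""
    return (norm.logpdf(x, loc=0.5) - norm.logpdf(x, loc=0.0)).sum()

# Two 'papers' testing the same pair of hypotheses pool by simple addition.
paper1, paper2 = data[:25], data[25:]
assert np.isclose(llr(paper1) + llr(paper2), llr(data))
print(f"pooled log-likelihood ratio = {llr(data):.2f}")
```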

On the other hand, as far as I can tell, the ML community is on track to build things that kill me and everyone I care about and also everyone else

Now we’re talking!

I kind of chalk this up to them not understanding enough about the generalization properties of their algorithms

Fair, but that doesn’t mean you can chalk it up to frequentism. I don’t think the Bayesian approach (here I very much mean the actual Bayesian ML community[1]) is any better at this. They work kind of backwards: instead of fitting their theory to observable data about experiments, they assume Bayesian theory and kind of shrug when the experiments don’t work out. IMO the right way to understand generalization is to have a theory, and then change it when the experiment contradicts it.

Part of the reason this is justifiable to the Bayesian ML folks is that the experiments aren’t quite about Bayesian theoretical ideal, they’re about practical algorithms. My position here is that I would like my theories to talk about the actual things people do. I am wary of theorems about asymptotics for the same reason: technically they don’t talk about what happens in finite time.

In my opinion we should discard the culture of this particular academic sub-field, and talk about how good the best possible Bayesian approach to understanding ML generalization would be. Two versions of this:

  1. Understand existing algorithms. I claim that the fixation on having the only valid beliefs be well-specified probability distributions, and the lack of claims about what happens in any finite time, would make it impossible to make progress. Though maybe the dev-interp people will succeed (I doubt it, but we’ll see; and they’re studying MCMC’s behavior in practice, so not quite Bayesian).

  2. Create Bayesian algorithms that are therefore well understood. This is the holy grail of Bayesian ML, but I don’t think this will happen. Maintaining beliefs as probabilities that are always internally self-consistent is expensive and not always necessary, and also IMO not all beliefs are representable as probability distributions (radical uncertainty). Also, you need a better understanding of good reasoning under finite computation, which, as you wrote above, is more frequentist. (I agree with this point, and I think it’s frequentist because frequentism is about analyzing estimators.)

  1. ^

    Examples of people who made this error: myself from 6 years ago, myself from 3 years ago. I would argue many of my grad-student peers and professors made (and still make) the same mistake. Yes, this formative experience is an important contributor to the existence of this dialogue.

Adrià Garriga-alonso

Part of the reason this is justifiable to the Bayesian ML folks is that the experiments aren’t quite about Bayesian theoretical ideal, they’re about practical algorithms. My position here is that I would like my theories to talk about the actual things people do.

I think this suggests a place where I have some tension with your view: while I certainly agree that theories should be about the things people actually do, and that Bayesianism can fall short on this score, I also want theories to meaningfully guide what people do! Cognitive algorithms can be better and worse, and we should use (and analyze) the better ones, rather than the worse ones. One way of implementing this could be “try a bunch of cognitive algorithms and see what works”, but once your algorithms include “play nice while you’re being tested then take over the world”, empiricism isn’t enough: we either need theory to guide us away from those algorithms, or we need to investigate the internals of the algorithms that we try, and make sure they comply with certain standards that rule out treacherous turn behaviour.

Now, this theory of what algorithms should look like or what they should have in their internals doesn’t have to be Bayesianism—in fact, it probably doesn’t work for it to be Bayesianism, because to understand a Bayesian you need to understand their event space, which could be weird and inscrutable. But once you’ve got such a theory, I think you’re at least outside of the domain of “mere frequentism” (altho I admit that in some sense any time you think about how an algorithm works in a probabilistic setting you’re in some sense a frequentist).

DanielFilan

As a side note:

Also you need a better understanding of good reasoning under finite computation which, as you wrote above, is more frequentist.

This might be an annoying definitional thing, but I don’t think good reasoning under finite computation has to be ‘frequentist’. As an extreme example, I wouldn’t call Bayes net algorithms frequentist, even tho with finite size they run in finite time. I call logical induction and boundedly rational inductive agents ‘frequentist’ because they fall into the family of “have a ton of ‘experts’ and play them off against each other” (and crucially, don’t constrain those experts to be ‘rational’ according to some a priori theory of good reasoning).
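For a flavour of that family, a minimal multiplicative-weights (‘Hedge’) sketch; the experts themselves can be arbitrary, and the regret guarantee attaches to the aggregation rule (the losses and learning rate are illustrative):

```python
import numpy as np

def hedge(expert_losses, eta=0.1):
    """Aggregate K experts over T rounds; expert_losses is a (T, K) array in [0, 1]."""
    T, K = expert_losses.shape
    weights = np.ones(K)
    total_loss = 0.0
    for t in range(T):
        probs = weights / weights.sum()
        total_loss += probs @ expert_losses[t]      # expected loss this round
        weights *= np.exp(-eta * expert_losses[t])  # downweight bad experts
    return total_loss

rng = np.random.default_rng(5)
losses = rng.random((1_000, 5))
losses[:, 2] *= 0.3  # expert 2 is reliably better
# Hedge's cumulative expected loss ends up close to the best single expert's.
print(hedge(losses), losses.sum(axis=0).min())
```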

DanielFilan

Good point. True Bayesian algorithms are only finite if the world is finite, though; and the world is too large to count as finite for the purposes of a competent AGI. I should have said “with computation bounded below what the world requires”, or something similar but less unwieldy.

Adrià Garriga-alonso

Now, this theory of what algorithms should look like or what they should have in their internals doesn’t have to be Bayesianism—in fact, it probably doesn’t work for it to be Bayesianism, because to understand a Bayesian you need to understand their event space, which could be weird and inscrutable. But once you’ve got such a theory, I think you’re at least outside of the domain of “mere frequentism” (altho I admit that in some sense any time you think about how an algorithm works in a probabilistic setting you’re in some sense a frequentist).

I agree with all of this. I call this “Bayesianism is wrong and frequentism is correct”, maybe I shouldn’t call it that?

Adrià Garriga-alonso

Well, I was more thinking of Bayesianism as being insufficient for purpose, rather than necessarily “wrong” here.

DanielFilan

I feel like we’ve transformed the initial dispute into a new, clearer, and more exciting dispute. Perhaps this is a good place to stop?

DanielFilan

I’m not sure we agree on what the new dispute is, I’d like to explore that! But perhaps the place for that is another dialogue.

I would say Bayesianism is wrong like Newtonian mechanics is wrong. It’s a very good approximation of reality for some domains (in Newtonian mechanics’ case, macroscopic objects at low energy scales, in Bayesian statistics’ case, epistemic problems with at most ~millions of possible outcomes).

The frequentist frame I presented here (let’s analyze some actual algorithms) is IMO more likely to point at the kind of thing we want out of a theory of epistemology. But I guess classical frequentist methods are also not close to solving alignment, and they didn’t accurately predict that deep NNs would work so well (they have so many parameters, you’re going to overfit!).

So maybe frequentism is wrong in the same way. But I think the shift from “the theory is done and should guide algorithms” to “the theory should explain what’s going on in actual algorithms” is important.

Maybe we should write a consensus statement to conclude?

Adrià Garriga-alonso

I guess we have a few disagreements left...

I would say Bayesianism is wrong like Newtonian mechanics is wrong. It’s a very good approximation of reality for some domains

I wouldn’t think about Bayesianism this way—I’d say that Bayesianism is the best you can do when you’re not subject to computational / provability / self-reflection limitations, and when you are subject to those limitations, you should think about how you can get what’s good about Bayesianism for less of the cost.

But I think the shift from “the theory is done and should guide algorithms” to “the theory should explain what’s going on in actual algorithms” is important.

This still feels incomplete to me for reasons described in my earlier comment: Yes, it’s bad to be dogmatic about theories that aren’t quite right, and yes, theories have got to describe reality somehow, but also, theories should guide you into doing good things rather than bad things!

DanielFilan

How about this as a consensus statement?

Frequentism has the virtue of describing the performance of algorithms that are possible to run, without being overly dogmatic about what algorithms must look like. By contrast, Bayesianism is only strictly applicable in cases where computation is unlimited, and its insistence on restricting attention to algorithms that carry around probability distributions updated via likelihood ratios is overly constraining. In future, we need to develop ways of thinking about cognitive algorithms that describe real algorithms that can actually be run, while also providing useful guidance.

DanielFilan

I’d say that Bayesianism is the best you can do when you’re not subject to computational / provability / self-reflection limitations,

I disagree with this, by the way. Even under these assumptions, you still have the problem of handling belief states which cannot be described as a probability distribution. For small state spaces, being fast and loose with that (e.g. just believing the uniform distribution over everything) is fine, but larger state spaces run into problems, even if you have infinite compute and can prove everything and don’t need to have self-knowledge.

Adrià Garriga-alonso

I endorse the consensus statement you wrote!

Adrià Garriga-alonso

And perhaps a remaining point of dispute is: how important is it to have non-probabilistic beliefs?

DanielFilan

Sure, I’m happy to leave it at that. Thank you for being a thoughtful dialogue partner!

Adrià Garriga-alonso

Thanks for the fun chat :)

DanielFilan