Bayesian here. I’ll lay down my thoughts after reading this post, in no particular order. I’m not trying to construct a coherent argument for or against your post, not for lack of interest but for lack of time, though it will be evident that I’m generally pro-Bayes:
I have the impression that Yudkowsky has become less dogmatic since then, as is common with age. I’ve never read him discuss this explicitly, but I’ll cite one holistic piece of evidence: I’ve learned more about the limitations of Bayes from LessWrong and MIRI than from discussions with frequentists (by which I mean people who prefer or mostly use frequentist methods, or who regard Bayes as a weird thing, not ideologues who refuse to use it). Going beyond Bayes seems like a central theme in Yudkowsky’s work, though it’s always framed as extending Bayes. So I’ll take a stab at guessing what Yudkowsky thinks about this right now: that AIXI is the simplest complete idealized model of a Bayesian agent; that it is of course a model and not reality; and that general, informal evidence aggregation points to Bayes capturing an important property of intelligence, agency, and knowledge, one that will stick around the way Newton is still used after Einstein. This amounts to saying that Bayes is a law, though he wouldn’t describe it today with the biblical vibes he had when writing the Sequences.
I have empirically observed that Bayes is a good guide to finding good statistical models, and I venture that if you think otherwise, you are not yet good enough at the craft. It took me years of usage and study to use it in that sense myself, rather than treating Bayesian methods as pre-packaged tools, basically equivalent to frequentist ones apart from idiosyncratic differences in convenience on any given problem.
I generally have the impression that the mathematical arguments you mention focus a lot on the details and miss the big picture. I don’t mean that they are wrong; I trust that they are right. But the overall picture I’d draw from them is that Bayes is the right intuition and substantially correct, though it’s a simplified model, of course, and you can refine it in multiple directions.
Formally, a frequentist estimator is just any function of the data. Of course you’ll require sensible properties of your estimators, but there’s no real rule about what a good frequentist estimator should be. You can ask it to be unbiased, or to minimize MSE, under i.i.d. repetition. What if i.i.d. repetition is not a reasonable assumption? What if unbiasedness conflicts with minimizing MSE? Bayes gives you a much, much smaller set of candidates to pick from in a given problem, though still a large one in absolute terms. That’s the strength of the method: in practice, you should not need anything outside that much smaller set of potential solutions for your inference problems. Those solutions are also valid as frequentist solutions, but this does not make Bayes equivalent to frequentism, because frequentism does not single them out so specifically.
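To make the unbiasedness-vs-MSE tension concrete, here is a small illustrative simulation of my own (not from the post): for Gaussian data, the standard unbiased variance estimator (dividing the sum of squared deviations by n−1) does not minimize mean squared error; dividing by n+1 does, at the cost of bias. "Unbiased" and "minimum MSE" really do pull apart.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 10, 4.0  # small samples from N(0, 4)

# 100,000 replications of a size-n sample
samples = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n))
ss = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

# MSE of the unbiased estimator (divisor n-1) vs the biased one (divisor n+1)
mse_unbiased = np.mean((ss / (n - 1) - sigma2) ** 2)
mse_shrunk = np.mean((ss / (n + 1) - sigma2) ** 2)

print(mse_unbiased, mse_shrunk)  # the biased n+1 divisor has lower MSE
```

Theory agrees: for normal data the MSEs are 2σ⁴/(n−1) and 2σ⁴/(n+1) respectively, so no estimator can be best in both senses at once.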
OLS is basically Bayesian. If you don’t like the improper prior, pick a proper but very diffuse one. This should not matter in practice; if it happens to matter in some case, I bet the setup is artificial and contrived. OLS is not a general model of agency and intelligence; it’s among the simplest regression models, and it need not work under extreme hypothetical scenarios, it needs to work for simple stuff. If I ran OLS and got beta_1 = 1,000,000,000,000, I would immediately think “I fucked up”, unless I was already well aware of feeding wildly unscaled values into the procedure, so a wide proper prior matches practical reasoning at an intuitive level. That does not mean Bayes is a good overall model of my practical reasoning at that point, which should point to “re-check the data and code”, but I take it as a good sign for Bayes that it points in the right direction within the allowance of such a simplified model.
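A quick numerical sketch of the "diffuse proper prior" point (my own toy setup, with assumed noise variance 1): the posterior mean under a very flat Gaussian prior on the coefficients is ridge regression with a tiny penalty, which is numerically indistinguishable from plain OLS on any well-scaled problem.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=200)

# Plain OLS: solve the normal equations
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Posterior mean under beta ~ N(0, tau2 * I) with unit noise variance:
# (X'X + I/tau2)^{-1} X'y, i.e. ridge with a vanishing penalty
tau2 = 1e8  # very diffuse prior
beta_bayes = np.linalg.solve(X.T @ X + np.eye(3) / tau2, X.T @ y)

print(np.max(np.abs(beta_ols - beta_bayes)))  # tiny; the two coincide in practice
```

The gap only becomes visible if the coefficients are on an absurd scale relative to the prior, which is exactly the "I fucked up" regime described above.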
Thank you for the many pointers to the literature on this; this is the kind of post one comes back to in the future (even if you consider it a rush job).
Hello! Thank you for the comment, these are good points.
I do not consider myself a Rationalist nor know much of anything about Yudkowsky’s more current positions on this subject, but I probably should have mentioned somewhere in the post that this article was partly motivated by this discussion on X, and his comment. I must admit I do not really grasp what he is gesturing towards with the point he makes there, but it seems like he still believes some version of the original point as stated.
This post is not about Bayesian inference as practiced by mortal statistical workers; I have other reasons to justify my Frequentism there, but I wrote this so as to eschew the “Tool vs. Law” distinction that sometimes seems to be drawn here. Of course, Bayesian methods in statistics are sometimes useful (it’s hard to justify a hierarchical model without reference to conditioning; “H-likelihood” feels like the sort of post-hoc methodological loop-the-loop that I criticize Bayesians for), and I have used them myself here and there. I am very interested to hear which methods you derived through Bayesian thinking that are not equivalent to a Frequentist estimate, though!
I agree with you here, almost completely—it just doesn’t seem like what Yudkowsky is saying. To wit:
I have been justly chastised by the discussants for spreading alarm about the health of the body Bayesian. Certainly it has held together more successfully than any other theory of statistical inference, and I am not predicting its imminent demise. But no human creation is completely perfect, and we should not avert our eyes from its deficiencies. In its solipsistic satisfaction with the psychological self-consistency of the schizophrenic statistician, it runs the risk of failing to say anything useful about the world outside.
(Though I would personally add that, even though it’s probably the best unifying principle in statistics, there is no need to adhere to any such general principle when there are better alternatives.)
This is one of those practical questions which I tried to avoid here (maybe I should just write a separate Frequentism post eventually), but yes, I agree, and would characterize this as probably the biggest advantage of Bayesian methods in practice: they are “plug-and-play”, in that if you specify a minimally sensible model, you have strong guarantees (in nice, parametric problems) that your answers will be sensible too.
I imagine this is why they are most often seen in fields like astrophysics, where you don’t want to seek out the best methods for really complicated physical models; you just want something that works well without having to worry. Still, the comparative strength of Frequentism is being able to specify and more directly obtain exactly what you want, sometimes optimally. An easy example is exact finite-sample calibration: if I want my predictions to be calibrated (and there are many situations in which I do), the methods that guarantee this will involve conformal inference or the like. I don’t have to wrangle a prior that happens to match this, or hope everything works out. Other examples arise in, say, robustness or experimental design.
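To illustrate the calibration point, here is a minimal split-conformal sketch (my own toy example, not from the post): fit any model, compute absolute residuals on a held-out calibration set, take the appropriate order statistic as the interval half-width, and the resulting prediction intervals have guaranteed finite-sample marginal coverage of at least 90%, even though the model below is deliberately misspecified.

```python
import numpy as np

rng = np.random.default_rng(2)

def split_conformal_coverage():
    n_train, n_cal, n_test = 200, 200, 1000
    x = rng.uniform(-2, 2, n_train + n_cal + n_test)
    y = np.sin(3 * x) + rng.normal(0, 0.3, x.size)

    x_tr, x_cal, x_te = np.split(x, [n_train, n_train + n_cal])
    y_tr, y_cal, y_te = np.split(y, [n_train, n_train + n_cal])

    # Deliberately wrong model: a straight line fit to sinusoidal data
    coef = np.polyfit(x_tr, y_tr, deg=1)
    resid = np.abs(y_cal - np.polyval(coef, x_cal))

    # Conformal quantile: the ceil((n_cal+1)*(1-alpha))-th smallest residual
    alpha = 0.1
    k = int(np.ceil((n_cal + 1) * (1 - alpha)))
    q = np.sort(resid)[k - 1]

    # Empirical coverage of the intervals [pred - q, pred + q] on test data
    pred = np.polyval(coef, x_te)
    return np.mean(np.abs(y_te - pred) <= q)

cov = split_conformal_coverage()
print(cov)  # around 0.90 or above, despite the misspecified model
```

The guarantee requires only exchangeability of the calibration and test points, not a correct model, which is exactly the kind of property a prior cannot easily buy you.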
You comment on assumptions here, but in my opinion you have it backwards: if your Bayesian model handles non-i.i.d.-ness well, this is because the dependency shows up in the likelihood, which (say) the MLE still handles quite well (vaguely asymptotically efficient and so on). What if you want to be distribution-free, or want to check whether your answers are robust to your model being wrong in some directions? Maybe there will be better Bayesian answers here someday; statistics is generally a young field, but (in practice) I think the Frequentists just take the cake on this one.
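A hedged illustration of the robustness point (my own toy, not the commenter's): under heteroskedastic noise, the classical model-based OLS standard error is wrong, while the heteroskedasticity-robust "sandwich" standard error remains valid without changing the fitted model at all; there is no obvious prior-side analogue of this move.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x = rng.uniform(0, 2, n)
# Noise standard deviation grows with x: the homoskedastic model is wrong
y = 1.0 + 2.0 * x + rng.normal(0, 0.2 + x**2)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical (homoskedastic) standard error for the slope
sigma2_hat = resid @ resid / (n - 2)
se_classical = np.sqrt(sigma2_hat * XtX_inv[1, 1])

# Sandwich (HC0) standard error: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
meat = (X * resid[:, None] ** 2).T @ X
se_sandwich = np.sqrt((XtX_inv @ meat @ XtX_inv)[1, 1])

print(se_classical, se_sandwich)  # the sandwich SE is larger here
```

The point estimate is untouched; only the uncertainty statement is repaired, and the repair is valid under essentially arbitrary heteroskedasticity.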
This is again correct, of course, but I am specifically criticizing the essence of Yudkowsky’s point that “if it’s any good, it must be approximating a Bayesian answer”: who is approximating whom? Here it seems much more sensible to say that we have a good answer (the OLS estimate), one that we have reasons to prefer in some scenarios (e.g. Gauss-Markov, general distribution-free niceness), which a Bayesian method, strictly speaking, can only approximate, and which seems at odds with a pure subjectivist point of view (because the prior is incoherent, though this is much more salient in the Cox model example). In practice, of course, this is irrelevant.
A general way in which my mental model of statistics disagrees with what you write here concerns whether the specific properties required of estimators in different contexts (calibration, unbiasedness, minimum variance, etc.) are the things we actually want. I think of them as proxies, and I think Goodhart’s law applies: when you optimize an estimator in one of these senses, you pull the blanket too far to one side and break some other property that you would care about on reflection but are not aware of.
(Not answering many points in your comment to cut it short, I prioritized this one.)