You can have a law-view interpretation of such a synthesis, where we conceptualize Bayesianism as an imperfect approximation, a special case of the True Law, which should also capture all the good insights of Frequentism.
Yes, but such an interpretation falls outside of Yudkowsky’s view, as I understand it (for example on that X thread in another comment in this post, and his comments on other statistics topics I’ve seen around—I could fish for the quotes, but I’m a bit held up at this precise moment), which is what I’m focusing on here.
On Walker: in that paragraph, he is criticizing the specific (common) practice of comparing separate Bayesian models and picking the best (via ratios or errors or some such) when there is uncertainty about the truth, instead of appropriately representing this uncertainty about your sampling model in the prior.
Rolling a die is a bit of a nifty example here, since it’s the case where you assign a separate probability to each label in the sample space, so that your likelihood is in fact fully general. This is where the idea for a Dirichlet prior comes from: an attempt to generalize this notion of covering all possible models to less trivial problems. In the rest of the intro, Walker points to Bayesians fitting models with different likelihoods (e.g., Weibull vs. Lognormal—I think he is a survival guy), each with their own “inner” priors, comparing them against each other, and then picking the one which is “best”. He calls this incoherent, since picking a prior just to compare posteriors on some frequentist property like error or coverage is not an accurate representation of your prior uncertainty (instead, he wants you to pick some nonparametric model).
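A minimal sketch of that Dirichlet construction, in Python (my own illustration, not Walker’s; the roll counts are invented):

```python
import numpy as np

# One free probability per face of the die: the categorical likelihood is
# fully general, and the Dirichlet prior is its conjugate.
alpha = np.ones(6)                      # symmetric Dirichlet prior

# Hypothetical roll counts for faces 1..6 (made-up data for illustration).
counts = np.array([3, 5, 4, 6, 2, 4])

# Conjugacy: the posterior is Dirichlet with concentration alpha + counts.
posterior_alpha = alpha + counts
posterior_mean = posterior_alpha / posterior_alpha.sum()
print(posterior_mean.round(3))          # posterior expected face probabilities
```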
On Bayesian finite-sample miscalibration, simply pick a prior which is sufficiently far off from the true value and your predictive intervals will be very bad for a long time (you may check by simulation on some conjugate model). This is a contrived example, of course, but it happens on some level all the time, since Bayesian methods make no promise of finite-sample calibration—your prediction intervals just reflect belief, not what future data might be (in practice, I’ve heard people complain about this in Bayesian machine-learning-type situations). Of course, asymptotically and under some regularity conditions you will be calibrated, but one would rather be right before then. If you want finite-sample calibration, you have to look for a method which promises it. In terms of the coverage of credible intervals more generally, though, unless you want to be in the throes of asymptotics, you’d have to pick what is called a matching prior, which again seems in conflict with subjectivist information input.
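Here’s the kind of simulation I mean, as a sketch on a conjugate Beta-Bernoulli model (the true value and the deliberately bad prior are arbitrary choices of mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

true_p = 0.1                  # true success probability (assumed for the demo)
a0, b0 = 50.0, 1.0            # deliberately terrible Beta prior, piled up near 1
n, reps, level = 20, 5000, 0.90

covered = 0
for _ in range(reps):
    x = rng.binomial(n, true_p)            # one dataset of n Bernoulli trials
    post = stats.beta(a0 + x, b0 + n - x)  # conjugate Beta posterior
    lo, hi = post.ppf([(1 - level) / 2, (1 + level) / 2])
    covered += (lo <= true_p <= hi)

print(covered / reps)   # far below the nominal 0.90 at this sample size
```

With the prior mass near 1 and the truth at 0.1, the nominal 90% credible intervals cover the true value almost never at n = 20.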
On minimax: I don’t know how to format math on this site on a phone, so I will be a bit terse here, but the proof is very simple (I thought). In statistics, when we call an estimator “minimax” it means that it minimizes the maximum risk, which is the expectation of the loss over the distribution of the estimator. Since we have no data, all estimators are some constant c. The expectation of the loss is then just the loss with respect to the parameter (i.e., (c-p)^2). Clearly the maxima are taken when p is 0 or 1, so we minimize the maximum of c^2 and (1-c)^2, which has its minimum at c = 0.5. Which is to say, 1/2 has this nice Frequentist property, which is how one could justify it.
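In symbols, the same argument:

$$
R(c, p) = \mathbb{E}\big[(c - p)^2\big] = (c - p)^2, \qquad \sup_{p \in [0,1]} (c - p)^2 = \max\{c^2,\ (1 - c)^2\},
$$

and the right-hand maximum is smallest where $c^2 = (1-c)^2$, i.e., at $c = 1/2$.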
On your last comment: it seems like a bit of an open question whether the existence of practical intuition and reasoning about mathematical constructs like this can be attributed to a Bayesian prior-updating process. Certainly I reason, and I change my mind, but personally I see no reason to imagine this was Bayesian in some way (or that those thoughts were expressed in credence-probabilities which I shifted by conditioning on a type of sense-data), nor that I would ideally be doing this instead. But, I suppose such a thing could be possible.
Yes, but such an interpretation falls outside of Yudkowsky’s view, as I understand it
Maybe! But I would expect him to change his view to something like this if you managed to persuade him that there is some crucial flaw in Bayesianism. Meanwhile, your goal seems to be to propagate the toolbox-view as the only valid approach. So you might as well engage with a stronger version of the law-view right now.
On Walker, on that paragraph he is criticizing the specific (common) practice of comparing separate Bayesian models and picking the best (via ratios or errors or some such) when there is uncertainty about the truth instead of appropriately representing this uncertainty about your sampling model in the prior.
Rolling a die is a bit of a nifty example here since it’s the case where you assign a separate probability to each label in the sample space, so that your likelihood is in fact fully general, which is where the idea for a Dirichlet prior comes from in an attempt to generalize this notion of covering all possible models for less trivial problems.
So, suppose that instead of assigning equal probabilities to each label of a die, I consider this as just one of multiple possible models from a set of models with different priors. According to one of them:
P(1) = 1/2, P(2) = P(3) = P(4) = P(5) = P(6) = 1/10
According to another:
P(2) = 1/2, P(1) = P(3) = P(4) = P(5) = P(6) = 1/10
And so on and so forth.
And then I assign an equiprobable prior over these models and start collecting experimental data—see how well all of them perform. Do I understand correctly that Walker considers such an approach incoherent?
In which case, I respectfully disagree with him. While it’s true that this approach doesn’t represent our uncertainty about which label of an unknown die will be shown on a roll, it nevertheless represents the uncertainty about which Bayesian model best approximates the behavior of this particular die. And there is nothing incoherent in modelling the latter kind of uncertainty instead of the former.
And likewise for more complicated settings and models. Whenever we have uncertainty about which model is the best one, we can model this uncertainty and get a probabilistic answer to it via Bayesian methods. And then get a probabilistic answer according to this model, if we want to.
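For concreteness, here is the updating procedure I have in mind, sketched in code (the biased-die data are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Candidate models: the fair die plus six "spiked" ones, where one face
# gets probability 1/2 and the other five get 1/10 each.
models = [np.full(6, 1 / 6)]
for i in range(6):
    spiked = np.full(6, 0.1)
    spiked[i] = 0.5
    models.append(spiked)
models = np.array(models)                        # shape (7, 6)

prior = np.full(len(models), 1 / len(models))    # equiprobable over models

# Invented data: 60 rolls from a die that actually favors face 3.
rolls = rng.choice(6, size=60, p=[0.1, 0.1, 0.5, 0.1, 0.1, 0.1])
counts = np.bincount(rolls, minlength=6)

# Bayes: P(M | X) ∝ P(X | M) P(M), computed in log space for stability.
log_post = np.log(prior) + counts @ np.log(models.T)
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()
print(posterior.round(3))   # mass concentrates on the face-3 model
```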
On Bayesian finite-sample miscalibration, simply pick a prior which is sufficiently far off from the true value and your predictive intervals will be very bad for a long time
But why would the prior, capturing all your information about a setting, be sufficiently far off from the true value, in the first place? This seems to happen mostly when you misuse the Bayesian method by picking some arbitrary prior for no particular reason. Which is a weird complaint. Surely we can also misuse frequentist methods in a similar fashion—p-hacking immediately comes to mind, or just ignoring a bunch of data points altogether. But what’s the point in talking about this? We are interested in situations when the art fails us, not when we fail the art, aren’t we?
On minimax
Interesting! So is there agreement among frequentists that the probability of an unfair coin, about which we know nothing else, landing Tails is 1/2? Or is it more like: “Well, we have a bunch of tools, and here one of them says 1/2, but we do not have a principled reason to prefer it to other tools regarding the question of what the probability is, so the question is still open”.
On your last comment: it seems like a bit of an open question whether the existence of practical intuition and reasoning about mathematical constructs like this can be attributed to a Bayesian prior-updating process.
Is it? I thought everyone was in agreement that Bayes’ theorem naturally follows from the axioms of probability theory. In which case, the only reason such reasoning wouldn’t follow the Bayesian updating procedure is that, somehow, probability theory is not applicable to reasoning about mathematical constructs in particular—but why would that be true?
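(For reference, it’s a one-line consequence of the definition of conditional probability, $P(A \mid B) = P(A \cap B)/P(B)$:)

$$
P(H \mid D) = \frac{P(H \cap D)}{P(D)} = \frac{P(D \mid H)\,P(H)}{P(D)}, \qquad P(D) > 0.
$$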
Certainly I reason, and I change my mind, but personally I see no reason to imagine this was Bayesian in some way (or that those thoughts were expressed in credence-probabilities which I shifted by conditioning on a type of sense-data), nor that I would ideally be doing this instead.
Oh wait, you don’t think that probability theory is applicable to reasoning in general? Surely I’m misunderstanding you? Could you elaborate on your position here? I feel that this is the most important crux of our disagreement.
Maybe! But I would expect him to change his view to something like this if you managed to persuade him that there is some crucial flaw in Bayesianism. Meanwhile, your goal seems to be to propagate the toolbox-view as the only valid approach. So you might as well engage with a stronger version of the law-view right now.
Well, maybe, I don’t know. As it stands, it just seems best to argue against what he has said, at his word, rather than to assume otherwise, insofar as other people take this view at face value. Though, if such a thing does come about, I would of course have to write a different post. This could be some part of LessWrong culture that I am just ignorant of, though, so apologies.
And then I assign an equiprobable prior over these models and start collecting experimental data—see how well all of them perform. Do I understand correctly that Walker considers such an approach incoherent?
It depends on what you mean by “see how well all of them perform”. In this situation you can easily get a reasonably small set of models that might represent your total uncertainty, and then (crucially) you should obtain whatever estimates or uncertainties you desire by updating the posterior of the complete model, sub-models included—i.e., $P(p_1, p_2, \ldots, p_6 \mid X)$ must be marginalized over $M$.
To a Bayesian, this is simply the uniquely-identified distribution function which represents your uncertainty about these parameters—no other function represents this, and any other probability represents some other belief. This of course includes the procedure of maximizing some data score (i.e., finding an empirical “best model”), which would be something like $P(p_1, p_2, \ldots, p_6 \mid X, M = m^*)$ where $m^* = \arg\max_{m \in M} T(X, m)$, in which $T$ is some model evaluation score (possibly just the posterior probability of the model).
This seems like a very artificial thing to report as your uncertainty about these parameters, and it essentially guarantees that your uncertainties will be underestimated—among other things, there is no guarantee that such a procedure follows the likelihood principle (for most measures of model correctness other than something proportional to the posterior probability of each model, but if you have those at hand you might as well just marginalize over them), and by the uniqueness part of Cox’s proof it will break one of the presuppositions there (and therefore likely fall prey to some Dutch Book). Therefore, if you consider these the reasons to be Bayesian, to be a Bayesian and do this seems ill-fated.
Of course, he and I and most reasonable people accept that this could be a useful practical device in many contexts, but it is indeed incoherent in the formal sense of the word.
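To make the contrast concrete, a sketch reusing the toy die models from your comment (the small dataset is invented, chosen to keep the model uncertainty alive):

```python
import numpy as np

# The same toy model set: fair die plus six spiked alternatives.
models = [np.full(6, 1 / 6)]
for i in range(6):
    spiked = np.full(6, 0.1)
    spiked[i] = 0.5
    models.append(spiked)
models = np.array(models)

prior = np.full(len(models), 1 / len(models))

# A small, ambiguous dataset: ten rolls, mildly favoring face 1.
counts = np.array([4, 1, 2, 1, 1, 1])

log_post = np.log(prior) + counts @ np.log(models.T)
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Coherent report: marginalize the predictive over model uncertainty,
# P(next = k | X) = sum_m P(next = k | m) P(m | X).
pred_marginal = post @ models

# The criticized procedure: condition on the single best-scoring model.
pred_best = models[np.argmax(post)]

print(pred_marginal.round(3))   # hedges between the live models
print(pred_best.round(3))       # all uncertainty about M silently discarded
```

The second report throws away the posterior mass left on the other models, which is exactly the underestimated uncertainty I mean.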
But why would the prior, capturing all your information about a setting, be sufficiently far off from the true value, in the first place?
Well, you told me to grab you the simplest possible example, not the most cogent or common or general one!
But, well, regardless: if you’re looking for this happening in practice, this complaint is sometimes levied at Bayesian versions of machine learning models, where it is especially hard to convincingly supply priors for humongous quantities of parameters. Here’s some random example I found of this happening in a much less complex situation. This is all beside the point, though, which is again that there is no guarantee that I know of for Bayesians to be finite-sample calibrated AT ALL. There are asymptotic arguments, and there are approximate arguments, but there is simply no property which says that even an ideal Bayesian (under the subjectivist viewpoint) will be approximately calibrated.
Note that I also have no reason to assume the prior must be sufficiently close to the true value—since a prior for a subjectivist is just an old posterior, this is tantamount to just assuming a Bayesian is not only consistent but well on his way to asymptotic-land, so the cart is put before the horse.
Interesting! (...) “so the question is still open”.
Few Frequentists believe that there is an absolute principle by which an estimate is the unique best one, no, but this doesn’t make the question “still open”, just only as open as any other statistical question (in that, until proven otherwise, you could always pose a method which is just as good as the original in one way or another while also being better in some other dimension). 1/2 seems hard to beat, though, unless you specify some asymmetrical loss function instead (which seems like an appropriate amount of openness to me, in that in those situations you obviously do want to use a different estimate).
thought everyone was in agreement that Bayes’ theorem naturally follows from the axioms of probability theory.
Yes, of course, but it does not follow from this that any particular bit of thought can be represented as a posterior obtained from some prior. Clearly, whatever thought might be in my head could easily just not follow the axioms of probability, and frankly I think this is almost certain. Maybe there does exist some decent representation of practical mental problems such as these, but I would have to see it to believe it. Not only that, but I am doubtful of the value of supposing that whatever ideal this thought ought to aspire to is a Bayesian one (thus the content of the post—consider that a representation of this practical problem is another formally nonparametric one, in that the ideal list of mathematical properties must be comically large; if I am to assume some smaller space, I am implicitly claiming a probability of zero on rather a lot of mathematical laws which could be good but which I cannot immediately conceive of, which seems horrible as an ideal).