Thanks for the reply, and sorry I only just saw this. It was indeed my goal to talk about existing ideas in a nontechnical way, which is why I didn’t frame things in terms of model expansion, etc. Beyond that, however, I am confused by your reply, as it seems to make little contact with my intended argument. You state that I recommend “just ignoring” the issue, and suggest that I endorse double-counting as OK. Can you explain what parts of the post led you to believe that was my recommendation? Because that is very much not my intended message!
(I stress that I’m not trying to be snarky. The goal of the post is to be a non-technical explanation, and I don’t want to change that. But if the post reads as you suggest, I interpret that as a failure of the post, and I’d like to fix that.)
Thanks for replying. Given that it’s been a month, I sadly don’t remember all the details of why I wrote what I wrote in my initial comment, but I’ll try to restate my objections more specifically so that you can see where they make contact with your post (“if I had more time, I would’ve written a shorter letter”). Forgive me if it was hard to understand; English is my second language.
My first issue: the post is titled “Good if make prior after data instead of before”. Yet, the post’s driving example is a situation where the (marginal) prior probability of what you’re interested in doesn’t actually change, but instead is coupled to a larger model with a larger probability space where the likelihood is different at these different points. So, what you’re talking about isn’t really post-hoc changes to the prior, but something like model expansion, as you write in the comment.
In the context of methodologies for Bayesian model expansion, there is a lot of controversy and much ink has been spilled, because being ad hoc and implicitly accepting a data-driven prior or selected model leads to incoherence: the decision procedure you derive from this is not actually Bayesian in the sense of satisfying all the nice properties people expect of Bayesian decision rules and Bayesian reasoning; it just vaguely follows Bayes’s rule for conditioning. When you write
So the only practical way to get good results is to first look at the data to figure out what categories are important, and then to ask yourself how likely you would have said those categories were, if you hadn’t yet seen any of the evidence.
you are sidestepping all of these issues (what I called “solving by not solving”) and accepting incoherence as OK. And, well, this can be a fine approach: being approximately incoherent can be approximately no problem. But I think that the post not only fails to address the negatives of this particular approach, positioning it as kind of the only thing you can reasonably do (which is in itself a sufficiently large problem), but also fails to consider any alternatives. (A classic objection to this type of methodology in a canonical introductory textbook, providing one of the alternatives I mentioned, is here, for example; the idea there is to have a model flexible and general enough that it can learn in essentially any situation. I mentioned other methods in the comment.) Do you not see the incoherence of a data-driven prior as bad somehow?
To be clear, the other approach you consider, “never change your model/prior after seeing the data, even if your model makes no sense, your posterior is stuck as it is”, is also bad for all the obvious model misspecification reasons. But at the very least it is coherent (and, of course, by data-splitting you get to enjoy this coherence without being rigid, at the cost of a little data, so there’s another approach, much less technical to explain than the nonparametric approach mentioned earlier). This is my main problem with the article, really: it proposes just this one idea among several without discussing its positives or negatives in relation to any of the others.
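To make the data-splitting idea concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption on my part rather than something from the post: the coin-flip data, the function names, and the rule for choosing candidate hypotheses are all hypothetical. The point is only that the half of the data used to decide what the model looks like is never reused for the update.

```python
import random

def split_data(flips, seed=0):
    """Shuffle a copy of the data and cut it into an explore half and a confirm half."""
    shuffled = flips[:]
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def choose_candidates(explore):
    """Hypothetical model-building rule: always keep the fair coin, and add the
    heads frequency seen in the explore half (clipped away from 0 and 1)."""
    freq = sum(explore) / len(explore)
    freq = min(max(freq, 0.05), 0.95)
    return sorted({0.5, round(freq, 2)})

def posterior(candidates, confirm):
    """Posterior over the candidates under a uniform prior, using ONLY the confirm half."""
    heads = sum(confirm)
    tails = len(confirm) - heads
    weights = [p**heads * (1 - p)**tails for p in candidates]
    total = sum(weights)
    return {p: w / total for p, w in zip(candidates, weights)}

# Made-up data: 1 = heads, 0 = tails.
flips = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
explore, confirm = split_data(flips)
candidates = choose_candidates(explore)
post = posterior(candidates, confirm)
print(candidates, post)
```

The price is visible in the code: only half of the flips ever enter the likelihood, which is the “little data” you pay for staying coherent.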
My point about this article endorsing “double-counting” is that this approach (roughly summarized as “construct the model after seeing the data, pretending like you haven’t seen the data”), in comparison to either a nonparametric approach or some M-open idea like model mixing or stacking, will privilege the particular model which you happened to construct on the basis of the data more than a fully coherent theoretical approach would.
An easy way to see this is to imagine trying this approach while knowing all the possible models you could have picked (i.e., as in model averaging; a similar critique of this idea is gestured at in this other intro). In this representation, instead of observing the data and updating towards the particular model that fits the data best, your method sets one model’s probability to 1 and all others to zero, which is a rather extreme version of double-counting[1].
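A toy sketch of that comparison, in Python. The two coin models, the equal prior weights, and the data are assumptions I made up for illustration, and `select_then_pretend` is my own hypothetical name for the “pick the best-fitting model, then treat it as the only one” procedure; it is not meant as a faithful implementation of the post.

```python
from math import comb

def binom_lik(p, heads, n):
    """Binomial likelihood of seeing `heads` heads in `n` flips with heads probability p."""
    return comb(n, heads) * p**heads * (1 - p)**(n - heads)

def posterior_model_probs(priors, heads_probs, heads, n):
    """Coherent update: posterior weight of each model is proportional to prior times likelihood."""
    joint = [pr * binom_lik(p, heads, n) for pr, p in zip(priors, heads_probs)]
    total = sum(joint)
    return [j / total for j in joint]

def select_then_pretend(heads_probs, heads, n):
    """Put probability 1 on the best-fitting model and 0 on every other model."""
    liks = [binom_lik(p, heads, n) for p in heads_probs]
    best = liks.index(max(liks))
    return [1.0 if i == best else 0.0 for i in range(len(heads_probs))]

heads_probs = [0.5, 0.7]   # assumed models: fair coin vs. biased coin
priors = [0.5, 0.5]        # assumed prior weights over the two models
heads, n = 7, 10           # made-up data

averaged = posterior_model_probs(priors, heads_probs, heads, n)
selected = select_then_pretend(heads_probs, heads, n)
print("model averaging:", averaged)
print("select one model:", selected)
```

With these numbers the averaged weights favour the biased coin but leave real mass on the fair one, while the selection rule discards the fair coin entirely on the strength of ten flips.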
So, from my perspective, a good version of this article would not talk about anything being “the only practical way to get good results”. It would situate this idea alongside all the other ones in this vein which have been discussed for decades, or at least gesture at the more common approaches you consider sensible and think you can explain nontechnically (hopefully referenced by name), and at the bare minimum it would explain the pros and cons of what it advocates with more balance. Admittedly, this is a much harder article to write, because the issue becomes nuanced, and I would not know how to write it non-technically, at least not immediately. However, the issue does seem to be nuanced, at least to me, and this level of simplification misleads more than it helps.
In the original comment I decided to talk more about how easy it is to make double-counting methods seem arbitrarily good by constructing examples where you know the truth in advance: of course it looks better if you get to the truth twice as fast, but when the data happens to be misleading, double-counting gets you doubly wrong too. In hindsight, this objection seems kind of petty and irrelevant compared to the other ones.
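For what it’s worth, the “twice as fast, doubly wrong” point is easiest to see in log-odds, where counting the same data twice simply doubles the evidence term. A small Python sketch, with made-up coin hypotheses and data (all names and numbers here are hypothetical):

```python
from math import log

def log_likelihood_ratio(heads, n, p_alt, p_null):
    """Log-likelihood ratio of the alternative vs. the null for coin-flip data."""
    tails = n - heads
    return heads * log(p_alt / p_null) + tails * log((1 - p_alt) / (1 - p_null))

def posterior_log_odds(prior_log_odds, llr, times_counted=1):
    """Posterior log-odds when the same evidence is counted `times_counted` times."""
    return prior_log_odds + times_counted * llr

prior = 0.0  # even prior odds between the two hypotheses

# Suppose the alternative (p = 0.7) is true. One sample supports it, one is misleading.
supportive = log_likelihood_ratio(heads=7, n=10, p_alt=0.7, p_null=0.5)
misleading = log_likelihood_ratio(heads=3, n=10, p_alt=0.7, p_null=0.5)

for name, llr in [("supportive", supportive), ("misleading", misleading)]:
    once = posterior_log_odds(prior, llr, times_counted=1)
    twice = posterior_log_odds(prior, llr, times_counted=2)
    print(name, "counted once:", once, "counted twice:", twice)
```

When the sample points the right way, the double-counted log-odds move twice as far towards the truth; when it points the wrong way, they move twice as far away from it.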