That’s an interesting example, thanks for linking it. I read it carefully, and also some of the Robins/Ritov CODA paper:
http://www.biostat.harvard.edu/robins/coda.pdf
and I think I get it. The example is phrased in the language of sampling/missing data, but for those in the audience familiar w/ Pearl, we can rephrase it as a causal inference problem. After all, causal inference is just another type of missing data problem.
We have a treatment A (a drug), and an outcome Y (death). Doctors assign A to some patients, but not others, based on their baseline covariates C. Then some patients die. The resulting data come from an observational study, and we want to infer from them the effect of the drug on survival, which we can obtain from p(Y | do(A=yes)).
We know in this case that p(Y | do(A=yes)) = sum_C p(Y | A=yes, C) p(C) (this is just what “adjusting for confounders” means).
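To make the adjustment formula concrete, here is a minimal sketch in Python, assuming a single binary covariate C and made-up numbers (the function name `adjusted_risk`, the simulated doctors' rule, and the death probabilities are all hypothetical and not from the example above). It compares the naive p(Y | A=yes) with the stratify-and-average estimate.

```python
import numpy as np

def adjusted_risk(y, a, c):
    """Plug-in estimate of p(Y=1 | do(A=1)) = sum_c p(Y=1 | A=1, C=c) p(C=c).

    Assumes C is discrete with few levels, so each stratum can be
    estimated by a simple sample mean.
    """
    y, a, c = map(np.asarray, (y, a, c))
    total = 0.0
    for level in np.unique(c):
        in_stratum = c == level
        treated = in_stratum & (a == 1)
        if not treated.any():
            raise ValueError(f"no treated patients with C={level}")
        # p(Y=1 | A=1, C=level) * p(C=level)
        total += y[treated].mean() * in_stratum.mean()
    return total

# Hypothetical data: sicker patients (C=1) are treated more often and die more often.
rng = np.random.default_rng(0)
n = 200_000
c = rng.binomial(1, 0.5, n)
a = rng.binomial(1, np.where(c == 1, 0.8, 0.2))
y = rng.binomial(1, np.where(c == 1, 0.6, 0.1) - 0.05 * a)

# By construction the truth is 0.5 * 0.55 + 0.5 * 0.05 = 0.30.
naive = y[a == 1].mean()          # confounded: the treated are mostly the sick
adjusted = adjusted_risk(y, a, c)
print(f"naive={naive:.3f} adjusted={adjusted:.3f}")
```

The naive number is pulled toward the death rate of the sicker patients, because they are over-represented among the treated; the adjusted number reweights the strata by p(C).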
If we then had a parametric model for E[Y | A=yes,C], we could just fit that model and average (this is “likelihood based inference”). Larry and Jamie are worried about the (admittedly adversarial) situation where the relationship between Y, A, and C is really complicated: any specific parametric model we might conceivably use will be wrong, while non-parametric methods may suffer from the curse of dimensionality in moderate samples. But of course, the way we specified the problem, we know p(A | C) exactly, because the doctors told us the rule by which they assign treatments.
Something like the Horvitz/Thompson estimator, which uses only this (correct) model, or other estimators that address issues with the H/T estimator by also using the conditional model for Y, may behave better in such settings. But importantly, these methods exploit a part of the model we technically do not need (p(A | C) does not appear anywhere in the above “adjustment for confounders” expression). They can do so because in this particular setting it happens to be specified exactly, while the parts of the model we do technically need for likelihood based inference to work are really complicated and hard to get right at moderate samples.
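Here is a rough sketch of that contrast in Python, under assumptions of my own: a single continuous covariate, a known linear treatment rule, and a deliberately wrong linear outcome model. The function names, the simulated rule `prop`, and the outcome curve `mu` are hypothetical. The "also using the conditional model for Y" estimator is written in the usual augmented inverse-probability-weighting form, which is one such estimator and not necessarily the one in the CODA paper.

```python
import numpy as np

def horvitz_thompson(y, a, prop):
    """Estimate E[Y | do(A=1)] using only the known p(A=1 | C) = prop."""
    return np.mean(a * y / prop)

def augmented_ipw(y, a, prop, m):
    """H/T plus a correction from a working outcome model m(C) ~ E[Y | A=1, C].

    With prop known, this stays (approximately) unbiased even if m is wrong.
    """
    return np.mean(m + a * (y - m) / prop)

rng = np.random.default_rng(1)
n = 200_000
c = rng.uniform(-1, 1, n)
prop = 0.5 + 0.4 * c              # the doctors' rule, known exactly (assumed form)
a = rng.binomial(1, prop)
mu = 0.2 + 0.6 * c**2             # true E[Y | A=1, C]; we pretend not to know it
y = rng.binomial(1, mu)           # outcome under treatment; only used where a == 1

# By construction E[mu(C)] = 0.2 + 0.6 / 3 = 0.4 for C ~ Uniform(-1, 1).
# Misspecified "likelihood" route: fit a straight line among the treated, then average.
coef = np.polyfit(c[a == 1], y[a == 1], 1)
m = np.polyval(coef, c)
plug_in = m.mean()

ht = horvitz_thompson(y, a, prop)
aipw = augmented_ipw(y, a, prop, m)
print(f"plug-in={plug_in:.3f} H/T={ht:.3f} AIPW={aipw:.3f}")
```

The straight-line plug-in is off because the treated patients are not a representative sample of C, and the line cannot follow the curvature. The two weighted estimators lean on the known p(A | C) instead, so the wrong outcome model only costs them variance.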
But these kinds of estimators are not Bayesian. Of course arguably this entire setting is one Bayesians don’t worry about (but maybe they should? These settings do come up).
The CODA paper apparently stimulated some subsequent Bayesian activity, e.g.:
http://www.is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/techreport2007_6326[0%5D.pdf
So, things are working as intended :).
You’re welcome for the link, and it’s more than repaid by your causal inference restatement of the Robins-Ritov problem.
“Of course arguably this entire setting is one Bayesians don’t worry about (but maybe they should? These settings do come up).”
Yeah, I think this is the heart of the confusion. When you encounter a problem, you can turn the Bayesian crank and it will always do the Right thing, but it won’t always do the right thing. What I find disconcerting (as a Bayesian drifting towards frequentism) is that it’s not obvious how to assess the adequacy of a Bayesian analysis from within the Bayesian framework. In principle, you can do this mindlessly by marginalizing over all the model classes that might apply, maybe? But in practice, a single model class usually gets picked by non-Bayesian criteria like “does the posterior depend on the data in the right way?” or “does the posterior capture the ‘true model’ from simulated data?” Or a Bayesian may (rightly or wrongly) decide that a Bayesian analysis is not appropriate in that setting.
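For what it’s worth, the “simulated data” check can be sketched in a few lines of Python. This is a toy, assuming a Beta(1, 1) prior on a binomial proportion; the function names and all the numbers are hypothetical. It simulates data from a known truth many times and counts how often the 95% posterior interval covers that truth, which is a frequentist yardstick applied to a Bayesian procedure.

```python
import numpy as np

def credible_interval(successes, trials, rng, level=0.95, draws=4000):
    """Equal-tailed posterior interval under a Beta(1, 1) prior, by Monte Carlo."""
    post = rng.beta(1 + successes, 1 + trials - successes, size=draws)
    lo, hi = np.quantile(post, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

def coverage(theta_true, trials, reps, rng):
    """Fraction of simulated data sets whose interval contains theta_true."""
    hits = 0
    for _ in range(reps):
        k = rng.binomial(trials, theta_true)
        lo, hi = credible_interval(k, trials, rng)
        hits += lo <= theta_true <= hi
    return hits / reps

rng = np.random.default_rng(2)
cov = coverage(theta_true=0.3, trials=50, reps=2000, rng=rng)
print(f"coverage of nominal 95% interval: {cov:.3f}")
```

In a well-behaved model like this one the coverage lands near the nominal level. The uncomfortable cases are the ones where the same check fails and nothing inside the posterior itself would have told you.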