The example I gave in my first reply (where g(x) is known to be either one of two known functions h(x) or j(x)) can easily be extended into the kind of fully specified counterexample you are looking for.
That looks like a parametric model. There is one parameter, a binary variable that chooses h or j. A belief about that parameter is a probability p that h is the function. Yes, I can see that updating p on sight of the data may give a better estimate of E[Y|f(X)=1], which is known a priori to be either h(1) or j(1).
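To make the two-function case concrete, here is a minimal sketch of that update. Everything specific in it is my own assumption for illustration and not part of the original example: the candidates h(v) = 2v and j(v) = v^2, Gaussian noise with known standard deviation sigma, the reading that E[Y|f(X)=v] = g(v), and the function names themselves.

```python
import math
import random


def posterior_prob_h(vs, ys, h, j, prior_h=0.5, sigma=1.0):
    """Posterior probability that g = h, given pairs (f(x_i), y_i).

    Assumes (hypothetically) Y = g(f(X)) + Normal(0, sigma^2) noise.
    """
    # Gaussian log-likelihoods; constants shared by both candidates cancel.
    ll_h = sum(-(y - h(v)) ** 2 / (2 * sigma ** 2) for v, y in zip(vs, ys))
    ll_j = sum(-(y - j(v)) ** 2 / (2 * sigma ** 2) for v, y in zip(vs, ys))
    log_odds = math.log(prior_h) - math.log(1 - prior_h) + ll_h - ll_j
    # Numerically stable logistic function of the posterior log-odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


def estimate_at(p_h, h, j, v0=1.0):
    """Posterior mean of E[Y | f(X) = v0]: a mixture of h(v0) and j(v0)."""
    return p_h * h(v0) + (1 - p_h) * j(v0)


def h(v):
    return 2 * v  # illustrative candidate, h(1) = 2


def j(v):
    return v ** 2  # illustrative candidate, j(1) = 1


# Simulated data in which h is the true function.
rng = random.Random(0)
vs = [rng.uniform(0, 3) for _ in range(200)]
ys = [h(v) + rng.gauss(0, 1.0) for v in vs]

p = posterior_prob_h(vs, ys, h, j)
print(p, estimate_at(p, h, j))
```

Every observation contributes to the log-odds, including those far from f(X)=1, which is why the whole sample helps here.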
I expect it would be similar for small numbers of parameters also, such as a linear relationship between X and Y. Using the whole sample should improve on only looking at the subsample around f(X)=1.
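A rough simulation of the linear case, as a sketch only. It uses an ordinary least-squares fit instead of a full Bayesian posterior (a swapped-in simplification), and the true line, noise level, sample size and window width are all numbers I made up. It compares the squared error of predicting at f(X)=1 from a fit to the whole sample against the squared error of averaging only the Y values whose f(X) lies close to 1.

```python
import random


def ols_predict(vs, ys, v0):
    """Fit y = a + b*v by least squares on all points and predict at v0."""
    n = len(vs)
    mv = sum(vs) / n
    my = sum(ys) / n
    sxx = sum((v - mv) ** 2 for v in vs)
    sxy = sum((v - mv) * (y - my) for v, y in zip(vs, ys))
    return my + (sxy / sxx) * (v0 - mv)


def local_mean(vs, ys, v0, width):
    """Mean of the y values whose v lies within `width` of v0, or None."""
    near = [y for v, y in zip(vs, ys) if abs(v - v0) <= width]
    return sum(near) / len(near) if near else None


def compare(trials=2000, n=50, width=0.1, seed=0):
    """Return (MSE of whole-sample fit, MSE of local mean) at v0 = 1."""
    rng = random.Random(seed)
    a, b, sigma = 1.0, 2.0, 1.0  # hypothetical true line and noise level
    truth = a + b * 1.0
    se_full = se_local = 0.0
    used = 0
    for _ in range(trials):
        vs = [rng.uniform(0, 2) for _ in range(n)]
        ys = [a + b * v + rng.gauss(0, sigma) for v in vs]
        loc = local_mean(vs, ys, 1.0, width)
        if loc is None:
            continue  # no points near v0 in this trial; skip it for both
        used += 1
        se_full += (ols_predict(vs, ys, 1.0) - truth) ** 2
        se_local += (loc - truth) ** 2
    return se_full / used, se_local / used


mse_full, mse_local = compare()
print(mse_full, mse_local)
```

This only illustrates the case where the linear model is actually true; it says nothing about what happens when it is not.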
However, in the nonparametric case (I think you are arguing) this goes wrong. The sample size is not large enough to estimate a model that gives a narrow estimate of E[Y|f(X)=1]. Am I understanding you yet?
It seems to me that the problem arises even before getting to the nonparametric case. If a parametric model has too many parameters to estimate from the sample, and the model predictions are everywhere sensitive to all of the parameters (so that it cannot be approximated by any simpler model), then trying to estimate E[Y|f(X)=1] by first fitting the model and then predicting from it will also not work.
It so clearly will not work that it must be a wrong thing to do. It is not yet clear to me that a Bayesian statistician must do it anyway. The set {Y|f(X)=1} conveys information about E[Y|f(X)=1] directly, independently of the true model (assumed for the purpose of this discussion to be within the model space being considered). Estimating it via fitting a model ignores that information. Is there no Bayesian method of using it?
A partial answer to your question:
So if the prior is weak (as it is in my main post) you don’t model the function, and if the prior is strong, you model the function (and therefore make use of all the observations)? Where do you draw the line?
would be that the less the model helps, the less attention you pay it relative to calculating Mean{Y|f(X)=1}. I don’t have a mathematical formulation of how to do that though.