Why should one only try to model E[Y|f(X)=1] and not the underlying function g(x)=E[Y|x]?
What would it tell you if you could? The problem is to estimate Y for a certain population. Therefore, look at that population. I am not seeing a reason why one would consider modelling g, so I am at a loss to answer the question, why not model g?
Jaynes and a few others generally write things like E[ Y | I ] or P( Y | I ) where I represents “all of your background knowledge”, not further analysed. f(X)=1 is playing the role of I here. It’s a placeholder for the stuff we aren’t modelling and within which the statistical reasoning takes place.
Suppose f was a very simple function, for example, the identity. You are asked to estimate E[ Y | X=1 ]. What do the Bayesian and the frequentist do in this case? They are still only being asked about the population for which X=1. Can either of them get better information about E[ Y | X=1 ] by looking (also) at samples where X is not 1?
The example is a simplification of Wasserman’s; I’m not sure if a similar answer can be made there.
BTW, I’m not a statistician, and these aren’t rhetorical questions.
ETA: Here’s an even simpler example, in which it might be possible to demonstrate mathematically the answer to the question, can better information be obtained about E[ Y | X=1 ] by looking at members of the population where X is not 1? Suppose it is given that X and Y have a bivariate normal distribution, with unknown parameters. You take a sample of 1000, and are given a choice of taking it either from the whole population, or from that sliver for which X is in some range 1 +/- ε for ε very small compared with the standard deviation of X. You then use whatever tools you prefer to estimate E[ Y | X=1 ]. Which method of sampling will allow a better estimate?
ETA2: Here is my own answer to my last question, after looking up some formulas concerning linear regression. Let Y1 be the mean of Y in a sample drawn from a narrow neighbourhood of X=1, and let Y2 be the estimate of E[ Y | X=1 ] obtained by doing linear regression on a sample drawn from the whole population. Both samples have the same size n, assumed large enough to ignore small-sample corrections. Then the ratio of the standard error of Y2 to that of Y1 is sqrt( 1 + k^2 ), where k is the difference between 1 and E[X], in units of the standard deviation of X. So at least for this toy example, a narrow sample always works at least as well as a broad one, and is almost always better. Is this a general fact, or are there equally simple examples where the opposite is found?
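The sqrt( 1 + k^2 ) figure can be checked by simulation. It comes from the textbook variance of a regression prediction at x0, namely σ^2 ( 1/n + (x0 - mean(x))^2 / Sxx ), compared with σ^2 / n for a plain mean. The sketch below is only an illustration: the function name `simulate_se_ratio` and every parameter value are my own assumptions, and the narrow sample is taken in the limit ε → 0, so that every X in it equals x0 exactly.

```python
import math
import random
import statistics


def simulate_se_ratio(mu_x=0.0, sd_x=1.0, a=2.0, b=0.5, noise_sd=1.0,
                      x0=1.0, n=100, trials=2000, seed=0):
    """Monte Carlo estimate of SE(Y2) / SE(Y1) for estimating E[Y | X = x0].

    Y1: mean of Y over a sample taken at X = x0 (the epsilon -> 0 limit).
    Y2: least-squares prediction at x0 from a sample of the whole population.
    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    narrow_estimates = []
    broad_estimates = []
    for _ in range(trials):
        # Narrow sample: X is pinned at x0, so only the noise in Y varies.
        ys = [a + b * x0 + rng.gauss(0.0, noise_sd) for _ in range(n)]
        narrow_estimates.append(sum(ys) / n)

        # Broad sample: X normal, Y linear in X plus constant-variance noise,
        # which makes (X, Y) bivariate normal.
        xs = [rng.gauss(mu_x, sd_x) for _ in range(n)]
        ys = [a + b * x + rng.gauss(0.0, noise_sd) for x in xs]
        xbar = sum(xs) / n
        ybar = sum(ys) / n
        sxx = sum((x - xbar) ** 2 for x in xs)
        sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        slope = sxy / sxx
        broad_estimates.append(ybar + slope * (x0 - xbar))

    return statistics.stdev(broad_estimates) / statistics.stdev(narrow_estimates)


if __name__ == "__main__":
    k = (1.0 - 0.0) / 1.0  # distance of x0 from E[X] in units of sd(X)
    print("simulated ratio:", simulate_se_ratio())
    print("sqrt(1 + k^2):  ", math.sqrt(1.0 + k * k))
```

With finite n the simulated ratio should sit slightly above sqrt( 1 + k^2 ), since the formula ignores small-sample corrections. Setting x0 equal to E[X] (k = 0) should bring the ratio close to 1.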
ETA3: I might have such an example. Suppose that Y given X is distributed as a + bX + ε(X), where ε(X) is a random variable whose mean is always zero but whose variance is high in the neighbourhood of X=1 and low elsewhere. Then a linear regression on a sample from the full population may allow a better estimate of E[ Y | X=1 ] than the mean of a sample from the neighbourhood of X=1. A sample that avoids that region may do better still. Intuitively, if there’s a lot of noise where you want to look, extrapolate from where there’s less noise.
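Here is a rough simulation of that heteroscedastic case, comparing three sampling schemes: narrow (at X=1), broad (whole population), and one that avoids the noisy region. Everything specific in it is an assumption made for illustration: the step-shaped noise profile `noise_sd`, its width and levels, X distributed as standard normal, and the names `ols_predict` and `compare_schemes`. It also relies on the linear form a + bX being correct, which is what makes extrapolation safe here.

```python
import random
import statistics

HALF_WIDTH = 0.1   # assumed half-width of the noisy region around x0
HIGH_SD = 5.0      # assumed noise sd inside the noisy region
LOW_SD = 0.5       # assumed noise sd everywhere else


def noise_sd(x, x0=1.0):
    """Hypothetical noise profile: large sd near x0, small sd elsewhere."""
    return HIGH_SD if abs(x - x0) < HALF_WIDTH else LOW_SD


def ols_predict(xs, ys, x0):
    """Ordinary least-squares fit of y on x, evaluated at x0."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return ybar + (sxy / sxx) * (x0 - xbar)


def compare_schemes(a=2.0, b=0.5, x0=1.0, n=100, trials=1000, seed=0):
    """Return the empirical standard error of each scheme's estimate of E[Y | X = x0]."""
    rng = random.Random(seed)

    def draw_y(x):
        return a + b * x + rng.gauss(0.0, noise_sd(x, x0))

    def draw_x_avoiding():
        # Rejection sampling: redraw until X falls outside the noisy region.
        while True:
            x = rng.gauss(0.0, 1.0)
            if abs(x - x0) >= HALF_WIDTH:
                return x

    estimates = {"narrow": [], "broad": [], "avoid": []}
    for _ in range(trials):
        # Narrow: every observation taken at X = x0, estimate is the mean of Y.
        ys = [draw_y(x0) for _ in range(n)]
        estimates["narrow"].append(sum(ys) / n)

        # Broad: X from the whole population, estimate by regression.
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        estimates["broad"].append(ols_predict(xs, [draw_y(x) for x in xs], x0))

        # Avoid: X from the population minus the noisy region, then regress.
        xs = [draw_x_avoiding() for _ in range(n)]
        estimates["avoid"].append(ols_predict(xs, [draw_y(x) for x in xs], x0))

    return {name: statistics.stdev(vals) for name, vals in estimates.items()}


if __name__ == "__main__":
    for name, se in compare_schemes().items():
        print(name, se)
```

Under these assumed settings the narrow sample should come out with the largest standard error and the region-avoiding sample with the smallest. How big the gap is depends entirely on the made-up noise levels, so treat the numbers as qualitative.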
But it’s not clear to me that this bears on the Bayesian vs. frequentist matter. Both of them are faced with the decision to take a wide sample or a narrow one. The frequentist can’t insist that the Bayesian takes notice of structure in the problem that the frequentist chooses to ignore.