philosophically ideal thing is unattainable in this case
Slightly confused here. Rationality is defined as winning, yes? If your “ideal thing” is not winning, it’s not rational, and should be dropped like a hot potato. In fact, if it’s losing, in what sense is it “ideal”?
Posteriors, etc. are tools, that’s all.
I think the Robins/Wasserman example is about the interplay between structural assumptions about how the data came to be and statistical inference from that data (specifically, it’s about where information lives). In particular, the classical Bayesian setup tacitly makes certain structural assumptions that lead to all the information living in the likelihood function. These assumptions do not hold in the Robins/Wasserman case: most of the information lives in the assignment probability, which sits outside the likelihood.
This is similar to how classification problems in machine learning cannot be solved by standard methods if certain tacit assumptions (training and test data are from the same distribution) fail to hold. In that case you need to use not only standard machine learning insights about what makes a good classifier, but also additional insights that correct for the structural differences in the training and test data properly.
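To make the train/test-mismatch point concrete, here is a minimal covariate-shift sketch. Everything in it is invented for illustration: training inputs come from N(0, 1), test inputs from N(1, 1), p(y = 1 | x) is shared between the two, and a fixed classifier’s test accuracy is recovered by reweighting training examples by the density ratio (the “additional insight” that corrects for the structural difference).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Invented setup: training inputs ~ N(0, 1), test inputs ~ N(1, 1),
# but p(y = 1 | x) = sigmoid(x) is the same in both.
x_train = rng.normal(0.0, 1.0, n)
x_test = rng.normal(1.0, 1.0, n)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
y_train = rng.random(n) < sigmoid(x_train)
y_test = rng.random(n) < sigmoid(x_test)

# A fixed classifier: predict 1 iff x > 0.
correct_train = (x_train > 0) == y_train
correct_test = (x_test > 0) == y_test

# Density ratio N(1,1)/N(0,1) simplifies to exp(x - 1/2).
w = np.exp(x_train - 0.5)

naive = correct_train.mean()                       # plain training accuracy
reweighted = np.average(correct_train, weights=w)  # importance-weighted estimate
test_acc = correct_test.mean()
# reweighted tracks test_acc; naive does not.
```

The naive training accuracy is a biased estimate of test accuracy precisely because the tacit “same distribution” assumption fails; the importance weights repair it.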
I’m having trouble following this (I’m not actually that versed in statistics, and I don’t know what you mean by ‘assignment probability’). But it seems to me that we only think Horvitz-Thompson is a good answer because of tacit assumptions we hold about the data.
We have X, let’s say baseline facts about a person (X are features we would use to build a classifier in machine learning). We have a probability of a binary event A, conditional on X: p(A | X). If A is 1, we don’t see the value of Y. If A is 0, we see the value of Y. p(A=0 | X) is what I call the “assignment probability” and p(A | X) is what the OP calls the “importance sampling distribution.” It is also sometimes called “the propensity score.”
And yes you are right, Horvitz-Thompson only comes into play because somehow p(A=0 | X) played a very important role in determining the data on X,Y we actually see. But if we were to write the likelihood function for X,Y, the probability p(A | X) would not appear in this function. So any method that just uses the likelihood function will ignore p(A | X). What saves Bayesians is their ability to insert p(A | X) into the prior (they have nowhere else to put it).
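A small simulation of the setup just described may help (all the numbers here are invented, not R&W’s): X is a baseline covariate, Y is observed only when A = 0, and p(A = 0 | X) depends on X. The Horvitz-Thompson style inverse-probability-weighted average uses p(A = 0 | X) and recovers E[Y]; the plain average of the observed Ys does not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Invented setup: X ~ Uniform(0, 1); both p(Y = 1 | X) and the
# assignment probability p(A = 0 | X) depend on X.
x = rng.random(n)
theta = 0.2 + 0.6 * x            # p(Y = 1 | X); so E[Y] = 0.5
pi = 0.9 - 0.8 * x               # p(A = 0 | X): Y is seen mostly for small X
y = rng.random(n) < theta
seen = rng.random(n) < pi        # A = 0  <=>  Y observed

naive = y[seen].mean()                  # complete-case average: biased here
ht = np.sum(y[seen] / pi[seen]) / n     # Horvitz-Thompson: uses p(A=0 | X)
# ht lands near E[Y] = 0.5; naive drifts toward the small-X Ys (about 0.42
# under these invented numbers).
```

Note that nothing in the likelihood for the observed (X, Y) pairs involves pi, which is exactly why a method that only consults the likelihood leaves this correction on the table.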
Ah, R&W’s pi function. This is kind of tricky, because it doesn’t seem like it should hold information unless it correlates with R&W’s theta (the probability that Y = 1).
If pi and theta were guaranteed independent, would Horvitz-Thompson in any meaningful way outperform Sum(Y) / Sum(R), that is, the average observed value of Y in the cases where Y is observed?
The reason p(A | X) holds info is because it determines what Y we see. Say for a moment A was independent of X, so we saw Y if a fair coin came up heads (p(A = 0) = 0.5). Then the Ys we see are the same as the Ys we don’t see, because the coin doesn’t look at anything about Y to determine whether to come up heads.
But if the coin depends on X, the worry is that the Ys we see may come with particular Xs and not others. So if we just ignore the Ys we don’t see, the Ys we do see give a biased view of the underlying Y distribution, with the bias driven by p(A | X).
Somehow, to correctly deal with this bias, we must involve p(A|X) (explicitly or implicitly).
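A quick sanity check of the coin argument, again with invented numbers: when the coin ignores X, the observed and unobserved Ys look alike; when the coin looks at X, they diverge.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Invented setup: p(Y = 1 | X) rises with X.
x = rng.random(n)
y = rng.random(n) < (0.2 + 0.6 * x)

# Fair coin: whether we see Y ignores X (and hence Y) entirely.
seen_fair = rng.random(n) < 0.5
gap_fair = abs(y[seen_fair].mean() - y[~seen_fair].mean())   # ~0

# Coin that looks at X: we mostly see Y when X is small.
seen_dep = rng.random(n) < (0.9 - 0.8 * x)
gap_dep = abs(y[seen_dep].mean() - y[~seen_dep].mean())      # clearly nonzero
```

In the first case ignoring the missing Ys is harmless; in the second it is not, and any correct fix must involve p(A | X) somewhere.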
Sure. But if we know or suspect any correlation between A and Y, there’s nothing strange about the common information between them being expressed in the prior, right?
Granted, H-T will have nice worst-case performance if we’re not confident about A and Y being independent, but that reduces to this debate http://lesswrong.com/lw/k9c/can_noise_have_power/.
I wrote up a pretty detailed reply to Luke’s question: http://lesswrong.com/lw/kd4/the_power_of_noise/