If the sample data that you’re using to estimate P(Q(s) | numStarts=4, O(s)) = 0.9 has the same P(numStarts) distribution as the real data you’re going to run this over, then you don’t need to do anything special; just estimate P(Q(s) | O(s)) directly, caring not about numStarts, and go from there. Since you’re not doing that, I assume your sample data and real data have different numStarts distributions.
Here is the information I assume you have to work with. Call numStarts=S.
P(Q | S=k) = 1/k
P(S) in real data
Pd(Q | O,S) from an expert classifier over sample data d, with Pd(S) instead of P(S), and in particular your data d is basically a list of judgments (Q,O,S) that I can aggregate however I choose.
P(Q | N) from another bunch of expert classifiers independent of the first
What you’d like is to be able to compute P(Q | N,O) on real data. And to make it nice, do that by
P(Q | N,O) = 1 − 1 / (1 + Odds(Q | N,O))
with
Odds(Q | N,O) = Odds(Q) L(N|Q) L(O|Q)
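A minimal sketch of that odds-to-probability step, assuming you already have the prior odds and the two likelihood ratios as plain numbers. The function name and the example values are made up for illustration; they are not from your data.

```python
def posterior_from_odds(prior_odds, likelihood_ratios):
    """Multiply prior odds by each likelihood ratio, then convert to a probability."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    # P = 1 - 1/(1 + odds), which is the same as odds / (1 + odds)
    return 1.0 - 1.0 / (1.0 + odds)


# Hypothetical values: Odds(Q) = 1/3, L(N|Q) = 2, L(O|Q) = 3
p_q_given_n_o = posterior_from_odds(1 / 3, [2.0, 3.0])
```

Multiplying the ratios like this is only valid under the independence assumption stated above for the two sets of classifiers.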
You already know how to find Odds(Q) and L(N|Q). The question is how to find L(O|Q) on real data given that you have Pd(Q | O,S) rather than P(Q | O,S), the expert’s judgment on sample data d rather than real data. The answer as far as I can tell, unless I’ve missed part of your question or assumptions, is as follows:
L(O|Q) = sum(P(O|Q,S) P(S)) / sum(P(O|~Q,S) P(S))
[note that P(O|Q,S) is assumed to be the same in the sample data and the real data]
P(O|Q,S) = P(Q,O|S) / P(Q|S), so for S=k (with C=Count over the sample data d)
P(O|Q,S=k) = C(Q,O,S=k)/C(S=k) / (1/k) = k C(Q,O,S=k)/C(S=k) and P(O|~Q,S=k) = (k/(k-1)) C(~Q,O,S=k)/C(S=k)
thus
L(O|Q) = sum(k C(Q,O,S=k)/C(S=k) P(S=k)) / sum((k/(k-1)) C(~Q,O,S=k)/C(S=k) P(S=k))
so to calculate L(O|Q) on your real data, first note P(S=k) on your real data, then on your sample data d compute the numerator and denominator sums above and say
L(O|Q) = L[numerator]/L[denominator]
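Here is a rough Python sketch of that calculation, following the formula as written above. The assumptions are mine: the sample data d is a list of (q, o, s) tuples, the real-data distribution is a dict from k to P(S=k), and the function name and toy numbers are hypothetical. It skips any k that never appears in the sample, and it skips k=1 in the denominator because P(~Q | S=1) = 0 there.

```python
from collections import Counter


def likelihood_ratio(judgments, p_s_real, o):
    """Estimate L(O=o | Q), reweighting sample counts by the real-data P(S).

    judgments: iterable of (q, o, s) tuples from the sample data d
    p_s_real:  dict mapping k -> P(S=k) on the real data
    """
    c_s = Counter()    # C(S=k)
    c_qos = Counter()  # C(Q=q, O=o, S=k), keyed by (q, k)
    for q, obs, s in judgments:
        c_s[s] += 1
        if obs == o:
            c_qos[(bool(q), s)] += 1

    numerator = 0.0
    denominator = 0.0
    for k, p_k in p_s_real.items():
        if c_s[k] == 0:
            continue  # assumption: ignore bins with no sample judgments
        # k * C(Q,O,S=k)/C(S=k) * P(S=k)
        numerator += k * c_qos[(True, k)] / c_s[k] * p_k
        if k > 1:
            # (k/(k-1)) * C(~Q,O,S=k)/C(S=k) * P(S=k)
            denominator += (k / (k - 1)) * c_qos[(False, k)] / c_s[k] * p_k
    if denominator == 0.0:
        raise ValueError("no ~Q judgments with this O in the sample data")
    return numerator / denominator


# Toy sample data d (made up): four judgments each for S=2 and S=4
d = [
    (True, "a", 2), (True, "a", 2), (False, "a", 2), (False, "b", 2),
    (True, "a", 4), (False, "a", 4), (False, "b", 4), (False, "b", 4),
]
l_o_given_q = likelihood_ratio(d, {2: 0.5, 4: 0.5}, "a")
```

Only the counts come from the sample; the weights P(S=k) come from the real data, which is the whole point of the reweighting.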
You have to bin your training data; you don’t have to bin your test data.
Edit: I found and fixed a couple of errors so there are probably more. Think, debug, and test for yourself as usual. :D