GuySrinivasan comments on A follow-up probability question: Data samples with different priors

GuySrinivasan 26 Oct 2012 17:34 UTC
0 points
If the sample data that you’re using to estimate P(Q(s) | numStarts=4, O(s)) = 0.9 has the same P(numStarts) distribution as the real data you’re going to run this over, then you don’t need to do anything special; just estimate P(Q(s) | O(s)) directly, caring not about numStarts, and go from there. Since you’re not doing that, I assume your sample data and real data have different numStarts distributions.

Here is the information I assume you have to work with. Call numStarts=S.

- P(Q | S=k) = 1/k
- P(S) in real data
- Pd(Q | O,S) from an expert classifier over sample data d, with Pd(S) instead of P(S), and in particular your data d is basically a list of judgments (Q,O,S) that I can aggregate however I choose.
- P(Q | N) from another bunch of expert classifiers independent of the first

What you’d like is to be able to compute P(Q | N,O) on real data. And to make it nice, do that by P(Q | N,O) = 1 − 1 / (1 + Odds(Q | N,O)) with Odds(Q | N,O) = Odds(Q) L(N|Q) L(O|Q)
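A minimal sketch of that odds form in Python. The function names and the numeric inputs are made up for illustration; only the two formulas come from the comment, and multiplying the likelihood ratios relies on N and O being independent given Q.

```python
def combine_odds(prior_odds, *likelihood_ratios):
    """Odds(Q | N,O) = Odds(Q) * L(N|Q) * L(O|Q).

    Assumes the pieces of evidence are conditionally independent given Q.
    """
    odds = prior_odds
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds


def odds_to_probability(odds):
    """P = 1 - 1 / (1 + odds)."""
    return 1.0 - 1.0 / (1.0 + odds)


# Hypothetical numbers: a prior P(Q) = 0.25 gives Odds(Q) = 1/3.
prior_odds = 0.25 / (1 - 0.25)
# L(N|Q) = 2 and L(O|Q) = 3 are invented values, not from the comment.
posterior_odds = combine_odds(prior_odds, 2.0, 3.0)
posterior = odds_to_probability(posterior_odds)
print(posterior_odds, posterior)
```

With these invented inputs the posterior odds come out at about 2, i.e. a probability of about 2/3.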

You already know how to find Odds(Q) and L(N|Q). The question is how to find L(O|Q) on real data given that you have Pd(Q | O,S) rather than P(Q | O,S), the expert’s judgment on sample data d rather than real data. The answer as far as I can tell, unless I’ve missed part of your question or assumptions, is as follows:

L(O|Q) = sum(P(O|Q,S) P(S)) / sum(P(O|~Q,S) P(S))

[note that P(O|Q,S) is the same in the sample data d and in the real data; only the distribution of S differs]

P(O|Q,S) = P(Q,O|S) / P(Q|S), so (with C=Count)

P(O|Q,S) = C(Q,O,S)/C(S) / (1/k) = k C(Q,O,S)/C(S) and P(O|~Q,S) = (k/(k-1)) C(~Q,O,S)/C(S)

thus

L(O|Q) = sum(k C(Q,O,S=k)/C(S=k) P(S=k)) / sum((k/(k-1)) C(~Q,O,S=k)/C(S=k) P(S=k))
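To check the per-bin estimates by hand, here is a small Python sketch. The function names and the counts are hypothetical; the formulas are the count-based ones above, with P(Q | S=k) = 1/k.

```python
def p_o_given_q(count_q_o, count_s, k):
    """P(O|Q,S=k) = C(Q,O,S)/C(S) / P(Q|S), with P(Q|S=k) = 1/k."""
    return k * count_q_o / count_s


def p_o_given_not_q(count_notq_o, count_s, k):
    """P(O|~Q,S=k) = C(~Q,O,S)/C(S) / P(~Q|S), with P(~Q|S=k) = (k-1)/k."""
    if k == 1:
        # With P(Q|S=1) = 1 there are no ~Q cases to condition on.
        raise ValueError("P(~Q | S=1) = 0, so P(O|~Q,S=1) is undefined")
    return (k / (k - 1)) * count_notq_o / count_s


# Made-up counts for the bin S=4: 200 judgments, 30 with (Q,O), 45 with (~Q,O).
print(p_o_given_q(30, 200, 4))      # 4 * 30 / 200
print(p_o_given_not_q(45, 200, 4))  # (4/3) * 45 / 200
```

If that bin really has Q in one quarter of its 200 judgments, these agree with the direct ratios 30/50 and 45/150.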

so to calculate L(O|Q) on your real data, first note P(S=k) on your real data, then on your sample data d say
```
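foreach D in d,
.   C[D->k]++                       // C(S=k): every judgment in bin k
.   if (D->O) C[D->Q,D->k]++        // C(Q,O,S=k) or C(~Q,O,S=k), keyed by D's Q value
foreach k
.   L[numerator] += Ps[k] * C[Q,k]/C[k] / (1/k)
.   // use the ~Q-and-O count here, not C[k]-C[Q,k], which would also include the ~O cases
.   if (k > 1) L[denominator] += Ps[k] * C[~Q,k]/C[k] / ((k-1)/k)   // (k-1)/k = 0 at k=1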
```
L(O|Q) = L[numerator]/L[denominator]
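For anyone who wants something runnable, here is one way the counting could look in Python. It is a sketch under assumptions: the name `likelihood_ratio`, the `(q, o, k)` tuple layout, and the toy data are all made up, bins with no sample data are skipped, and the k=1 bin is left out of the denominator because ~Q cannot occur there.

```python
from collections import Counter


def likelihood_ratio(sample, real_s_dist):
    """Estimate L(O|Q) for real data from binned sample judgments.

    sample: iterable of (q, o, k) tuples; q and o are booleans, k = numStarts.
    real_s_dist: dict mapping k -> P(S=k) measured on the real data.
    """
    count_s = Counter()   # C(S=k)
    count_qo = Counter()  # keyed by (q, k); counts only judgments where o is True
    for q, o, k in sample:
        count_s[k] += 1
        if o:
            count_qo[(q, k)] += 1

    numerator = 0.0
    denominator = 0.0
    for k, p_s in real_s_dist.items():
        if count_s[k] == 0:
            continue  # assumption: ignore bins the sample never saw
        numerator += p_s * k * count_qo[(True, k)] / count_s[k]
        if k > 1:  # P(~Q | S=1) = 0, so that bin has no ~Q term
            denominator += p_s * (k / (k - 1)) * count_qo[(False, k)] / count_s[k]
    return numerator / denominator


# Toy sample: 100 judgments with S=2 and 200 with S=4 (invented counts).
sample = (
    [(True, True, 2)] * 40 + [(True, False, 2)] * 10
    + [(False, True, 2)] * 10 + [(False, False, 2)] * 40
    + [(True, True, 4)] * 30 + [(True, False, 4)] * 20
    + [(False, True, 4)] * 45 + [(False, False, 4)] * 105
)
ratio = likelihood_ratio(sample, {2: 0.5, 4: 0.5})
print(ratio)
```

This follows the sums as written, weighting each bin by P(S=k). A strict expansion of P(O|Q) over S would weight each bin by P(S=k | Q) instead, so it is worth comparing the two on your own data.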

You have to bin your training data; you don’t have to bin your test data.

Edit: I found and fixed a couple of errors so there are probably more. Think, debug, and test for yourself as usual. :D