Use conditional probabilities to clear up error rate confusion

It’s common in my part of the data science world to speak of model accuracy for classifiers in terms of four quantities: True Positive Rate (AKA Sensitivity, Recall), False Positive Rate, True Negative Rate (AKA Specificity), and False Negative Rate. But after more than 5 years in the field, I still have a hard time remembering which is which. Recently I found that writing conditional probabilities instead makes things clearer for me. For a binary classifier, where the response (the actual value) is one of {true, false}, the following rates and probabilities are the same (a short code sketch follows the list):

  • True Positive Rate is the same as the probability of predicting “true” when the actual label is “true”. Or: P(predicted true | actually true).

  • False Positive Rate: P(predicted true | actually false)

  • True Negative Rate: P(predicted false | actually false)

  • False Negative Rate: P(predicted false | actually true)
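
To make the list concrete, here is a minimal sketch in Python (the arrays y_true and y_pred are made-up placeholders, not from any particular library) that computes each rate directly as a conditional frequency:

    import numpy as np

    # Hypothetical example: actual labels and a classifier's predictions.
    y_true = np.array([True, True, False, False, True, False, True, False])
    y_pred = np.array([True, False, False, True, True, False, True, False])

    # Each rate is a conditional frequency: restrict to the relevant
    # "actually ..." subset, then see how often "true"/"false" was predicted.
    tpr = np.mean(y_pred[y_true])     # P(predicted true  | actually true)
    fnr = np.mean(~y_pred[y_true])    # P(predicted false | actually true)
    fpr = np.mean(y_pred[~y_true])    # P(predicted true  | actually false)
    tnr = np.mean(~y_pred[~y_true])   # P(predicted false | actually false)

    print(tpr, fnr, fpr, tnr)  # 0.75 0.25 0.25 0.75 for this made-up data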

The same goes for the non-binary case. If I have n possible outcomes, then there is a False Positive Rate for each outcome i, namely P(predicted to be outcome i | not actually outcome i), and so on for the other rates.
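
As a sketch of that generalization (the class names and data here are invented), each outcome gets its own one-vs-rest rates:

    import numpy as np

    # Hypothetical multiclass labels and predictions.
    y_true = np.array(["cat", "dog", "bird", "cat", "dog", "bird", "cat"])
    y_pred = np.array(["cat", "dog", "cat", "cat", "bird", "bird", "dog"])

    for outcome in np.unique(y_true):
        actually_i = y_true == outcome
        predicted_i = y_pred == outcome
        tpr_i = np.mean(predicted_i[actually_i])    # P(predicted i | actually i)
        fpr_i = np.mean(predicted_i[~actually_i])   # P(predicted i | not actually i)
        print(outcome, tpr_i, fpr_i)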

I think those probability expressions are what I’ve been mentally translating the rate terms into every time I hear or say something like “False Positive Rate”. For me, that subconscious translation was slow and annoying, and it bogged down my thinking. Writing the probability expressions directly when displaying a confusion matrix, or on the axes of a ROC curve, has cleared up my thinking and my conversations about classifier error rates. You don’t have to remember which is which; you can just read off the conditional probability.
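
For example, nothing stops us from putting the probability expressions right on the ROC axes. A minimal sketch, assuming scikit-learn and matplotlib are available (y_true and y_score are placeholder arrays):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve

    # Hypothetical actual labels and classifier scores.
    y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
    y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

    fpr, tpr, _ = roc_curve(y_true, y_score)

    plt.plot(fpr, tpr)
    # Label the axes with the conditional probabilities instead of "FPR"/"TPR".
    plt.xlabel("P(predicted true | actually false)")
    plt.ylabel("P(predicted true | actually true)")
    plt.show()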


Wait, are those the right probabilities?

All I meant to write for this post was the single section above. So you can stop here and still be good. But… the way these error rates are commonly calculated might not be quite right. I’ve always calculated e.g. True Positive Rate as s/n, where n is the “number of actual positives” and s is the “number of those positives predicted correctly”. But the most basic frequency expression for a rate, where one knows nothing (nothing!) except that it is physically possible for an outcome to be 1 or 0, and additionally that there have been s 1′s in a sample of size n, is Laplace’s rule of succession: (s + 1)/(n + 2). So we should probably be using the denominator (2 + “number of actual positives”) and the numerator (1 + “number of those positives predicted correctly”) for our True Positive Rate calculation instead. And so on for the other rates.
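
As a sketch with made-up counts, the two estimates differ slightly:

    # Hypothetical counts: n actual positives, s of them predicted correctly.
    n = 40
    s = 32

    naive_tpr = s / n                # the usual estimate: 0.8
    laplace_tpr = (s + 1) / (n + 2)  # rule of succession: 33/42, roughly 0.786

    print(naive_tpr, laplace_tpr)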

(Why is (s + 1)/(n + 2), instead of s/n, the right posterior expression for this state of knowledge? It’s because the knowledge that either “right” or “wrong” can happen, but knowing nothing else at all, is the same state of knowledge as knowing that there were at most 2 outcomes, but not knowing whether both were possible, and then seeing one “right” and one “wrong”. Mathematically, if all you know is that both outcomes are possible, but you haven’t seen any labels / done any experiments, then s and n are both 0, giving (0 + 1)/(0 + 2) = 1/2. In other words, knowing only that both are possible, your prior probability for each is 1/2.)
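
For completeness, here is a sketch of the standard derivation (not spelled out in the post), assuming a uniform prior on the unknown rate p:

    p \sim \mathrm{Beta}(1, 1) \quad \text{(uniform prior: both outcomes possible, nothing else known)}
    p \mid s, n \sim \mathrm{Beta}(s + 1,\; n - s + 1)
    P(\text{next is a success} \mid s, n) = \mathbb{E}[p \mid s, n]
        = \frac{s + 1}{(s + 1) + (n - s + 1)} = \frac{s + 1}{n + 2}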

(By the way, it isn’t always the case that you know both outcomes are possible: say you want to figure out whether a given type of candy has the same reaction with Coke as Mentos does. You drop a Jolly Rancher into a Coke bottle, and nothing happens. In this situation, you don’t know that it’s possible for the Jolly Rancher to ever react with the Coke. For that case, (s + 1)/(n + 2) is not the appropriate posterior.)

Is the rule of succession appropriate for the binary classifier case? Only if

  1. We know both “right” and “wrong” are possible outcomes, and

  2. We don’t know anything else about whether a particular prediction will be right or wrong.

#1 is easy: yes, it is possible for a model to predict correctly, and also possible for a model to predict incorrectly. #2 is harder. Do we really know nothing beyond “each prediction could be right or wrong” when computing something like a ROC curve? At the very first training, and the first evaluation of the classifier on out-of-sample data, this really might be our state of knowledge! Later investigation could tell us that certain regions of the parameter space are more difficult than others, and at that point it gets difficult; I don’t know the right expression for the prior in such cases. Frankly, I always throw that extra information away when computing the different types of error rates, because I don’t know how to incorporate it. Many do the same. But if we must throw away information for expediency, we should throw away as little as we can. Using s/n is throwing away even the prior knowledge that both outcomes are possible; we can retain that information by using (s + 1)/(n + 2) instead.
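
Putting the two halves of the post together, here is a sketch of what the adjusted rates might look like in code (the function names and interface are my own, not any standard API):

    import numpy as np

    def laplace_rate(s, n):
        """Laplace's rule of succession: (s + 1) / (n + 2)."""
        return (s + 1) / (n + 2)

    def error_rates(y_true, y_pred):
        """The four classifier rates, as rule-of-succession estimates."""
        y_true = np.asarray(y_true, dtype=bool)
        y_pred = np.asarray(y_pred, dtype=bool)
        n_pos = np.sum(y_true)    # number of actual positives
        n_neg = np.sum(~y_true)   # number of actual negatives
        return {
            "P(predicted true  | actually true)":  laplace_rate(np.sum(y_pred & y_true), n_pos),
            "P(predicted false | actually true)":  laplace_rate(np.sum(~y_pred & y_true), n_pos),
            "P(predicted true  | actually false)": laplace_rate(np.sum(y_pred & ~y_true), n_neg),
            "P(predicted false | actually false)": laplace_rate(np.sum(~y_pred & ~y_true), n_neg),
        }

    # Hypothetical usage:
    for name, value in error_rates([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]).items():
        print(name, round(value, 3))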
