Log-odds are better than probabilities


[This is a cross-post from my blog at aizi.substack.com. I’m sure someone has made a point like this before, but I don’t know any specific instances and I wanted to give my take on it.]

At my previous job I worked on ML classifiers, and I learned a useful alternative way to think about probabilities which I want to share. I’m referring to log-odds aka logits, where a probability p is represented by logit(p):=log(p/(1-p))[1].
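
Here’s a minimal sketch in Python of the conversion in both directions (the function names are mine, and I’m using natural logs, so the units are nats):

```python
import math

def logit(p):
    """Probability in (0, 1) to log-odds (natural log, so the units are nats)."""
    return math.log(p / (1 - p))

def sigmoid(l):
    """Log-odds back to a probability."""
    return 1 / (1 + math.exp(-l))

print(logit(0.5))    # 0.0: a coinflip is the neutral point
print(logit(0.9))    # ~2.20 nats
print(sigmoid(2.2))  # ~0.90
```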

I claim that, at least for Bayesian updates and binary prediction, it can be better to think in terms of log-odds than probabilities, and this post is laying out that case.

Log-odds simplify Bayesian calculations

Do you do Bayesian updates in your head? I didn’t, in part because the classic Bayes formula is kinda bad to work with:

\[P(H|E)= \frac{P(H)P(E|H)}{P(E)}\]

The first problem is that you need to know P(E), the chance that E is true at all. But the value of P(E) should be irrelevant since we know we live in a timeline where E is true! Of course you can rewrite the formula to hide P(E), but at a complexity cost:

\[P(H|E)=\frac{P(H)P(E|H)}{P(E|H)P(H)+P(E|\neg H)P(\neg H)}\]

For me, this calculation requires too many operations and cached numbers to do easily in my head.

But more importantly, these formulas don’t emphasize how P(H) was updated. Sure, you can say P(H) is being multiplied by P(E|H)/P(E), but that number isn’t really comparable across priors. For instance, if P(E|H)/P(E)=2, that’s a small update if your prior is P(H)=.1 (taking you from 10% to 20%), a huge update if P(H)=.5 (taking you from a coinflip to certainty), and impossible for P(H)>.5. So “P(E|H)/P(E)=2” isn’t a meaningful intermediate calculation.

Now let’s compare the log-odds version. I’ll write L(H) for logit(P(H)):

\[\begin{eqnarray*} L(H|E) &=& \log\left( \frac{P(H|E)}{P(\neg H | E)}\right)\\ &=& \log\left( \frac{\left(\frac{P(H)P(E|H)}{\cancel{P(E)}} \right)}{\left( \frac{P(\neg H)P(E| \neg H)}{\cancel{P(E)}}\right)}\right)\\ &=& \log\left(\frac{P(H) P(E|H)}{P(\neg H)P(E|\neg H)} \right)\\ &=& \log \left( \frac{P(H)}{P(\neg H)} \right)+\log \left( \frac{P(E|H)}{P(E|\neg H)} \right) \\ &=& L(H)+ \log \left( \frac{P(E|H)}{P(E|\neg H)} \right) \end{eqnarray*}\]

Omitting intermediate steps:

\[\begin{eqnarray*} L(H|E) &=& L(H)+ \log \left( \frac{P(E|H)}{P(E|\neg H)} \right) \end{eqnarray*}\]

Now that’s clear! A Bayesian update is just adding a new term, the log-ratio of seeing this evidence when the hypothesis is true vs. when it’s false. For me, this is a very easy calculation to do in my head (only two binary operations and a log, and you don’t need to cache numbers between steps), and when I had to do Bayesian updates in my head I would convert to log-odds space and calculate them there.

But I want to claim something stronger: the sheer simplicity of the log-odds Bayes rule suggests we’re thinking in the right terms. Our intermediate calculation log(P(E|H)/P(E|¬H)) is comparable across priors and it connects in an intuitive way to what’s happening in the world. If we call that term “the strength of the evidence” (a name I think is justified), Bayesian updating is literally “adding the strength of the evidence to your prior”. That’s great! As a mathematician, I’d say this is so great (in terms of its elegance, simplicity, correspondence to our natural language, etc) that it’s a sign we’ve found the “right definition”.
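
To make that concrete, here’s a small Python sketch (with made-up numbers) checking that “prior plus strength of the evidence” in log-odds space gives the same answer as the standard Bayes formula:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(l):
    return 1 / (1 + math.exp(-l))

# Made-up numbers: prior P(H) = 0.1, and the evidence is 4x as likely
# under H as under not-H.
p_h = 0.1
p_e_given_h = 0.8
p_e_given_not_h = 0.2

# Log-odds update: posterior log-odds = prior log-odds + strength of evidence.
strength = math.log(p_e_given_h / p_e_given_not_h)
posterior_logodds = logit(p_h) + strength

# Standard Bayes formula for comparison.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
posterior_prob = p_h * p_e_given_h / p_e

print(sigmoid(posterior_logodds))  # ~0.3077
print(posterior_prob)              # ~0.3077, same answer
```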

That’s my main argument, but there are other minor perks too.

Probability changes lack meaning without base rates

“We’ve improved classification accuracy by 10 percentage points”. Is that good or bad? Taking a classifier from 50/50 correct/incorrect to 60/40 is a small improvement, but taking it from 89/11 to 99/1 is a massive improvement! The problem is that you really want to measure the change in both the correct and incorrect classes simultaneously. Log-odds do that because they’re a function of p/(1-p).
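
A quick sketch of that comparison in nats:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

# Both classifiers gain 10 percentage points of accuracy...
print(logit(0.60) - logit(0.50))  # ~0.41 nats
print(logit(0.99) - logit(0.89))  # ~2.51 nats, a much bigger jump
```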

Every number is meaningful

Log-odds space is the real line, which corresponds to probabilities in the open interval (0,1). Therefore you can be confident that any rescaling or shifting you do to finite log-odds will result in a new meaningful number, whereas for probabilities you have to be careful never to leave the interval [0,1]. The fact that you can’t uniformly increase a probability by 10% (or 10 percentage points) is an indication they’re not the “right” way to think of things.
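
As a sketch of the contrast: adding 0.1 to every probability breaks for anything above 0.9, but shifting every prediction by +1 nat in log-odds space always lands back inside (0, 1):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(l):
    return 1 / (1 + math.exp(-l))

for p in [0.01, 0.5, 0.95, 0.999]:
    shifted = sigmoid(logit(p) + 1.0)  # shift every prediction up by 1 nat
    print(p, "->", round(shifted, 4))  # always a valid probability in (0, 1)
```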

Certainty is infinite, and there’s a lot of space near infinity

Probabilities of 0 and 1 correspond to log-odds negative infinity and positive infinity, respectively. This is good because it reminds us that complete certainty is qualitatively different than any amount of uncertainty. For instance, it’s easy to see how any updating from certainty is like adding a finite number to infinity—it still results in infinity.

Also, very-high-confidence predictions are spread out in a sensible way in log-odds space. Predictions of 99% and 99.9% sound very similar in terms of probabilities, but in log-odds space they are ~2 and ~3 hartleys respectively, showing that the second one is much more confident.
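
A quick sketch in base-10 logs (so the units are hartleys):

```python
import math

def logit10(p):
    """Log-odds in base 10, i.e. hartleys."""
    return math.log10(p / (1 - p))

print(logit10(0.99))    # ~2.0 hartleys
print(logit10(0.999))   # ~3.0 hartleys
print(logit10(0.9999))  # ~4.0 hartleys: each extra 9 adds about one hartley
# logit10(1.0) would be +infinity: complete certainty is off the scale entirely.
```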

Negation is the complement and 0 is neutral

The complement operation (the probability of “not X”) on probabilities is P’=1-P, resulting in a neutral point at .5 (i.e. 50/50 odds). This is okay, but log-odds space wins because the complement operation is L’=-L, so the neutral point is 0. This is more aesthetically pleasing (and maybe has other benefits idk).

Probabilities are still good for other things

I hope I’ve convinced you that log-odds are a useful substitute for probabilities in some situations. However, I don’t want to pretend you should think of everything in terms of log-odds. Probabilities have some real perks, especially in cases where there are three or more options to track, so I wanted to shout out some of those:

  1. Probabilities are used even when thinking in terms of log-odds. For instance, when we wrote Bayesian updating as “add the strength of the evidence”, probabilities (not log-odds) are used to calculate the “strength of the evidence”. As far as I know, there’s no way to do updating just in terms of log-odds[2].

  2. Probabilities sum to 1, log-odds do not. If there are just two classes, the log-odds will be negatives of each other (and hence sum to 0), but if there are three or more classes you don’t have any nice rule about how a complete set of log-odds relate to each other[2]. Similarly, there’s no rule like “the total area under a PDF is 1” for log-odds.

  3. Probabilities are the right units for combining events (e.g. P(A or B) = P(A)+P(B)-P(A and B)), so they’re right for convolutions and other fundamental calculations, and I don’t know of anything like this for log-odds[2].

  1. ^

    The choice of log base doesn’t matter as long as you’re consistent, and the resulting units are called shannons/nats/hartleys for bases 2/e/10 respectively.

  2. ^

    Without cheating by converting your log-odds into probabilities.