TIL: Bayes' Theorem is Just Addition (In the Right Units)

I have a confession: for years, I lied to myself about understanding Bayes' Theorem.
I could write the formula on a whiteboard. I could pass the exams. But I didn’t feel it. In my mind, it was always a hostile mess of algebra that looked like this:
$P(H \mid E) = \dfrac{P(E \mid H)\, P(H)}{P(E)}$
It felt like a chore. Especially that denominator, P(E), which asks you to sum up every possible way the universe could have turned out just to normalize a single result.
Then I discovered Bits of Evidence.
The same bits Claude Shannon defined in 1948 to measure information turn out to be the natural units for reasoning under uncertainty.
This concept has deep roots: I. J. Good, one of Alan Turing’s colleagues at Bletchley Park, formalized this idea of measuring evidence in logarithmic units. Good called it the “Weight of Evidence,” building on Turing’s wartime work on Bayesian inference for codebreaking.
When you use the right units, the scary formula vanishes. Conditional probability inference stops being algebra and becomes a simple tally of information.
The 50/50 Intuition: What is Zero?
We use the phrase "50/50" every day without thinking. What does it actually say? It's 50 cases supporting the idea and 50 cases against it.
On a truly linear scale of information, "I don't know" should be zero. When people say 50/50, they are describing a state of complete ignorance. You have exactly zero bits of information pushing you in either direction.
To reach this clarity, we move to Log-Odds space and measure everything in Bits. A bit is simply the amount of information needed to choose between two equally likely possibilities.
The Case of the Rare Disease
Let’s see this in action with a medical example. A patient is worried about a rare condition (H). The question: Is he Sick, or Healthy?
Imagine a giant Balance Scale.
Left Side: Evidence for Healthy (Negative Bits)
Right Side: Evidence for Sick (Positive Bits)
1. The Starting Point: True Ignorance
Before knowing anything, the scale is perfectly level at 0 bits. This is the true prior: complete ignorance, a 50/50 state with no information pushing in either direction.
2. The Population (The Heavy Anchor)
The first piece of evidence: look at the world. This disease is rare. In a population of 10,000 people, typically 12 are sick and 9,988 are healthy.
In textbooks, this base rate is called the Prior. There's nothing magical about it. It's just the first piece of evidence, and a heavy one. It loads a huge weight on the Healthy side of the scale.
$b_{\text{prior}} = \log_2(12) - \log_2(9988) = 3.585 - 13.286 = -9.70\ \text{bits}$
The Scale Tips: The scale is deep in the red. Without additional tests or symptoms, there's a 9.70-bit load on the Healthy side. Rightfully skeptical.
3. A Weak Symptom (Headache)
The patient has a headache. In our data, 25% of sick patients have headaches, but 10% of healthy patients do too. Not very distinctive. The weight of this evidence is:
$b_{\text{headache}} = \log_2(25\%) - \log_2(10\%) = \log_2(2.5) = +1.32\ \text{bits}$
A weak clue. It barely moves the needle.
4. The 99% Accurate Test
A lab test is run that is 99% accurate. Most people think that a positive result means they are 99% likely to be sick. But the weight of a 99% test is just:
$b_{\text{test}} = \log_2(99\%) - \log_2(1\%) = \log_2(99) = +6.63\ \text{bits}$
Strong, but is it enough? Let's tally up:
$\underbrace{-9.70}_{\text{Prior}} + \underbrace{1.32}_{\text{Headache}} + \underbrace{6.63}_{\text{Test}} = -1.75\ \text{bits}$
This is the shocker. Even with a headache and a 99% accurate positive test, the scale is still leaning toward Healthy. You are still more likely to be fine than to be sick.
Why? Because the rare-disease prior was so heavy (−9.70 bits) that even a strong positive test (+6.63 bits) only managed to partially cancel it out. You haven't crossed the line. You're at roughly 23:77 odds, about a 23% chance of being sick.
This is exactly why doctors order a second test.
5. The Confirming Scan
An imaging scan comes back. This one has 90% sensitivity with only 2% false positives:
$b_{\text{scan}} = \log_2(90\%) - \log_2(2\%) = \log_2(45) = +5.49\ \text{bits}$
Not as strong as the lab test, but absolutely critical. Now the full tally:
$\underbrace{-9.70}_{\text{Prior}} + \underbrace{1.32}_{\text{Headache}} + \underbrace{6.63}_{\text{Test}} + \underbrace{5.49}_{\text{Scan}} = +3.74\ \text{bits}$
That’s roughly 13:1 odds, about 93% certainty. Enough to act on.
6. The Master Tally
Notice: no multiplication anywhere. The bit-weight for each source was calculated independently and added up:
| Source of Information | Sick ($W^+$) | Healthy ($W^-$) | Bits of Evidence |
|---|---|---|---|
| Population Base Rate | 12 | 9,988 | −9.70 |
| Headache (% with symptom) | 25% | 10% | +1.32 |
| Lab Test (% positive) | 99% | 1% | +6.63 |
| Scan (% positive) | 90% | 2% | +5.49 |
| FINAL SUM | | | +3.74 |
No single piece of evidence was sufficient. The headache was weak. The 99% test was strong but not strong enough to overcome the prior alone. Only the combination crossed the threshold.
The Universal Rule
Every piece of information, whether a population study, a symptom, or a lab test, is just a weight on the scale. That weight is measured by comparing the count of cases where the evidence appears for the hypothesis ($W^+$) vs. the cases where it appears against it ($W^-$):
$\text{bits} = \log_2(W^+) - \log_2(W^-)$
This formula is an omnivore: counts, weights, percentages all work, as long as W+ and W− share the same units. The ratio eats the denominator. Everything else is just adding and subtracting these bits.
Why Addition Works (and Probability Fails)
The reason Bayes is difficult is that Probability is non-linear. It compresses information at the extremes.
Think of it like a distorted map projection. Just as the Mercator projection stretches polar regions (making Greenland appear as large as Africa), the probability scale does the opposite: it compresses the extremes. The difference between 99% and 99.9% looks like a tiny 0.9%, but in the world of information it requires 3.3 bits of evidence, roughly the same amount of work it takes to go from a 50/50 guess to 90% certainty.
Bits map information perfectly. Probability is the distorted projection we inherited from history.
This Explains So Much in ML
Once you see it, connections click into place everywhere:
Naive Bayes classifiers? Literally addition. You compute bits from each feature and sum them up. The naive part is assuming features are independent (see caveat in Appendix), but the addition itself is exactly correct in bit space.
Neural network outputs? They produce logits (log-odds, measured in natural-log units called nats; divide by ln 2 and you have bits) before the final sigmoid or softmax. The network learns how much evidence each input provides. The sigmoid at the end just converts log-odds back to probabilities for human consumption.
Logistic regression? It's called logistic because it works in log-odds space. Each coefficient directly tells you how much evidence its feature contributes, and dividing by ln 2 reads it off in bits.
Cross-entropy loss? It measures information (in bits or nats) between predicted and true distributions. Same mathematical structure.
Try It Yourself
Next time you encounter a probability problem:
Calculate Bits of Evidence for each observation: $\log_2(W^+) - \log_2(W^-)$
Add them all up
Convert to probability only at the end if needed: $p = 2^{\text{bits}} / (1 + 2^{\text{bits}})$
You’ll find it’s faster, clearer, and builds better intuition than wrestling with Bayes’ formula.
Summary
Start at Zero. True prior is 0 bits, complete ignorance.
Bits are Linear. Evidence accumulates by adding. Each clue is just $\log_2(W^+) - \log_2(W^-)$.
Cross the Line. Belief isn’t a percentage; it’s just having a positive balance in your bit tally.
Next time you see a 99% accurate claim, don’t ask for the probability. Ask: How many bits is that? You’ll find the math finally feels like common sense.
Appendix: Is it really so?
Is this just a metaphor, or is it the actual math?
Here is how we get from the original multi-step formula to simple addition.
Start with the classic Bayes Rule:
$P(H \mid E) = \dfrac{P(E \mid H)\, P(H)}{P(E)}$
Write it for the negation (¬H). In our medical example, ¬H is Healthy, the null hypothesis we’re trying to reject:
$P(\neg H \mid E) = \dfrac{P(E \mid \neg H)\, P(\neg H)}{P(E)}$
Divide the first equation by the second. This cancels out that annoying P(E) and moves towards Odds notation:
$\dfrac{P(H \mid E)}{P(\neg H \mid E)} = \dfrac{P(H)}{P(\neg H)} \cdot \dfrac{P(E \mid H)}{P(E \mid \neg H)}$
Take the logarithm ($\log_2$) of both sides. Since $\log(a \cdot b) = \log(a) + \log(b)$:
$\underbrace{\log_2\!\left(\dfrac{P(H \mid E)}{P(\neg H \mid E)}\right)}_{\text{New Belief (Bits)}} = \underbrace{\log_2\!\left(\dfrac{P(H)}{P(\neg H)}\right)}_{\text{Prior (Bits)}} + \underbrace{\log_2\!\left(\dfrac{P(E \mid H)}{P(E \mid \neg H)}\right)}_{\text{Evidence (Bits)}}$
Since any odds ratio is just $W^+/W^-$, and $\log(a/b) = \log(a) - \log(b)$, we reach the final, additive form:
$\text{Total Bits} = \left[\log_2(W^+_{\text{prior}}) - \log_2(W^-_{\text{prior}})\right] + \left[\log_2(W^+_{E}) - \log_2(W^-_{E})\right] + \dots$
There is no approximation here. This is the exact, mathematical identity of Bayesian inference. We just chose to do it in the natural units of information.
Why Units Cancel
In the main text, counts (12 sick vs 9,988 healthy) were mixed with percentages (25% vs 10%). How is this legal?
Because the formula always takes a ratio. Consider the headache evidence: 25% of sick patients have headaches, 10% of healthy patients do. Whether written as:
$\log_2(25\%) - \log_2(10\%)$
$\log_2(0.25) - \log_2(0.10)$
$\log_2(25) - \log_2(10)$ (pretending a 100-person study)
...the result is always $\log_2(2.5) \approx 1.32$ bits. The shared denominator (100, or any sample size) cancels in the ratio. This is why the formula is an omnivore: any units work, as long as $W^+$ and $W^-$ are measured in the same units.
The Independence Caveat
The additive property (total bits = sum of individual bit-weights) holds only when pieces of evidence are conditionally independent given H.
Example of when it breaks: Suppose the lab test is more likely to be ordered because the patient has a headache. Then headache and positive test are correlated, and adding their bits double-counts some evidence.
This is why Naive Bayes is called naive. In practice, evidence is rarely perfectly independent. When clues are positively correlated, the simple sum overstates the evidence, but it is often good enough for intuition. For rigorous analysis, you'd need the joint likelihood:
$b_{E_1,E_2} = \log_2\!\left(\dfrac{P(E_1, E_2 \mid H)}{P(E_1, E_2 \mid \neg H)}\right)$
...which doesn’t decompose into a simple sum.
The Bit-to-Certainty Translator
| Belief (Bits) | Odds | Probability (P) | Intuition |
|---|---|---|---|
| −10 bits | 1 : 1,024 | ~0.1% | Virtually impossible |
| −3.3 bits | 1 : 10 | ~9% | Unlikely |
| 0 bits | 1 : 1 | 50% | Complete uncertainty |
| +3.3 bits | 10 : 1 | ~91% | Strong evidence |
| +10 bits | 1,024 : 1 | ~99.9% | Beyond reasonable doubt |
When you see neural networks outputting logits, or Naive Bayes classifiers summing features, or cross-entropy loss being minimized—you’re seeing this same structure. Evidence is information, and it accumulates linearly.
Probability has served us well, just as the Mercator projection has served sailors for centuries. But we have better coordinates now. If a change of units objectively simplifies the math, shouldn't we use it?