(A standalone math post that I want to be able to link back to later/elsewhere)

There’s this statistical phenomenon where it’s possible for two multivariate distributions to overlap along any one variable, but be cleanly separable when you look at the entire configuration space at once. This is perhaps easiest to see with an illustrative diagram—

The denial of this possibility (in arguments of the form, “the distributions overlap along this variable, therefore you can’t say that they’re different”) is sometimes called the “univariate fallacy.” (Eliezer Yudkowsky proposes “covariance denial fallacy” or “cluster erasure fallacy” as potential alternative names.)

Let’s make this more concrete by making up an example with actual numbers instead of just a pretty diagram. Imagine we have some datapoints that live in the forty-dimensional space {1, 2, 3, 4}⁴⁰ that are sampled from one of two probability distibutions, which we’ll call PA and PB.

For simplicity, let’s suppose that the individual variables x₁, x₂, … x₄₀—the coördinates of a point in our forty-dimensional space—are statistically independent and identically distributed. For every individual xi, the marginal distribution of PA is—

If you look at any one xi-coördinate for a point, you can’t be confident which distribution the point was sampled from. For example, seeing that x₁ takes the value 2 gives you a ^{7}⁄_{4} (= 1.75) likelihood ratio in favor of that the point having been sampled from PA rather than PB, which is log₂(7/4) ≈ 0.807 bits of evidence.

That’s … not a whole lot of evidence. If you guessed that the datapoint came from PA based on that much evidence, you’d be wrong about 4 times out of 10. (Given equal (1:1) prior odds, an odds ratio of 7:4 amounts to a probability of (7/4)/(1 + ^{7}⁄_{4}) ≈ 0.636.)

And yet if we look at many variables, we can achieve supreme, godlike confidence about which distribution a point was sampled from. Proving this is left as an exercise to the particularly intrepid reader, but a concrete demonstration is probably simpler and should be pretty convincing! Let’s write some Python code to sample a point →x ∈ {1, 2, 3, 4}⁴⁰ from PA—

import random
def a():
return random.sample(
[1]*4 + # 1/4
[2]*7 + # 7/16
[3]*4 + # 1/4
[4], # 1/16
1
)[0]
x = [a() for _ in range(40)]
print(x)

Go ahead and run the code yourself. (With an online REPL if you don’t have Python installed locally.) You’ll probably get a value of x that “looks something like”

If someone off the street just handed you this →x without telling you whether she got it from PA or PB, how would you compute the probability that it came from PA?

Well, because the coördinates/variables are statistically independent, you can just tally up (multiply) the individual likelihood ratios from each variable. That’s only a little bit more code—

import logging
logging.basicConfig(level=logging.INFO)
def odds_to_probability(o):
return o/(1+o)
def tally_likelihoods(x, p_a, p_b):
total_odds = 1
for i, x_i in enumerate(x, start=1):
lr = p_a[x_i-1]/p_b[x_i-1] # (-1s because of zero-based array indexing)
logging.info(“x_%s = %s, likelihood ratio is %s”, i, x_i, lr)
total_odds *= lr
return total_odds
print(
odds_to_probability(
tally_likelihoods(
x,
[1/4, 7/16, 1/4, 1/16],
[1/16, 1/4, 7/16, 1/4]
)
)
)

If you run that code, you’ll probably see “something like” this—

INFO:root:x_1 = 2, likelihood ratio is 1.75
INFO:root:x_2 = 1, likelihood ratio is 4.0
INFO:root:x_3 = 2, likelihood ratio is 1.75
INFO:root:x_4 = 2, likelihood ratio is 1.75
INFO:root:x_5 = 1, likelihood ratio is 4.0
[blah blah, redacting some lines to save vertical space in the blog post, blah blah]
INFO:root:x_37 = 2, likelihood ratio is 1.75
INFO:root:x_38 = 3, likelihood ratio is 0.5714285714285714
INFO:root:x_39 = 3, likelihood ratio is 0.5714285714285714
INFO:root:x_40 = 4, likelihood ratio is 0.25
0.9999936561215961

Our computed probability that →x came from PA has several nines in it. Wow! That’s pretty confident!

## The Univariate Fallacy

(A standalone math post that I want to be able to link back to later/elsewhere)There’s this statistical phenomenon where it’s possible for two multivariate distributions to overlap along any one variable, but be cleanly separable when you look at the entire configuration space at once. This is perhaps easiest to see with an illustrative diagram—

The denial of this possibility (in arguments of the form, “the distributions overlap along this variable, therefore you can’t say that they’re different”) is sometimes called the “univariate fallacy.” (Eliezer Yudkowsky proposes “covariance denial fallacy” or “cluster erasure fallacy” as potential alternative names.)

Let’s make this more concrete by making up an example with actual numbers instead of just a pretty diagram. Imagine we have some datapoints that live in the forty-dimensional space {1, 2, 3, 4}⁴⁰ that are sampled from one of two probability distibutions, which we’ll call PA and PB.

For simplicity, let’s suppose that the individual variables

x₁,x₂, …x₄₀—the coördinates of a point in our forty-dimensional space—are statistically independent and identically distributed. For every individual xi, the marginal distribution of PA is—PA(xi)=⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩1/4xi=17/16xi=21/4xi=31/16xi=4

And for PB—

PB(xi)=⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩1/16xi=11/4xi=27/16xi=31/4xi=4

If you look at any one xi-coördinate for a point, you can’t be confident which distribution the point was sampled from. For example, seeing that

x₁takes the value 2 gives you a^{7}⁄_{4}(= 1.75) likelihood ratio in favor of that the point having been sampled from PA rather than PB, which is log₂(7/4) ≈ 0.807 bits of evidence.That’s … not a whole lot of evidence. If you guessed that the datapoint came from PA based on that much evidence, you’d be wrong about 4 times out of 10. (Given equal (1:1) prior odds, an odds ratio of 7:4 amounts to a probability of (7/4)/(1 +

^{7}⁄_{4}) ≈ 0.636.)And yet if we look at

manyvariables, we can achievesupreme, godlikeconfidence about which distribution a point was sampled from.Provingthis is left as an exercise to the particularly intrepid reader, but a concretedemonstrationis probably simpler and should be pretty convincing! Let’s write some Python code to sample a point →x ∈ {1, 2, 3, 4}⁴⁰ from PA—Go ahead and run the code yourself. (With an online REPL if you don’t have Python installed locally.) You’ll

probablyget a value of`x`

that “looks something like”If someone off the street just handed you this →x without telling you whether she got it from PA or PB, how would you compute the probability that it came from PA?

Well, because the coördinates/variables are statistically independent, you can just tally up (multiply) the individual likelihood ratios from each variable. That’s only a little bit more code—

If you run that code, you’ll

probablysee “something like” this—Our computed probability that →x came from PA has several nines in it. Wow! That’s pretty confident!

Thanks for reading!