# The Univariate Fallacy

(A standalone math post that I want to be able to link back to later/elsewhere)

There’s this statistical phenomenon where it’s possible for two multivariate distributions to overlap along any one variable, but be cleanly separable when you look at the entire configuration space at once. This is perhaps easiest to see with an illustrative diagram—
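A minimal numerical sketch of the same phenomenon (cluster means and covariance invented for illustration; this is not the post's actual figure): two correlated bivariate normal clusters whose marginals overlap, but which the combination x₁ − x₂ separates cleanly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly-correlated 2-D Gaussian clusters (parameters invented for
# illustration): each single coordinate overlaps between clusters, but a
# combination of coordinates separates them.
cov = [[1.0, 0.95], [0.95, 1.0]]
red = rng.multivariate_normal([0.0, 0.0], cov, size=1000)
blue = rng.multivariate_normal([2.0, -2.0], cov, size=1000)

# Along the first coordinate alone, the clusters overlap...
overlap_x1 = red[:, 0].max() > blue[:, 0].min()

# ...but x1 - x2 has within-cluster standard deviation sqrt(0.1) ≈ 0.32,
# while the cluster means sit at 0 and 4 on that axis, so the clusters
# come apart cleanly.
separated = (red[:, 0] - red[:, 1]).max() < (blue[:, 0] - blue[:, 1]).min()
print(overlap_x1, separated)
```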

The denial of this possibility (in arguments of the form, “the distributions overlap along this variable, therefore you can’t say that they’re different”) is sometimes called the “univariate fallacy.” (Eliezer Yudkowsky proposes “covariance denial fallacy” or “cluster erasure fallacy” as potential alternative names.)

Let’s make this more concrete by making up an example with actual numbers instead of just a pretty diagram. Imagine we have some datapoints that live in the forty-dimensional space {1, 2, 3, 4}⁴⁰ that are sampled from one of two probability distributions, which we’ll call A and B.

For simplicity, let’s suppose that the individual variables x₁, x₂, … x₄₀—the coördinates of a point in our forty-dimensional space—are statistically independent and identically distributed. For every individual i, the marginal distribution of xᵢ under A is—

P(xᵢ = 1) = 1/4, P(xᵢ = 2) = 7/16, P(xᵢ = 3) = 1/4, P(xᵢ = 4) = 1/16

And for B—

P(xᵢ = 1) = 1/16, P(xᵢ = 2) = 1/4, P(xᵢ = 3) = 7/16, P(xᵢ = 4) = 1/4

If you look at any one x-coördinate for a point, you can’t be confident which distribution the point was sampled from. For example, seeing that x₁ takes the value 2 gives you a 7/4 (= 1.75) likelihood ratio in favor of the point having been sampled from A rather than B, which is log₂(7/4) ≈ 0.807 bits of evidence.

That’s … not a whole lot of evidence. If you guessed that the datapoint came from A based on that much evidence, you’d be wrong about 4 times out of 10. (Given equal (1:1) prior odds, an odds ratio of 7:4 amounts to a probability of (7/4)/(1 + 7/4) ≈ 0.636.)
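The arithmetic in the two paragraphs above can be double-checked in a few lines of Python:

```python
import math

# Posterior probability from a 7:4 likelihood ratio, given 1:1 prior odds.
odds = 7 / 4
probability = odds / (1 + odds)

# The same evidence, measured in bits.
bits = math.log2(odds)

print(round(probability, 3), round(bits, 3))  # 0.636 0.807
```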

And yet if we look at many variables, we can achieve supreme, godlike confidence about which distribution a point was sampled from. Proving this is left as an exercise for the particularly intrepid reader, but a concrete demonstration is probably simpler and should be pretty convincing! Let’s write some Python code to sample a point x ∈ {1, 2, 3, 4}⁴⁰ from A—

```python
import random

def a():
    return random.sample(
        [1]*4 +  # 1/4
        [2]*7 +  # 7/16
        [3]*4 +  # 1/4
        [4],     # 1/16
        1
    )[0]

x = [a() for _ in range(40)]
print(x)
```


Go ahead and run the code yourself. (With an online REPL if you don’t have Python installed locally.) You’ll probably get a value of x that “looks something like”

```
[2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 4, 4, 2, 2, 3, 3, 1, 2, 2, 2, 4, 2, 2, 1, 2, 1, 4, 3, 3, 2, 1, 1, 3, 3, 2, 2, 3, 3, 4]
```


If someone off the street just handed you this without telling you whether she got it from A or B, how would you compute the probability that it came from A?

Well, because the coördinates/variables are statistically independent, you can just tally up (multiply) the individual likelihood ratios from each variable. That’s only a little bit more code—

```python
import logging

logging.basicConfig(level=logging.INFO)

def odds_to_probability(o):
    return o/(1+o)

def tally_likelihoods(x, p_a, p_b):
    total_odds = 1
    for i, x_i in enumerate(x, start=1):
        lr = p_a[x_i-1]/p_b[x_i-1]  # (-1s because of zero-based array indexing)
        logging.info("x_%s = %s, likelihood ratio is %s", i, x_i, lr)
        total_odds *= lr
    return total_odds

print(
    odds_to_probability(
        tally_likelihoods(
            x,
            [1/4, 7/16, 1/4, 1/16],
            [1/16, 1/4, 7/16, 1/4]
        )
    )
)
```


If you run that code, you’ll probably see “something like” this—

```
INFO:root:x_1 = 2, likelihood ratio is 1.75
INFO:root:x_2 = 1, likelihood ratio is 4.0
INFO:root:x_3 = 2, likelihood ratio is 1.75
INFO:root:x_4 = 2, likelihood ratio is 1.75
INFO:root:x_5 = 1, likelihood ratio is 4.0
[blah blah, redacting some lines to save vertical space in the blog post, blah blah]
INFO:root:x_37 = 2, likelihood ratio is 1.75
INFO:root:x_38 = 3, likelihood ratio is 0.5714285714285714
INFO:root:x_39 = 3, likelihood ratio is 0.5714285714285714
INFO:root:x_40 = 4, likelihood ratio is 0.25
0.9999936561215961
```


Our computed probability that x came from A has several nines in it. Wow! That’s pretty confident!

• Good post. Some feedback:

• I think you can replace the first instance of “are statistically independent” with “are statistically independent and identically distributed” & improve clarity.

• IMO, your argument needs work if you want it to be more than an intuition pump. If the question is the existence or nonexistence of particular clusters, you are essentially assuming what you need to prove in this post. Plus, the existence or nonexistence of clusters is a “choice of ontology” question which doesn’t necessarily have a single correct answer.

• You’re also fuzzing things by talking about discrete distributions here, then linking to Eliezer’s discussion of continuous latent variables (“intelligence”) without noting the difference. And: If a number of characteristics have been observed to co-vary, this isn’t sufficient evidence for any particular causal mechanism. Correlation isn’t causation. As I pointed out in this essay, it’s possible there’s some latent factor like the ease of obtaining calories in an organism’s environment which explains interspecies intelligence differences but doesn’t say anything about the “intelligence” of software.

• replace the first instance of “are statistically independent” with “are statistically independent and identically distributed”

Done, thanks!

talking about discrete distributions here, then linking to Eliezer’s discussion of continuous latent variables (“intelligence”) without noting the difference

The difference doesn’t seem relevant to the narrow point I’m trying to make? I was originally going to use multivariate normal distributions with different means, but then decided to just make up “peaked” discrete distributions in order to keep the arithmetic simple.

• I agree with your other two points (mostly—I don’t feel that the distinction between discrete and continuous variables is important for Zack’s argument, so it seems fine to gloss over it) but I disagree with the first.

In order to be able to simply multiply likelihood ratios, the sufficient fact is that they’re statistically independent. In this toy model, they also happen to be identically distributed, but I think it’s clear from context that Zack would like to apply his argument to a variety of situations where the different dimensions have different distributions. You’re suggesting replacing “X, therefore Z” with “X and Y, therefore Z”, when in fact X->Z, and it is not the case that Y->Z.
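For what it's worth, that sufficiency claim is easy to illustrate with a tiny made-up example (not from the post) in which the two coordinates are independent but have different distributions; the joint likelihood ratio still equals the product of the per-coordinate ratios:

```python
# Hypothetical two-coordinate example: the coordinates are independent but
# *not* identically distributed. Coordinate 1 takes values in {1, 2},
# coordinate 2 in {1, 2, 3}.
p_a = [[0.8, 0.2], [0.5, 0.3, 0.2]]  # marginals under hypothesis A
p_b = [[0.3, 0.7], [0.2, 0.3, 0.5]]  # marginals under hypothesis B

def joint_likelihood(p, point):
    # By independence, the joint probability is the product of marginals.
    result = 1.0
    for coord_dist, value in zip(p, point):
        result *= coord_dist[value - 1]
    return result

point = (1, 3)
joint_lr = joint_likelihood(p_a, point) / joint_likelihood(p_b, point)
product_of_lrs = (p_a[0][0] / p_b[0][0]) * (p_a[1][2] / p_b[1][2])
print(joint_lr, product_of_lrs)  # the two agree
```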

• Hi Zack,

Can you clarify something? In the picture you draw, there is a codimension-1 linear subspace separating the parameter space into two halves, with all red points to one side, and all blue points to the other. Projecting onto any 1-dimensional subspace orthogonal to this (there is a unique one through the origin) will thus yield a ‘variable’ which cleanly separates the two points into the red and blue categories. So in the illustrated example, it looks just like a problem of bad coordinate choice.

On the other hand, one can easily have much more pathological situations; for example, the red points could all lie inside a certain sphere, and the blue points outside it. Then no choice of linear coordinates will illustrate this, and one has to use more advanced analysis techniques to pick up on it (e.g. persistent homology).
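A quick numerical sketch of that spherical situation (radii invented for illustration): no single linear coordinate separates the classes, but the nonlinear “radius” variable does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Red points uniformly directed inside a sphere of radius 0.9; blue points
# in a shell between radii 1.1 and 2.0 (made-up radii for illustration).
red = rng.normal(size=(500, 3))
red /= np.linalg.norm(red, axis=1, keepdims=True)
red *= rng.uniform(0.0, 0.9, size=(500, 1))

blue = rng.normal(size=(500, 3))
blue /= np.linalg.norm(blue, axis=1, keepdims=True)
blue *= rng.uniform(1.1, 2.0, size=(500, 1))

# Every linear coordinate overlaps, since blue surrounds red in all
# directions...
overlap_x = red[:, 0].max() > blue[:, 0].min()

# ...but the (nonlinear) radius separates the two sets cleanly.
separated = np.linalg.norm(red, axis=1).max() < np.linalg.norm(blue, axis=1).min()
print(overlap_x, separated)
```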

So, to my vague question: do you have only the first situation in mind, or are you also considering the general case, but made the illustrated example extra-simple?

Perhaps this is clarified by your numerical example; I’m afraid I’ve not checked.

Projecting onto any 1-dimensional subspace orthogonal to this (there is a unique one through the origin) will thus yield a ‘variable’ which cleanly separates the two points into the red and blue categories. So in the illustrated example, it looks just like a problem of bad coordinate choice.

Thanks, this is a really important point! Indeed, for freely-reparametrizable abstract points in an abstract vector space, this is just a bad choice of coordinates. The reason this objection doesn’t make the post completely useless is that for some applications (you know, if you’re one of those weird people who cares about “applications”), we do want to regard some bases as more “fundamental”, if the variables represent real-world measurements.

For example, you might be able to successfully classify two different species of flower using both “stem length” and “petal color” measurements, even if the distributions overlap for either stem length or petal color considered individually. Mathematically, we could view the distributions as not overlapping with respect to some variable that corresponds to some weighted function of stem length and petal color, but that variable seems “artificial”, less “interpretable.”
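One way to make that “weighted function” concrete is Fisher’s linear discriminant. The sketch below uses invented “stem length” and “petal color” measurements (not real flower data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two flower species with correlated, marginally-overlapping measurements
# (means and covariance invented for illustration).
cov = [[1.0, 0.9], [0.9, 1.0]]
species_1 = rng.multivariate_normal([5.0, 3.0], cov, size=300)
species_2 = rng.multivariate_normal([7.0, 1.0], cov, size=300)

# Fisher's linear discriminant: weights proportional to the pooled
# within-class covariance inverse times the difference of class means.
mean_diff = species_2.mean(axis=0) - species_1.mean(axis=0)
pooled_cov = (np.cov(species_1.T) + np.cov(species_2.T)) / 2
weights = np.linalg.solve(pooled_cov, mean_diff)

# Projecting onto the discriminant direction separates the species.
proj_1 = species_1 @ weights
proj_2 = species_2 @ weights
separated = proj_1.max() < proj_2.min()
print(separated)
```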

• Another way to succinctly say this is that two distributions may be cleanly separable via a single immeasurable variable, but overlap when measured on any given measurable variable, such that a representation of the separation achieved by a single immeasurable variable is only achievable through multiple measurable variables.

• Thanks for the reply, Zack.

The reason this objection doesn’t make the post completely useless...

Sorry, I hope I didn’t suggest I thought that! You make a good point about some variables being more natural in given applications. I think it’s good to keep in mind that sometimes it’s just a matter of coordinate choice, and other times the points may be separated but not in a linear way.

• Sorry, I hope I didn’t suggest I thought that!

I mean, it doesn’t matter whether you think it, right? It matters whether it’s true. Like, if I were to write a completely useless blog post on account of failing to understand the concept of a change of basis, then someone should tell me, because that would be helping me stop being deceived about the quality of my blogging.

• FYI, one of the symbols in this post is not rendering properly. It appears to be U+20D7 COMBINING RIGHT ARROW ABOVE (appearing right after the ‘x’ characters) but, at least on this machine (Mac OS 10.11.6, Chrome 74.0.3729.131), it renders as a box:

It is probably a good idea to use LaTeX to encode such symbols.

UPDATE: It does work properly in Firefox 67.0.2 (on the same machine):

• Thanks for the bug report; I edited the post to use LaTeX `\vec{x}`. (The combining arrow worked for me on Firefox 67.0.1 and was kind-of-ugly-but-definitely-renders on Chromium 74.0.3729.169, on Xubuntu 16.04)

It is probably a good idea to use LaTeX to encode such symbols.

I’ve been doing this thing where I prefer to use “plain” Unicode where possible (where, e.g., the subscript in “x₁” is U+2081 SUBSCRIPT ONE) and only resort to “fancy” (and therefore suspicious) LaTeX when I really need it, but the reported Chrome-on-macOS behavior does slightly alter my perception of “really need it.”

• I’ve been doing this thing where I prefer to use “plain” Unicode where possible

I entirely sympathize with this preference!

Unfortunately, proper rendering of Unicode depends on the availability of the requisite characters in the fallback fonts available in a user’s OS/client combination (which vary unpredictably). This means that the more exotic code points cannot be relied on to properly render with acceptable consistency.

Now, that having been said, and availability and proper rendering aside, I cannot endorse your use of such code points as U+2081 SUBSCRIPT ONE. Such typographic features as subscripts ought properly to be encoded via OpenType metadata[1], not via Unicode (and indeed I consider the existence of these code points to be a necessary evil at best, and possibly just a bad idea). In the case where OpenType metadata editing[2] is not available, the proper approach is either LaTeX, or “low-tech” approximations such as brackets.

1. Which, in turn, ought to be generated programmatically from, e.g., HTML markup (or even higher-level markup languages like Markdown or wiki markup), rather than inserted manually. This is because the output generation code must be able to decide whether to use OpenType metadata or whether to instead use lower-level approaches like the HTML+CSS layout system, etc., depending on the capabilities of the output medium in any given case. ↩︎

2. That is, the editing of the requisite markup that will generate the proper OpenType metadata; see previous footnote. ↩︎

• (I wonder how a “-1” ended up in the canonical URL slug (/cu7YY7WdgJBs3DpmJ/the-univariate-fallacy-1)? Did someone else have a draft of the same name, and the system wants unique slugs??)

• Perhaps it anticipates that you will write a sequel.