Probability as Minimal Map

What’s the difference between a map and the raw data used to generate that map?

Suppose, for example, that Google sends a bunch of cars out driving around New York City, snapping photos every few feet, then uses the photos to reconstruct a street map. In an informational sense, what’s the difference between the giant database full of photos and the street map?

Obvious answer: the map throws away a huge amount of information which we don’t care about, at least for the map’s use-case.

Let’s try to formalize this intuition a little. We want the ability to run queries against the map—e.g. things like “How far apart are Times Square and the Met?” or questions about street network topology. Let’s call the whole class of queries we want to support Q. We also have some input data x—e.g. the database full of images from the streets of NYC. The goal of the map-synthesizing process is to throw out as much information as possible from x, while still keeping any and all information relevant to queries in Q.

Natural next question: given the query class Q and some data-generating process, can we characterize “optimal” map-making processes which maximize the amount of information “thrown out” (measured in an information-theoretic sense) while still keeping all information relevant to Q?

For the very simplest query classes, we can answer this question quite neatly.

Suppose that our query class contains only a single query q, and the query is a true/false question. For instance, maybe the only query we care about at all is q = “Is the distance between Times Square and the Met greater than one mile?”, and we want to throw out as much information as possible without losing any info relevant to the query.

Claim: any solution to this problem is isomorphic to the probability P[q|x]. In other words, P[q|x] is itself a minimal “lossless” map (where by “lossless” we mean it does not throw out any info from x relevant to the query, not that it perfectly predicts q), and any other minimal “lossless” map is a reversible transformation of P[q|x].

Let’s prove it.

We’ll formalize our claim in terms of mutual information I(·;·). We have two pieces:

  1. P[q|x] throws away no information relevant to q: I(q;x) = I(q;P[q|x]).
  2. For any function f, if f(x) throws away no information relevant to q, then P[q|x] can be computed from f(x): I(q;x) = I(q;f(x)) implies P[q|x] = g(f(x)) for some function g.

One important note: our notation P[q|x] is really shorthand for P[q|X=x], so P[q|x] is not a function of the “outcome” of q - it’s only a function of x. Also note that the second condition implies that the entropy of f(x) is at least as high as the entropy of P[q|x], so P[q|x] is indeed minimal in terms of information content.

In the next two sections, we’ll separately prove our two pieces.

Probability-Computing Circuits

We want to show that P[q|x] throws away no information relevant to q: I(q;x) = I(q;P[q|x]).

We’ll start with a short side-quest to prove a lemma: P[q | P[q|x]] = P[q|x]. Here’s an interpretation: suppose you work late trying to calculate the probability of q based on some giant pile of data x. You finally compute the result: P[q|X=x] = p̄. You go to bed feeling victorious, but wake up to find that your computer has died and your giant pile of data is lost. However, you still remember the final result - even though you don’t know x, you do know that P[q|X=x] = p̄. Based on this information, what probability should you assign to q? Presumably the answer should be p̄.

We can directly prove this in a few lines:

P[q | P[q|X] = p̄] = (Σ_x I[P[q|X=x] = p̄] P[q|X=x] P[X=x]) / (Σ_x I[P[q|X=x] = p̄] P[X=x])

Note that the indicator function I[P[q|X=x] = p̄] in the main sum is zero whenever P[q|X=x] is not p̄, so we can replace P[q|X=x] with p̄ in that sum:

P[q | P[q|X] = p̄] = p̄ (Σ_x I[P[q|X=x] = p̄] P[X=x]) / (Σ_x I[P[q|X=x] = p̄] P[X=x]) = p̄

We can also write this in the more compact form P[q | P[q|x]] = P[q|x], which we’ll use later.
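To make the lemma concrete, here’s a small numerical check. The distribution is made up for illustration (it’s not from the argument above): X is uniform on four values, with conditional probabilities chosen so that two x-values share the same P[q|X=x]. Conditioning on the value of P[q|X] then reproduces that value exactly:

```python
from fractions import Fraction

# Hypothetical toy distribution: X uniform on {0,1,2,3}; the
# conditional probabilities are chosen so x=0 and x=1 collide.
P_x = {x: Fraction(1, 4) for x in range(4)}
P_q_given_x = {0: Fraction(1, 3), 1: Fraction(1, 3),
               2: Fraction(2, 3), 3: Fraction(1, 2)}

def P_q_given_level(p):
    """P[q | P[q|X] = p]: average P[q|X=x] over x consistent with p."""
    num = sum(P_q_given_x[x] * P_x[x] for x in P_x if P_q_given_x[x] == p)
    den = sum(P_x[x] for x in P_x if P_q_given_x[x] == p)
    return num / den

# The lemma: conditioning on the computed probability gives it back.
for p in set(P_q_given_x.values()):
    assert P_q_given_level(p) == p
```

Exact rational arithmetic (via `fractions`) makes the equality check literal rather than approximate.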

To build a bit more intuition, let’s expand on this result a bit. Imagine a circuit (aka deterministic causal diagram or computational DAG) which takes in data x and outputs P[q|x]. We’ll think about some arbitrary cross-section of that circuit—some set of internal nodes C = (C_1, …, C_n) which cut the circuit in two, so that C mediates the interaction between x and P[q|x]. We can write P[q|x] = F(C(x)) for some function F.

Claim: for any cross-section C, P[q|C(x)] = P[q|x]. Intuitively, we can imagine that we’re partway through calculating P[q|x], we’ve calculated all the C_i’s but haven’t finished the full computation yet, when suddenly our computer dies and we lose all the original data x. As long as we still have the C_i’s around, we’re good—we just shrug and continue the calculation, since the C_i’s were all we needed to finish up anyway.

I won’t write out the proof for this one, but it works exactly like the previous proof.
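As a sanity check, here’s a toy circuit (my own example, not from the text): the data is two fair bits, and the circuit computes P[q|x] through the intermediate node s = x1 + x2, so s is a cross-section mediating between x and the output. Conditioning on the cross-section alone gives the same answer as conditioning on the full data:

```python
from fractions import Fraction
from itertools import product

# Hypothetical circuit: P[q|x] depends on x = (x1, x2) only through
# the intermediate node s = x1 + x2 (a cross-section of the circuit).
P_q_given_s = {0: Fraction(1, 10), 1: Fraction(1, 2), 2: Fraction(9, 10)}

P_x = {x: Fraction(1, 4) for x in product((0, 1), repeat=2)}
P_q_given_x = {x: P_q_given_s[sum(x)] for x in P_x}

def P_q_given_cross_section(s):
    """P[q | C(x) = s], averaging over x-values with that cross-section value."""
    num = sum(P_q_given_x[x] * P_x[x] for x in P_x if sum(x) == s)
    den = sum(P_x[x] for x in P_x if sum(x) == s)
    return num / den

# The cross-section theorem: P[q|C(x)] = P[q|x] for every x.
for x in P_x:
    assert P_q_given_cross_section(sum(x)) == P_q_given_x[x]
```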

Notice that our cross-section theorem has two interesting corner cases: all of x is a cross-section, and P[q|x] by itself is a cross-section. The former yields the trivial statement P[q|x] = P[q|x], while the latter yields our lemma from earlier: P[q | P[q|x]] = P[q|x].

Armed with our lemma, let’s return to the main problem: show that P[q|x] throws away no information relevant to q, formalized as I(q;x) = I(q;P[q|x]).

We’ll start by applying an identity from information theory:

I(q;x) = H(q) − H(q|x)

… where H is the Shannon entropy. So to show that I(q;x) = I(q;P[q|x]), it suffices to show H(q|x) = H(q|P[q|x]). Our lemma makes this trivial. Since q is binary:

H(q|P[q|x]) = E[−P[q|P[q|x]] log P[q|P[q|x]] − (1−P[q|P[q|x]]) log(1−P[q|P[q|x]])]
            = E[−P[q|x] log P[q|x] − (1−P[q|x]) log(1−P[q|x])]   (by the lemma)
            = H(q|x)

… so P[q|x] contains all the information in x relevant to q.
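Continuing the same made-up toy distribution from before (X uniform on four values, two of which share a conditional probability), we can check the entropy equality numerically: grouping x-values by the value of P[q|x] leaves the conditional entropy of q unchanged, hence the mutual information too.

```python
from collections import defaultdict
from fractions import Fraction
from math import log2

# Same hypothetical toy distribution as before.
P_x = {x: Fraction(1, 4) for x in range(4)}
P_q_given_x = {0: Fraction(1, 3), 1: Fraction(1, 3),
               2: Fraction(2, 3), 3: Fraction(1, 2)}

def H_q_given(label):
    """H(q | label(x)) for any labeling of x-values (i.e. any function of x)."""
    mass = defaultdict(Fraction)    # P[label]
    q_mass = defaultdict(Fraction)  # P[q, label]
    for x, px in P_x.items():
        mass[label[x]] += px
        q_mass[label[x]] += P_q_given_x[x] * px
    h = 0.0
    for lab, pl in mass.items():
        p = q_mass[lab] / pl        # P[q | label]
        for pr in (p, 1 - p):       # binary q: sum over q true/false
            if pr > 0:
                h -= float(pl) * float(pr) * log2(float(pr))
    return h

# Conditioning on x itself vs. on P[q|x] gives the same conditional
# entropy, so I(q;x) = I(q;P[q|x]).
H_given_x = H_q_given({x: x for x in P_x})
H_given_map = H_q_given(P_q_given_x)
assert abs(H_given_x - H_given_map) < 1e-12
```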

The Data Processing Inequality

Next, we’d like to show that P[q|x] is a minimal representation of the information in x relevant to q. Formally, for any function f, if f(x) throws away no information relevant to q, then P[q|x] is a function of f(x): I(q;x) = I(q;f(x)) implies P[q|x] = g(f(x)) for some function g.

In particular, we’ll show that P[q|x] = P[q|f(x)], so we can choose g(f(x)) = P[q|f(x)]. Intuitively, this says that if f(x) throws out no information from x relevant to q, then it yields the same probability for q.

To prove this, we’ll use a theorem from information theory called the data processing inequality (DPI). Intuitively, the DPI says that if we start with the data x, we cannot gain any new information about q just by processing x. Formally, for any function f:

I(q;f(x)) ≤ I(q;x)

… with equality iff q and x are independent conditional on f(x) (so that x tells us nothing else about q once we know f(x)). The proof and a bit more discussion can be found in Cover & Thomas’ textbook on page 34.

We’re interested in the equality case I(q;x) = I(q;f(x)), so the DPI tells us that q and x are independent conditioned on f(x):

P[q, X=x | f(x)] = P[q|f(x)] P[X=x|f(x)]

But we can always just expand P[q, X=x | f(x)] via the chain rule instead:

P[q, X=x | f(x)] = P[q | X=x, f(x)] P[X=x|f(x)] = P[q|x] P[X=x|f(x)]

… where the last step uses the fact that f(x) is fully determined by x.

Equate these two expressions for P[q, X=x | f(x)], and we find

P[q|x] = P[q|f(x)]

… which is what we wanted.
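Here’s the equality case in miniature, using the same made-up toy distribution as earlier: a function f which merges only the x-values that already share the same P[q|x] loses no information about q, and indeed P[q|x] = P[q|f(x)] everywhere.

```python
from fractions import Fraction

# Same hypothetical toy distribution as in the earlier snippets.
P_x = {x: Fraction(1, 4) for x in range(4)}
P_q_given_x = {0: Fraction(1, 3), 1: Fraction(1, 3),
               2: Fraction(2, 3), 3: Fraction(1, 2)}

# f merges only x=0 and x=1, which share the same P[q|x], so it
# throws away no information relevant to q.
f = {0: 'a', 1: 'a', 2: 'b', 3: 'c'}

def P_q_given_f(label):
    """P[q | f(x) = label]."""
    num = sum(P_q_given_x[x] * P_x[x] for x in P_x if f[x] == label)
    den = sum(P_x[x] for x in P_x if f[x] == label)
    return num / den

# The DPI equality case: P[q|x] = P[q|f(x)] for every x.
for x in P_x:
    assert P_q_given_x[x] == P_q_given_f(f[x])
```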

Why Is This Interesting?

We’ve shown that the probability P[q|x] summarizes all the information in x relevant to q, and throws out as much irrelevant information as possible. Why does this matter?

First, hopefully this provides some intuition for interpreting a probability P[q|x] as a representation of the information in x relevant to q. In short: probabilities directly represent information. This interpretation makes sense even in the absence of “agents” with “beliefs”, or “independent experiments” repeated infinitely many times. It directly talks about maps matching territories, and the role probability plays, without invoking any of the machinery of frequentist or subjectivist interpretations. That means we can potentially apply it in a broader variety of situations—we can talk about simple mechanical processes which produce “maps” of the world, and the probabilistic calculations embedded in those processes.

Second, this is a natural first step in characterizing map-making processes. The data processing inequality also suggests a natural next step: sufficient statistics, which serve as minimal “lossless” maps when our data are independent samples from a distribution, and our queries can ask anything at all about the parameters of the distribution. From there, we could further generalize to approximate sufficient statistics—maps which throw out a small amount of information relevant to the queries, in cases where there is no “lossless” map smaller than x itself.

That said, this still only talks about half of the full map-making process: the data-processing half. There’s still the data-collection half to address—the process which generates x in the first place, and somehow ties it to the territory. And of course, when map-making processes are embedded in the real world, there might not be a clean separation between data-collection and data-processing, so we need to deal with that somehow.

Third, the interpretation of probabilities as a minimal map—or a representation of information—suggests possible avenues to relax traditional probability to deal with things which are time-like separated from us, e.g. for handling self-reference and advanced decision-theoretic problems. In particular, it suggests strategically throwing away information to deal with self-referential tomfoolery—but continuing to use probabilities to represent the reduced information remaining.