How to Throw Away Information

Probability as Minimal Map argued that the probability P[Y|X=x] is a minimal representation of the information in data X=x which is relevant to the query Y. In other words, P[Y|X=x] is a perfectly efficient map of some territory Y based on data X=x, suitable for the query Y. More generally, a full probability distribution P[Y|X=x] is a perfectly efficient map suitable for any queries about the random variable Y. Bayesian probability tells us how to update our map as we gain new information.

What if, for some reason, we wanted to instead throw away some particular information? We still want to represent our map using a probability distribution, but rather than adding information to it (via Bayes’ Rule), we want to remove some information.

Let’s start with an artificial but concrete example.

Coin-Sum Problem

I flip two coins, and observe both outcomes—either 0 or 1 for each. I want to keep as much information as possible, while throwing away all information from the observation relevant to their sum. How should I “update” my distribution over the outcomes?

We’ll write the outcome of our coinflip as B = (B_1, B_2) (“B” for “bit” or “binary” or “Bernoulli”), and our final information-removed distribution as P[B / (B_1+B_2)] - so the notation indicates that we should throw out all info about the sum (the “/” is meant to evoke e.g. a group quotient operation). Note that this kind of “remove information” operation does not follow the usual rules of Bayesian probability.

At first, I reason as follows:

  • My final distribution is a function of the outcomes I saw, so my overall strategy should specify four different distributions—one each for (0, 0), (1, 0), (0, 1), and (1, 1). E.g. if I see (0, 0), then I will update to the (0, 0) distribution.

  • The final distribution for (0, 0) and (1, 1) must be the same, otherwise we could distinguish a sum of 0 from a sum of 2 by looking at which final distribution was chosen.

  • … but then the final distributions for (0, 1) and (1, 0) must also be the same as (0, 0), else we could tell that the sum is 1 whenever we see some other distribution.

… so in order to throw out all information about the sum, we have to throw out all the information from the observations. We effectively don’t update at all, and just keep our prior.

This seems kind of weird from an information-theoretic standpoint. We have 2 bits of information from the observation, and the sum only contains 1.5 bits. It seems like we ought to be able to keep around 0.5 bits of information, somehow.
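For concreteness, with two fair coins the sum is 0 or 2 with probability 1/4 each and 1 with probability 1/2, so H(B_1+B_2) = (1/4)·log2(4) + (1/2)·log2(2) + (1/4)·log2(4) = 1.5 bits. At most 2 - 1.5 = 0.5 bits of the observation can be kept without revealing anything about the sum.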

The trick turns out to be right at the start of the above reasoning: “my final distribution is a function of the outcomes I saw”. This quietly assumed that our “update” had to be deterministic. We can do better if we allow randomized updates—e.g. if I see (0, 0), then I randomly choose one of several distributions to update to.

Here’s an example of a randomized strategy:

  • (0, 1) and (1, 0) have the same sum, so I’d like to be able to distinguish them with certainty. I’ll make the (0, 1) and (1, 0) final distributions each deterministic, and assume they’re not equal.

  • In order to be unable to distinguish between sum 0 and sum 1, when the outcome is (0, 0) I’m forced to randomly choose between the (0, 1)-distribution and the (1, 0)-distribution, 50% chance of each. So, the final distribution for (0, 0) is: 50% chance the (0, 1)-distribution, 50% chance the (1, 0)-distribution.

  • Same reasoning applies to (1, 1) as to (0, 0).

  • In order for the final distributions to accurately represent our final information, they must be (1/2 on (0, 1), 1/4 on (0, 0), 1/4 on (1, 1), 0 on (1, 0)) for the (0, 1)-distribution, and (1/2 on (1, 0), 1/4 on (0, 0), 1/4 on (1, 1), 0 on (0, 1)) for the (1, 0)-distribution.
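(To spell out where those numbers come from: starting from the uniform 1/4 prior over the four outcomes, Bayes on the event “the (0, 1)-distribution was chosen”, which has probability 1 given (0, 1), 0 given (1, 0), and 1/2 given (0, 0) or (1, 1), gives posterior weights proportional to (1, 0, 1/2, 1/2)·1/4 over ((0, 1), (1, 0), (0, 0), (1, 1)), i.e. (1/2, 0, 1/4, 1/4) after normalizing. The (1, 0)-distribution follows by symmetry.)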

Half the time (when the sum is 1) this strategy conveys 1 bit of information (whether the first or second coin is the 1), and the other half of the time it conveys none, so overall we keep 1/2 bit—exactly the information-theoretic limit.

Generalizing

The general form of the problem:

  • We have some random variables X and Y. We observe X, and want to throw out all information about Y.

  • To do that, we construct some new variable S (which may depend on X, Y, and some randomness), which contains as much information as possible about X but no information about Y.

  • We then throw away X and keep S, effectively “updating” our probabilities from P[X] to P[X|S].

In the coin example above, we can encode S as: 0 when B is (0,1), 1 when B is (1,0), otherwise a random 50/50 choice between 0 and 1. Calculating P[B|S] then yields the distributions from the previous section. In this case, we can interpret S as the value of coin 1 assuming the sum was 1 - otherwise it’s random.
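To make this concrete, here is a minimal Python sketch (my own, not from the original presentation) that enumerates the joint distribution of B and the encoding S just described, recovers the two final distributions P[B|S], and checks the information accounting:

```python
from fractions import Fraction
from itertools import product
from math import log2

# Joint distribution over (B, S) for the coin-sum example:
# B = (B1, B2) uniform on {0,1}^2; S = B1 when B1+B2 == 1, otherwise a fair 50/50 coin.
joint = {}
for b in product((0, 1), repeat=2):
    if sum(b) == 1:
        joint[(b, b[0])] = Fraction(1, 4)   # S is deterministic when the sum is 1
    else:
        for s in (0, 1):
            joint[(b, s)] = Fraction(1, 8)  # 50/50 randomization otherwise

# Posterior P[B | S=s]: these are the two "information-removed" final distributions.
for s_val in (0, 1):
    norm = sum(p for (b, s), p in joint.items() if s == s_val)
    posterior = {b: p / norm for (b, s), p in joint.items() if s == s_val}
    print(f"P[B | S={s_val}] =", posterior)

def mutual_info(pair_dist):
    """I(U;V) in bits for a joint distribution given as {(u, v): prob}."""
    pu, pv = {}, {}
    for (u, v), p in pair_dist.items():
        pu[u] = pu.get(u, 0) + p
        pv[v] = pv.get(v, 0) + p
    return sum(float(p) * log2(float(p / (pu[u] * pv[v])))
               for (u, v), p in pair_dist.items() if p)

def pair_with_s(keyfn):
    """Joint distribution of (keyfn(b), S), marginalizing out the rest of B."""
    out = {}
    for (b, s), p in joint.items():
        key = (keyfn(b), s)
        out[key] = out.get(key, 0) + p
    return out

print("I(B;S)     =", mutual_info(pair_with_s(lambda b: b)))       # 0.5 bits kept
print("I(B1+B2;S) =", mutual_info(pair_with_s(lambda b: sum(b))))  # 0.0 bits about the sum
```

Running it prints the two distributions from the previous section, along with I(B;S) = 0.5 bits and I(B_1+B_2;S) = 0 bits: we keep half a bit about the outcome and nothing about the sum.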

Two obvious questions are existence and uniqueness:

  • Is it always possible to find some S which achieves the information-theoretic bound, i.e. I(X;S) = H(X) - I(X;Y) (where I is mutual information and H is entropy)?

  • Is S unique?

Uniqueness we can answer easily: no, the “optimal” reduced information S is not unique. Counterexample: flip two coins, and throw out all info about whether the sum is odd or even. We have 2 bits of info total, we have to throw out 1 bit of info, and we can keep around 1 bit in (at least) two different ways: just keep the outcome of the first coin, or just keep the outcome of the second coin. Either of these, by itself, contains 1 bit of info and tells us nothing about whether the sum is odd or even.
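(Quick check, assuming fair independent coins: the first coin is independent of the parity, since P[B_1=1, B_1⊕B_2=1] = P[B_1=1, B_2=0] = 1/4 = P[B_1=1]·P[B_1⊕B_2=1], and similarly for the other combinations; so keeping B_1 gives I(B_1; B_1⊕B_2) = 0 while H(B_1) = 1 bit. The same goes for B_2 by symmetry.)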

Existence is trickier.

If we observe both X and Y, then we can definitely construct S, using cousin_it’s method from this comment: for each possible value of Y, S contains an X-value randomly sampled from P[X|Y] - except for the observed Y-value, which maps to the observed X-value. For instance, if we observed (1, 0) in our coin-sum problem, then one possible value of S would be {0: (0, 0), 1: (1, 0), 2: (1, 1)} - so S tells us that if the sum Y=1, then X = (1, 0). S tells us nothing at all about Y, but if we later re-learn the value of Y, then we can just look up the value of X in S. This implies that S achieves the information-theoretic bound: S and Y together are sufficient to reconstruct X.
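Here is a rough Python sketch of that construction (function and variable names are mine); it assumes we know the joint distribution P[X,Y] and that both X and Y were observed:

```python
import random
from itertools import product

def throw_away_info(x_obs, y_obs, joint_xy):
    """Build S from an observed (x, y) pair: a lookup table mapping each possible
    Y-value to an X-value sampled from P[X | Y], except that the observed Y-value
    maps to the observed X-value.  joint_xy is a dict {(x, y): prob}."""
    s = {}
    for y in {y for (_, y) in joint_xy}:
        if y == y_obs:
            s[y] = x_obs
        else:
            xs = [x for (x, y2) in joint_xy if y2 == y]
            weights = [joint_xy[(x, y)] for x in xs]
            s[y] = random.choices(xs, weights=weights)[0]  # sample from P[X | Y=y]
    return s

# Coin-sum example: X = (B1, B2) uniform, Y = B1 + B2, observation X = (1, 0), Y = 1.
joint = {((b1, b2), b1 + b2): 0.25 for b1, b2 in product((0, 1), repeat=2)}
print(throw_away_info((1, 0), 1, joint))  # {0: (0, 0), 1: (1, 0), 2: (1, 1)} (key order may vary)
```

Sampling each entry from P[X|Y=y] is what makes the table as a whole carry no information about which Y-value was actually observed.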

However, we may want to throw out whatever information we have about some Y, even when we don’t have complete information about Y. In other words, what we really want is to construct S without knowing Y - i.e. construct S from X alone. I still don’t know whether the information-theoretic bound is always achievable in this case, or what the optimal information content is if the bound isn’t achievable.

Why Is This Interesting?

A lot of tricky problems in decision theory feel like the agent should choose to throw away information. Chris Leong argues that this is a useful way to think about logical counterfactuals in general, and I’ve heard other people express similar intuitions. The approach in this post offers a way to formalize and quantify “forgetting”, in a form potentially useful for these kinds of applications.

Along similar lines: the sort of randomization used in game theory feels different from the sort of “randomness” involved in uncertainty. The former is utilitarian, and used to confuse external opponents. The latter is epistemic, and only relevant to internal beliefs. The randomization involved in throwing away information feels similar to game-theoretic randomness, yet it’s in the sort of informational framework usually used for reasoning about uncertainty—suggesting that these two types of randomness could be handled in a more unified conceptual framework.

For example: suppose agent 1 and agent 2 play a zero-sum game. Agent 2 chooses its action by running a copy of agent 1’s source code, then outputting whatever action beats the copy’s action. What should agent 1 do? One possible approach is for agent 1 to throw away any information about its copy’s action which is contained in its own action—so that the copy’s behavior yields no information about agent 1’s behavior. Randomization then naturally enters the picture, due to throwing away information. (This doesn’t seem like quite the right decision-making rule in general, but it feels like something similar could work—the main idea is to make “throw away information” an action in the game itself, and push game-theoretic randomization of choices into the info-throw-away action.) One could imagine re-deriving Nash equilibria via agents throwing away information as an action, without any other source of randomness. Whether that can actually work is another open problem.