Public Static: What is Abstraction?

Author’s Note: Most of the posts in this sequence are essentially a log of work-in-progress. This post is intended as a more presentable (“public”) and higher-confidence (“static”) write-up of some formalizations of abstraction. Much of the material has appeared in other posts; the first two sections in particular are drawn almost verbatim from the opening “What is Abstraction?” post.

Let’s start with a few examples (borrowed from here) to illustrate what we’re talking about:

  • We have a gas consisting of some huge number of particles. We throw away information about the particles themselves, instead keeping just a few summary statistics: average energy, number of particles, etc. We can then make highly precise predictions about things like e.g. pressure just based on the reduced information we’ve kept, without having to think about each individual particle. That reduced information is the “abstract layer”—the gas and its properties.

  • We have a bunch of transistors and wires on a chip. We arrange them to perform some logical operation, like maybe a NAND gate. Then, we throw away information about the underlying details, and just treat it as an abstract logical NAND gate. Using just the abstract layer, we can make predictions about what outputs will result from what inputs. Note that there’s some fuzziness − 0.01 V and 0.02 V are both treated as logical zero, and in rare cases there will be enough noise in the wires to get an incorrect output.

  • I tell my friend that I’m going to play tennis. I have ignored a huge amount of information about the details of the activity—where, when, what racket, what ball, with whom, all the distributions of every microscopic particle involved—yet my friend can still make some reliable predictions based on the abstract information I’ve provided.

  • When we abstract formulas like “1+1=2*1” and “2+2=2*2” into “n+n=2*n”, we’re obviously throwing out information about the value of n, while still making whatever predictions we can given the information we kept. This is what abstraction is all about in math and programming: throw out as much information as you can, while still maintaining the core “prediction”—i.e. the theorem or algorithm.

  • I have a street map of New York City. The map throws out lots of info about the physical streets: street width, potholes, power lines and water mains, building facades, signs and stoplights, etc. But for many questions about distance or reachability on the physical city streets, I can translate the question into a query on the map. My query on the map will return reliable predictions about the physical streets, even though the map has thrown out lots of info.
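The gas example can be sketched in code. This is a toy illustration, not a real simulation: the particle list, temperature, and volume below are made-up values, and the only point is that the pressure prediction uses a three-number summary rather than the particle-level data.

```python
import random

# Hypothetical low-level model: a list of individual particle states.
# (Made-up numbers for illustration; the per-particle data is never used below.)
random.seed(0)
particles = [{"position": random.random(), "velocity": random.gauss(0, 1)}
             for _ in range(10_000)]

# Abstraction: throw away per-particle detail, keep a few summary statistics.
k_B = 1.380649e-23        # Boltzmann constant, J/K
summary = {
    "N": len(particles),  # number of particles
    "T": 300.0,           # temperature in K (assumed, not derived from the toy data)
    "V": 1.0,             # container volume in m^3
}

# High-level prediction via the ideal gas law: P = N * k_B * T / V.
pressure = summary["N"] * k_B * summary["T"] / summary["V"]
print(f"predicted pressure: {pressure:.3e} Pa")
```

The particle list never enters the pressure calculation; for this class of queries, the summary carries all the information we kept.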

The general pattern: there’s some ground-level “concrete” model (or territory), and an abstract model (or map). The abstract model throws away or ignores information from the concrete model, but in such a way that we can still make reliable predictions about some aspects of the underlying system.

Notice that the predictions of the abstract models, in most of these examples, are not perfectly accurate. We’re not dealing with the sort of “abstraction” we see in e.g. programming or algebra, where everything is exact. There are going to be probabilities involved.

In the language of embedded world-models, we’re talking about multi-level models: models which contain both a notion of “table”, and of all the pieces from which the table is built, and of all the atoms from which the pieces are built. We want to be able to use predictions from one level at other levels (e.g. predict bulk material properties from microscopic structure and/or macroscopic measurements, or predict from material properties whether it’s safe to sit on the table), and we want to move between levels consistently.

Formalization: Starting Point

To repeat the intuitive idea: an abstract model throws away or ignores information from the concrete model, but in such a way that we can still make reliable predictions about some aspects of the underlying system.

So to formalize abstraction, we first need some way to specify which “aspects of the underlying system” we wish to predict, and what form the predictions take. The obvious starting point for predictions is probability distributions. Given that our predictions are probability distributions, the natural way to specify which aspects of the system we care about is via a set of events or logic statements for which we calculate probabilities. We’ll be agnostic about the exact types for now, and just call these “queries”.

That leads to a rough construction. We start with some low-level model $M$ and a set of queries $\{Q_i\}$. From these, we construct a minimal high-level model $M^*$ by keeping exactly the information relevant to the queries, and throwing away all other information. By the minimal map theorems, we can represent $M^*$ directly by the full set of probabilities $\{P[Q_i]\}$; $M^*$ and $\{P[Q_i]\}$ contain exactly the same information. Of course, in practical examples, the probabilities will usually have some more compact representation, and the high-level model will usually contain some extraneous information as well.
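As a toy illustration of this construction (the three-coin low-level model and the two queries are invented for the example), the minimal high-level model is literally just the table of query probabilities computed from the low-level model:

```python
from itertools import product

# Hypothetical low-level model: joint distribution over three fair coin flips.
low_level = {outcome: 1 / 8 for outcome in product([0, 1], repeat=3)}

# Queries: events whose probabilities the abstraction must preserve.
queries = {
    "at_least_two_heads": lambda w: sum(w) >= 2,
    "first_flip_heads": lambda w: w[0] == 1,
}

# Minimal high-level model: just the probabilities of the queries, with all
# other information about the joint distribution thrown away.
high_level = {
    name: sum(p for w, p in low_level.items() if event(w))
    for name, event in queries.items()
}
print(high_level)
```

Anything not needed to answer the queries (e.g. correlations between the second and third flips) is simply gone from `high_level`.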

To illustrate a bit, let’s identify the low-level model, class of queries, and high-level model for a few of the examples from earlier.

  • Ideal Gas:

    • Low-level model is the full set of molecules, their interaction forces, and a distribution representing our knowledge about their initial configuration.

    • Class of queries consists of combinations of macroscopic measurements, e.g. one query might be “pressure = 12 torr & volume = 1 m^3 & temperature = 110 K”.

    • For an ideal gas, the high-level model can be represented by e.g. temperature, number of particles (of each type if the gas is mixed), and container volume. Given these values and assuming a near-equilibrium initial configuration distribution, we can predict the other macroscopic measurables in the queries (e.g. pressure).

  • Tennis:

    • Low-level model is the full microscopic configuration of me and the physical world around me as I play tennis (or whatever else I do).

    • Class of queries is hard to sharply define at this point, but includes things like “John will answer his cell phone in the next hour”, “John will hold a racket and hit a fuzzy ball in the next hour”, “John will play Civ for the next hour”, etc—all the things whose probabilities change on hearing that I’m going to play tennis.

    • High-level model is just the sentence “I am going to play tennis”.

  • Street Map:

    • Low-level model is the physical city streets.

    • Class of queries includes things like “shortest path from Times Square to Central Park starts by following Broadway”, “distance between the Met and the Hudson is less than 1 mile”, etc—all the things we can deduce from a street map.

    • High-level model is the map. Note that the physical map also includes some extraneous information, e.g. the positions of all the individual atoms in the piece of paper/smartphone.

Already with the second two examples there seems to be some “cheating” going on in the model definition: we just define the query class as all the events/logic statements whose probabilities change based on the information in the map. But if we can do that, then anything can be a “high-level map” of any “low-level territory”, with the queries taken to be the events/statements about the territory which the map actually has some information about—not a very useful definition!

Information About Things “Far Away”

In order for abstraction to actually be useful, we need some efficient way to know which queries the abstract model can accurately answer, without having to directly evaluate each query within the low-level model.

In practice, we usually seem to have a notion of which variables are “far apart”, in the sense that any interactions between the two are mediated by many in-between variables.

In this graphical model, interactions between the variables $X$ and the variables $Y$ are mediated by the noisy variables $Z$. Abstraction throws out information from $X$ which is wiped out by noise in $Z$, keeping only the information relevant to $Y$.

The mediating variables are noisy, so they wipe out most of the “fine-grained” information present in the variables of interest. We can therefore ignore that fine-grained information when making predictions about things far away. We just keep around whatever high-level signal makes it past the noise of mediating variables, and throw out everything else, so long as we’re only asking questions about far-away variables.

An example: when I type “4+3” in a python shell, I think of that as adding two numbers, not as a bunch of continuous voltages driving electric fields and current flows in little patches of metal and doped silicon. Why? Because, if I’m thinking about what will show up on my monitor after I type “4+3” and hit enter, then the exact voltages and current flows on the CPU are not relevant. This remains true even if I’m thinking about the voltages driving individual pixels in my monitor—even at a fairly low level, the exact voltages in the arithmetic-logic unit on the CPU aren’t relevant to anything more than a few microns away—except for the high-level information contained in the “numbers” passed in and out. Information about exact voltages in specific wires is quickly wiped out by noise within the chip.

Another example: if I’m an astronomer predicting the trajectory of the sun, then I’m presumably going to treat other stars as point-masses. At such long distances, the exact mass distribution within the star doesn’t really matter—except for the high-level information contained in the total mass, momentum and center-of-mass location.

Formalizing this in the same language as the previous section:

  • We have some variables $X$ and $Y$ in the low-level model.

  • Interactions between $X$ and $Y$ are mediated by noisy variables $Z$.

  • Noise in $Z$ wipes out most fine-grained information about $X$, so only a high-level summary $f(X)$ is relevant to $Y$.

Mathematically: $P[Y \mid f(X)] = P[Y \mid X]$ for any $Y$ which is “not too close” to $X$ - i.e. any $Y$ which do not overlap $Z$ (or $X$ itself). Our high-level model replaces $X$ with $f(X)$, and our set of valid queries is the whole joint distribution of $Y$ given $f(X)$.
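Here is a toy numerical check of this condition. Everything in it is invented for illustration: $X$ is a fine-grained value, the noisy mediator $Z$ only preserves the sign of $X$, and the far-away variable $Y$ is read off from $Z$, so the summary $f(X) = \mathrm{sign}(X)$ supports every query about $Y$:

```python
from fractions import Fraction

# Toy low-level model, invented for illustration:
#   X ranges over {-2, -1, 1, 2}; the noisy mediator is Z = sign(X + noise);
#   the far-away variable is Y = Z.
noise_vals = [Fraction(-1, 2), Fraction(0), Fraction(1, 2)]  # uniform noise

def f(x):
    """High-level summary of X: just its sign."""
    return 1 if x > 0 else -1

def p_y_given_x(y, x):
    """P[Y = y | X = x], computed by enumerating the mediator's noise."""
    hits = sum(1 for n in noise_vals if (1 if x + n > 0 else -1) == y)
    return Fraction(hits, len(noise_vals))

# Abstraction check: P[Y | X = x] is identical for all x with the same f(x),
# i.e. P[Y | f(X)] = P[Y | X] for this system.
for y in (-1, 1):
    assert p_y_given_x(y, -2) == p_y_given_x(y, -1)
    assert p_y_given_x(y, 1) == p_y_given_x(y, 2)
print("P[Y|X] depends on X only through f(X)")
```

The noise wipes out the magnitude of $X$ before it reaches $Y$, so keeping only the sign loses nothing for queries about $Y$.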

Now that we have two definitions, it’s time to start the Venn diagram of definitions of abstraction.

So far, we have:

  • A high-level model throws out information from a low-level model in such a way that some set of queries can still be answered correctly: $P[Q_i \mid M^*] = P[Q_i \mid M]$.

  • A high-level model throws out information from some variable in such a way that all information about “far away” variables is kept: $P[Y \mid f(X)] = P[Y \mid X]$.

Systems View

The definition in the previous section just focuses on abstracting a single variable $X$. In practice, we often want to take a system-level view, abstracting a whole bunch of low-level variables (or sets of low-level variables) all at once. This doesn’t involve changing the previous definition, just applying it to many variables in parallel.

We have multiple non-overlapping sets of low-level variables $X^L_i$, each with a set of “nearby” variables $Z_i$. Abstraction will only retain information from each $X^L_i$ relevant to $X^L_j$’s which do not overlap the corresponding $Z_i$. In particular, this means queries $P[X^L_j \mid X^L_i]$ will only be maintained by the abstraction if $X^L_j$ is not “close to” $X^L_i$ - i.e. if $X^L_j$ does not overlap $Z_i$. In the notation below, these are called $X^L$, to remind us that they are the low-level variables.

Rather than just one variable of interest $X$, we have many low-level variables (or non-overlapping sets of variables) $X^L_i$ and their high-level summaries $X^H_i = f_i(X^L_i)$. For each of the $X^L_i$, we have some set of variables $Z_i$ “nearby” $X^L_i$, which mediate its interactions with everything else. Our “far-away” variables $Y$ are now any far-away $X^L_j$’s, so we want

$P\left[X^L_J \mid X^H_I\right] = P\left[X^L_J \mid X^L_I\right]$

for any sets of indices $I$ and $J$ which are “far apart”—meaning that $X^L_J \cup Z_J$ does not overlap any $X^L_I$ or $Z_I$.

(Notation: I will use lower-case indices like $i$ for individual variables, and upper-case indices like $I$ to represent sets of variables. I will also treat any single index interchangeably with the set containing just that index.)

For instance, if we’re thinking about wires and transistors on a CPU, we might look at separate chunks of circuitry. Voltages in each chunk of circuitry are $X^L_i$, and $X^H_i$ summarizes the binary voltage values. $Z_i$ are the voltages in any components physically close to chunk $i$ on the chip. Anything physically far away on the chip will depend only on the binary voltage values in the components, not on the exact voltages.

The main upshot of all this is that we can rewrite the math in a cleaner way: as a (partial) factorization. The low-level components are conditionally independent given the high-level summaries, so:

$P\left[X^L_S \mid X^H_S\right] = \prod_{i \in S} P\left[X^L_i \mid X^H_i\right]$

This condition only needs to hold when $S$ picks out indices which are pairwise far apart (i.e. we pick out a subset of the $X^L_i$’s such that no two are “close together”). Note that we can pick any set of indices which satisfies this condition—so we really have a whole family of factorizations of marginal distributions in which no two variables are “close together”. See the appendix to this post for a proof of the formula.
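The factorization can be checked by brute-force enumeration on a toy model (all distributions here are invented for illustration). Each low-level variable is a (signal, detail) pair; the two signals are correlated with each other, each detail depends only on its own signal, and the high-level summary of each variable is its signal:

```python
from fractions import Fraction
from itertools import product

# Correlated signals (the high-level summaries), invented for illustration.
p_signals = {(0, 0): Fraction(1, 3), (1, 1): Fraction(1, 3),
             (0, 1): Fraction(1, 6), (1, 0): Fraction(1, 6)}
# P[detail | own signal]; the details are the fine-grained low-level info.
p_detail = {0: {0: Fraction(3, 4), 1: Fraction(1, 4)},
            1: {0: Fraction(1, 4), 1: Fraction(3, 4)}}

# Joint distribution over the two low-level variables X1=(s1,d1), X2=(s2,d2).
joint = {}
for (s1, s2), ps in p_signals.items():
    for d1, d2 in product([0, 1], repeat=2):
        joint[((s1, d1), (s2, d2))] = ps * p_detail[s1][d1] * p_detail[s2][d2]

# Check the factorization: conditional on both summaries (s1, s2), the
# low-level values factor as P[X1 | s1] * P[X2 | s2].
for s1, s2, d1, d2 in product([0, 1], repeat=4):
    lhs = joint[((s1, d1), (s2, d2))] / p_signals[(s1, s2)]
    rhs = p_detail[s1][d1] * p_detail[s2][d2]
    assert lhs == rhs
print("far-apart low-level variables factor given their summaries")
```

Here the factorization holds by construction; the point of the snippet is the shape of the check, which would be run against whatever joint distribution the low-level model actually produces.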

In English: any set of low-level variables $X^L_i$ which are all “far apart” are independent given their high-level summaries $X^H_i$. Intuitively, the picture looks like this:

The abstraction conditions let us swap low-level variables with their high-level summaries, as long as all swapped variables and any query variables are all “far apart”.

We pick some set of low-level variables $X^L_i$ which are all far apart, and compute their summaries $X^H_i = f_i(X^L_i)$. By construction, we have a model in which each of the high-level variables is a leaf in the graphical model, determined only by the corresponding low-level variables. But thanks to the abstraction condition, we can independently swap any subset of the summaries with their corresponding low-level variables—assuming that all of them are “far apart”.

Returning to the digital circuit example: if we pick any subset of the wires and transistors on a chip, such that no two are too physically close together, then we expect that their exact voltages are roughly independent given the high-level summary of their digital values.

We’ll add this to our Venn diagram as an equivalent formulation of the previous definition.

I have found this formulation to be the most useful starting point in most of my own thinking, and it will be the jumping-off point for our last two notions of abstraction in the next two sections.


Causal Abstraction

So far we’ve only talked about “queries” on the joint distribution of variables. Another natural step is to introduce causal structure into the low-level model, and require interventional queries to hold on far-apart variables.

There are some degrees of freedom in which interventional queries hold on far-apart variables. One obvious answer is “all of them”:

$P\left[X^L_J \mid \operatorname{do}(X^L_I = x_I)\right] = P\left[X^L_J \mid \operatorname{do}(X^H_I = f_I(x_I))\right]$

… with the same conditions on $I$ and $J$ as before, plus the added condition that the indices in $I$ and $J$ also be far apart from each other. This is the usual requirement in math/programming abstraction, but it’s too strong for many real-world applications. For instance, when thinking about fluid dynamics, we don’t expect our abstractions to hold when all the molecules in a particular cell of space are pushed into the corner of that cell. Instead, we could weaken the low-level intervention to sample from low-level states compatible with the high-level intervention:

$\operatorname{do}(X^H_I = x^H_I) \;:\; X^L_I \sim P\left[X^L_I \mid X^H_I = x^H_I\right]$

We could even have low-level interventions sample from some entirely different distribution, to reflect e.g. a physical machine used to perform the interventions.
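A minimal sketch of such a sampled intervention (the state space, weights, and summary function are all invented for illustration): a do() on the high-level summary is implemented by sampling a compatible low-level state from the base distribution, rather than forcing one particular microstate:

```python
import random

random.seed(0)

# Hypothetical low-level variable: a (signal, detail) pair; summary f = signal.
states = [(s, d) for s in (0, 1) for d in (0, 1, 2)]
weights = {(s, d): 1 + d for s, d in states}  # arbitrary base measure

def f(state):
    return state[0]

def do_high_level(target_signal):
    """Implement do(f(X) = target) by sampling a low-level state compatible
    with the high-level intervention, in proportion to its base weight."""
    compatible = [x for x in states if f(x) == target_signal]
    total = sum(weights[x] for x in compatible)
    r = random.uniform(0, total)
    for x in compatible:
        r -= weights[x]
        if r <= 0:
            return x
    return compatible[-1]

samples = [do_high_level(1) for _ in range(1000)]
assert all(f(x) == 1 for x in samples)  # the intervention is respected...
assert len(set(samples)) > 1            # ...but the microstate still varies
print("sampled", len(set(samples)), "distinct low-level states for do(f(X)=1)")
```

The high-level constraint pins down only the summary; the remaining low-level degrees of freedom are filled in stochastically.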

Another post will talk more about this, but it turns out that we can say quite a bit about causal abstraction while remaining agnostic to the details of the low-level interventions. Any of the above interventional query requirements have qualitatively-similar implications, though obviously some are stronger than others.

In day-to-day life, causal abstraction is arguably more common than non-causal. In fully deterministic problems, validity of interventional queries is essentially the only constraint (though often in guises which do not explicitly mention causality, e.g. functional behavior or logic). For instance, suppose I want to write a python function to sort a list. The only constraint is the abstract input/output behavior, i.e. the behavior of the designated “output” under interventions on the designated “inputs”. The low-level details—i.e. the actual steps performed by the algorithm—are free to vary, so long as those high-level interventional constraints are satisfied.
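The sorting example in code. The two implementations below (insertion sort and selection sort, arbitrary choices for illustration) differ in their low-level steps but satisfy the same high-level constraint: identical outputs under any intervention on the input:

```python
def sort_by_insertion(xs):
    """One low-level implementation: insertion sort."""
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

def sort_by_selection(xs):
    """A different low-level implementation: selection sort."""
    rest, out = list(xs), []
    while rest:
        m = min(rest)
        rest.remove(m)
        out.append(m)
    return out

# The abstract constraint: for any intervention on the input, both
# implementations produce the same output. The internal steps vary freely.
inputs = [[3, 1, 2], [], [5, 5, 1], [2, -7, 4, 0]]
for xs in inputs:
    assert sort_by_insertion(xs) == sort_by_selection(xs) == sorted(xs)
print("same high-level behavior, different low-level details")
```

Only the input/output relation is pinned down; everything else about the two functions is free to differ.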

This generalizes to other design/engineering problems: the desired behavior of a system is usually some abstract, high-level behavior under interventions. Low-level details are free to vary so long as the high-level constraints are satisfied.

Exact Abstraction

Finally, one important special case. In math and programming, we typically use abstractions with sharper boundaries than most of those discussed here so far. Prototypical examples:

  • A function in programming: behavior of everything outside the function is independent of the function’s internal variables, given a high-level summary containing only the function’s inputs and outputs. Same for private variables/methods of a class.

  • Abstract algebra: many properties of mathematical objects hold independent of the internal details of the object, given certain high-level summary properties—e.g. the group axioms, or the ring axioms, or …

  • Interfaces for abstract data structures: the internal organization of the data structure is irrelevant to external users, given the abstract “interface”—a high-level summary of the object’s behavior under different inputs (a.k.a. different interventions).

In these cases, there are no noisy intermediate variables, and no notion of “far away” variables. There’s just a hard boundary: the internal details of high-level abstract objects do not interact with things of interest “outside” the object except via the high-level summaries.

We can easily cast this as a special case of our earlier notion of abstraction: the set $Z$ of noisy intermediate variables is empty. The “high-level summary” $f(X)$ of the low-level variables $X$ contains all information relevant to any variables outside of $X$ itself.
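A sketch of an exact abstraction boundary in code (the queue interface and both implementations are invented for illustration): every externally observable result is determined by the interface alone, so the internal representation can change freely:

```python
from collections import deque

class ListQueue:
    """FIFO queue backed by a plain list (O(n) pops)."""
    def __init__(self):
        self._items = []        # internal detail, hidden behind the interface
    def push(self, x):
        self._items.append(x)
    def pop(self):
        return self._items.pop(0)

class DequeQueue:
    """FIFO queue backed by collections.deque (O(1) pops)."""
    def __init__(self):
        self._items = deque()   # different internals, same interface
    def push(self, x):
        self._items.append(x)
    def pop(self):
        return self._items.popleft()

def run(queue_cls, ops):
    """Apply a sequence of interface calls and record observable outputs."""
    q, outputs = queue_cls(), []
    for op, arg in ops:
        if op == "pop":
            outputs.append(q.pop())
        else:
            q.push(arg)
    return outputs

# Exact abstraction: any sequence of interface calls yields identical
# observable results, regardless of the internal representation.
ops = [("push", 1), ("push", 2), ("pop", None),
       ("push", 3), ("pop", None), ("pop", None)]
assert run(ListQueue, ops) == run(DequeQueue, ops) == [1, 2, 3]
print("internal details hidden behind an exact interface")
```

Unlike the noisy examples earlier, nothing probabilistic is involved: the boundary is sharp, and the interface is the entire summary.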

Of course, exact abstraction overlaps quite a bit with causal abstraction. Exact abstractions in math/programming are typically deterministic, so they’re mainly constrained by interventional predictions rather than distributional predictions.


Summary

We started with a very general notion of abstraction: we take some low-level model and abstract it into a high-level model by throwing away information in such a way that we can still accurately answer some queries. This is extremely general, but in order to actually be useful, we need some efficient way to know which queries are and are not supported by the abstraction.

That brought us to our next definition: abstraction keeps information relevant to “far away” variables. We imagine that interactions between the variable-to-be-abstracted $X$ and things far away are mediated by some noisy “nearby” variables $Z$, which wipe out most of the information in $X$. So, we can support all queries on things far away by keeping only a relatively small summary $f(X)$.

Applying this definition to a whole system, rather than just one variable, we find a clean formulation: all sets of far-apart low-level variables are independent given the corresponding high-level summaries.

Next, we extended this to causal abstraction by requiring that interventional queries also be supported.

Finally, we briefly mentioned the special case in which there are no noisy intermediate variables, so the abstraction boundary is sharp: there’s just the variables to be abstracted, and everything outside of them. This is the usual notion of abstraction in math and programming.

Appendix: System Formulation Proof

We start with two pieces. By construction, $X^H_I$ is calculated entirely from $X^L_I$, so

$P\left[X^L_J \mid X^L_I, X^H_I\right] = P\left[X^L_J \mid X^L_I\right]$

… without any restriction on which subsets of the variables we look at. Then we also have the actual abstraction condition

$P\left[X^L_J \mid X^H_I\right] = P\left[X^L_J \mid X^L_I\right]$

… as long as $X^L_J \cup Z_J$ does not overlap $X^L_I$ or $Z_I$.

We want to show that

$P\left[X^L_S \mid X^H_S\right] = \prod_{i \in S} P\left[X^L_i \mid X^H_i\right]$

… for any set $S$ of non-nearby variables (i.e. all indices in $S$ pairwise far apart). In English: sets of far-apart low-level variables are independent given their high-level counterparts.

Let’s start with definitions of “far-apart” and “nearby”, so we don’t have to write them out every time:

  • Two sets of indices $I$ and $J$ are “far apart” if $X^L_I$ and $Z_I$ do not overlap $X^L_J \cup Z_J$, and vice-versa. Individual indices can be treated as sets containing one element for purposes of this definition—so e.g. two indices or an index and a set of indices could be “far apart”.

  • Indices and/or sets of indices are “nearby” if they are not far apart.

As before, I will use capital letters for sets of indices and lower-case letters for individual indices, and I won’t distinguish between a single index and the set containing just that index.

With that out of the way, we’ll prove a lemma:

$P\left[X^L_J \mid X^L_K, X^H_{K'}\right] = P\left[X^L_J \mid X^H_K, X^H_{K'}\right]$

… for any $K$ far apart from $J$, and far apart from $K'$ as well (though $J$ and $K'$ need not be far apart from each other). This lets us swap high-level with low-level given variables as we wish, so long as they’re all far apart from each other and from the query variables. Proof:

$P\left[X^L_J, X^H_{K'} \mid X^L_K\right] = \sum_{x} \mathbb{1}\left[X^H_{K'} = f_{K'}(x)\right] P\left[X^L_J, X^L_{K'}{=}x \mid X^L_K\right]$ (by construction)

$= \sum_{x} \mathbb{1}\left[X^H_{K'} = f_{K'}(x)\right] P\left[X^L_J, X^L_{K'}{=}x \mid X^H_K\right]$ (by abstraction)

$= P\left[X^L_J, X^H_{K'} \mid X^H_K\right]$ (by construction)

By taking $J$ empty and then marginalizing out unused variables, this becomes

$P\left[X^H_{K'} \mid X^L_K\right] = P\left[X^H_{K'} \mid X^H_K\right]$

That’s the first half of our lemma. Other half:

$P\left[X^L_J \mid X^L_K, X^H_{K'}\right] = \frac{P\left[X^L_J, X^H_{K'} \mid X^L_K\right]}{P\left[X^H_{K'} \mid X^L_K\right]}$ (by Bayes)

$= \frac{P\left[X^L_J, X^H_{K'} \mid X^H_K\right]}{P\left[X^H_{K'} \mid X^H_K\right]}$ (by first half)

$= P\left[X^L_J \mid X^H_K, X^H_{K'}\right]$ (by Bayes)

That takes care of the lemma.

Armed with the lemma, we can finish the main proof by iterating through the variables inductively:

$P\left[X^L_i \mid X^H_i, X^L_K, X^H_{K'}\right] = \frac{P\left[X^L_i, X^H_i \mid X^L_K, X^H_{K'}\right]}{P\left[X^H_i \mid X^L_K, X^H_{K'}\right]}$ (by Bayes)

$= \frac{P\left[X^H_i \mid X^L_i\right] P\left[X^L_i \mid X^L_K, X^H_{K'}\right]}{P\left[X^H_i \mid X^L_K, X^H_{K'}\right]}$ (by construction)

$= \frac{P\left[X^H_i \mid X^L_i\right] P\left[X^L_i \mid X^H_K, X^H_{K'}\right]}{P\left[X^H_i \mid X^H_K, X^H_{K'}\right]}$ (by lemma)

$= P\left[X^L_i \mid X^H_i, X^H_K, X^H_{K'}\right]$ (by Bayes)

$= P\left[X^L_i \mid X^H_i, X^L_{K \cup K'}\right]$ (by lemma)

$= P\left[X^L_i \mid X^H_i\right]$ (by abstraction & cancellation)

Here $i$, $K$, and $K'$ are all far apart. Starting with empty $K$ and applying this formula to each variable $X^L_i$ in $S$, one-by-one (with $K$ the indices already processed and $K'$ the rest of $S$), the chain rule $P\left[X^L_S \mid X^H_S\right] = \prod_i P\left[X^L_i \mid X^L_{<i}, X^H_S\right]$ reduces to the desired factorization, completing the proof.