Abstraction = Information at a Distance

Why is abstraction useful? Why use a high-level model rather than a low-level model?

An example: when I type “4+3” in a python shell, I think of that as adding two numbers, not as a bunch of continuous voltages driving electric fields and current flows in little patches of metal and doped silicon. Why? Because, if I’m thinking about what will show up on my monitor after I type “4+3” and hit enter, then the exact voltages and current flows on the CPU are not relevant. This remains true even if I’m thinking about the voltages driving individual pixels in my monitor—even at a fairly low level, the exact voltages in the arithmetic-logic unit on the CPU aren’t relevant to anything more than a few microns away—except for the high-level information contained in the “numbers” passed in and out.

Another example: if I’m an astronomer predicting the trajectory of the sun, then I’m presumably going to treat other stars as point-masses. At such long distances, the exact mass distribution within the star doesn’t really matter—except for the high-level information contained in the total mass and center-of-mass location.
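To make the point-mass example concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: a hypothetical two-dimensional "star" built from three lumps of mass, units with G = 1, and the helper names `accel_from_masses`, `point_mass_summary` and `relative_error`. It compares the exact gravitational acceleration with the one computed from only the total mass and center of mass.

```python
import math

def accel_from_masses(masses, positions, target):
    """Newtonian acceleration at `target` from a set of point masses (G = 1)."""
    ax = ay = 0.0
    for m, (x, y) in zip(masses, positions):
        dx, dy = x - target[0], y - target[1]
        r = math.hypot(dx, dy)
        ax += m * dx / r**3
        ay += m * dy / r**3
    return ax, ay

def point_mass_summary(masses, positions):
    """The high-level summary: total mass and center-of-mass location."""
    total = sum(masses)
    cx = sum(m * x for m, (x, _) in zip(masses, positions)) / total
    cy = sum(m * y for m, (_, y) in zip(masses, positions)) / total
    return total, (cx, cy)

def relative_error(masses, positions, target):
    """Relative error of the point-mass approximation at `target`."""
    exact = accel_from_masses(masses, positions, target)
    total, com = point_mass_summary(masses, positions)
    approx = accel_from_masses([total], [com], target)
    err = math.hypot(exact[0] - approx[0], exact[1] - approx[1])
    return err / math.hypot(*exact)

# Hypothetical lopsided "star": three lumps within about one unit of the origin.
masses = [1.0, 2.0, 0.5]
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

for distance in (10.0, 100.0, 1000.0):
    print(distance, relative_error(masses, positions, (distance, 0.0)))
```

The printed error shrinks as the observer moves away: the farther the target, the less the internal mass distribution matters beyond what the summary already carries.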

If I’m running a Markov-Chain Monte Carlo algorithm, then I take sample points fairly far apart in “time”. As long as they’re far enough apart, they’re roughly independent—there isn’t any information from one sample relevant to the next.
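The "far enough apart" claim can be checked exactly on a toy chain. The sketch below is not an MCMC sampler; it assumes a hypothetical two-state Markov chain (the matrix `P` and its stationary distribution `pi` are made up for illustration) and computes the mutual information between the state now and the state `lag` steps later.

```python
import math

def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mutual_information_at_lag(P, pi, lag):
    """I(X_0; X_lag) in bits for a chain started from its stationary distribution pi."""
    n = len(P)
    # Start from the identity and multiply by P `lag` times to get the lag-step transitions.
    Pt = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(lag):
        Pt = mat_mul(Pt, P)
    mi = 0.0
    for i in range(n):
        for j in range(n):
            joint = pi[i] * Pt[i][j]
            if joint > 0:
                # joint / (pi[i] * pi[j]) simplifies to Pt[i][j] / pi[j]
                mi += joint * math.log2(Pt[i][j] / pi[j])
    return mi

# Hypothetical sticky two-state chain and its stationary distribution.
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2 / 3, 1 / 3]

for lag in (1, 5, 20):
    print(lag, mutual_information_at_lag(P, pi, lag))
```

The mutual information falls toward zero as the lag grows, which is the sense in which widely spaced samples carry almost no information about each other.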

If I’m planning a roadtrip from San Francisco to Los Angeles, the details of my route through the Bay Area are irrelevant to planning my route within LA—except for the high-level information contained in my choice of highway for the middle leg of the trip and the rough time I expect to get there.

General point: abstraction, in practice, is about keeping information which is relevant to things “far away”, and throwing out everything else.

Formalization

Let’s start with a bunch of random variables $X_{i}$ , and some notion of which variables are “nearby”: each variable $X_{i}$ has a set $s_{i}$ of indices of variables considered “nearby” $X_{i}$ . How $s_{i}$ is chosen may vary by application—maybe each $X_{i}$ is associated with some point in space and/or time, or maybe we’re looking at Markov blankets in a graphical model, or …

We want some high-level summary of $X_{i}$; we’ll define that by a function $f_{i}(X_{i})$. We require that $f_{i}(X_{i})$ contain all information relevant to things far away, i.e. $X_{\bar{s}_{i}}$, the variables not in $s_{i}$.

We’ll consider a few different notions of “relevance” here. First and most obvious is predictive relevance: $f_{i}(X_{i})$ must contain all relevant information in the usual probabilistic/information-theoretic sense. Key subtlety: which information is relevant may itself depend on the values of other variables. For example, maybe we have a conditional in a program which picks one of two variables to return. Should we keep around all information which is relevant in any possible case? All information which is relevant after averaging over some variables?

Looking back over the examples, I think the natural answer is: we’re keeping information relevant to things “far away” (i.e. variables not in $s_{i}$ ), so those are what we’re interested in. Everything within $s_{i}$ we can average over. Examples:

If I’m thinking about the relevance of the exact value of a particular voltage on the CPU to the output of “4+3” at a python shell, I’m happy to average over all the likely voltages of nearby wires.
If I’m thinking about the relevance of a little piece of mass in a star to predicting the sun’s trajectory, I’m happy to average over all the other little pieces of mass.
If I’m planning a trip from SF to LA, and figuring out the relevance of a particular detail of the route within the Bay Area, I’m happy to average over other details in the Bay Area.

Formally, our condition is:

$P[X_{\bar{s}_{i}} | f_{i}(X_{i})] = P[X_{\bar{s}_{i}} | X_{i}]$
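As a sanity check, the condition can be tested directly on a small discrete model. The sketch below assumes a hypothetical toy setup: $X_{i}$ is a pair of fair bits, the far-away variable is a single bit `y` equal to their XOR but flipped with probability 0.1, and any nearby variables have already been averaged out of the joint table. The helper names are made up for illustration.

```python
from collections import defaultdict
from itertools import product

def conditional(joint, key):
    """Return {key(x): {y: P[y | key(x)]}} from a joint table {(x, y): prob}."""
    totals = defaultdict(float)
    table = defaultdict(lambda: defaultdict(float))
    for (x, y), p in joint.items():
        k = key(x)
        totals[k] += p
        table[k][y] += p
    return {k: {y: p / totals[k] for y, p in row.items()}
            for k, row in table.items()}

def is_sufficient_summary(joint, f, tol=1e-12):
    """Check that P[Y | f(X)] equals P[Y | X] for every x in the table."""
    given_x = conditional(joint, lambda x: x)
    given_f = conditional(joint, f)
    ys = {y for (_, y) in joint}
    return all(
        abs(given_x[x].get(y, 0.0) - given_f[f(x)].get(y, 0.0)) <= tol
        for x in given_x for y in ys
    )

def xor_of_bits(x):
    return x[0] ^ x[1]

def first_bit(x):
    return x[0]

# Hypothetical joint distribution over (x, y), with x = (a, b).
joint = {}
for a, b, y in product((0, 1), repeat=3):
    joint[((a, b), y)] = 0.25 * (0.9 if y == (a ^ b) else 0.1)

print(is_sufficient_summary(joint, xor_of_bits))  # parity keeps everything relevant to y
print(is_sufficient_summary(joint, first_bit))    # the first bit alone does not
```

In this toy model the parity passes the check and the first bit fails it, since the first bit on its own says nothing about `y`.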

We could even go a step further and apply the minimal map theorems to find $f_{i} (X_{i})$ containing the least possible information, although it won’t necessarily be the most computationally efficient summary.
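One way to picture a least-information summary, offered as an assumption about what the construction looks like here rather than a statement of the theorems themselves: map each value of $X_{i}$ to its own conditional distribution over the far-away variables, so that two values share a label exactly when they predict the same thing. The sketch below does this on a hypothetical joint table; `minimal_summary` and the toy numbers are made up for illustration.

```python
from collections import defaultdict

def minimal_summary(joint, digits=12):
    """Map each x to a label shared by all x' with the same P[Y | X = x']."""
    totals = defaultdict(float)
    rows = defaultdict(lambda: defaultdict(float))
    for (x, y), p in joint.items():
        totals[x] += p
        rows[x][y] += p
    labels, summary = {}, {}
    for x in sorted(rows):
        # The conditional distribution itself, rounded to absorb float noise,
        # identifies the equivalence class of x.
        signature = tuple(sorted(
            (y, round(p / totals[x], digits)) for y, p in rows[x].items()
        ))
        summary[x] = labels.setdefault(signature, len(labels))
    return summary

# Hypothetical model: x is uniform on 0..5 and the far-away bit y is 1 with
# probability 0.8 when x is even and 0.3 when x is odd.
joint = {}
for x in range(6):
    p_one = 0.8 if x % 2 == 0 else 0.3
    joint[(x, 1)] = p_one / 6
    joint[(x, 0)] = (1 - p_one) / 6

print(minimal_summary(joint))
```

Here six values of `x` collapse to two labels, one for even and one for odd values. The lookup is not necessarily a cheap function of `x` in general, which matches the caveat about computational efficiency.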

Another notion of “relevance” is causal influence. While probabilistic information is the key criterion for prediction, causal influence is the key criterion for planning. We want to know what impact an intervention on $X_{i}$ will have on far-away variables. We’re still happy to average over “nearby” variables, but there’s a new subtlety: we may also want to intervene on some of the variables far away from $X_{i}$. For instance, if we’re planning a road trip, we want to be able to consider possible route plans within LA; different routes would be different interventions on variables far away from SF. Our high-level model needs to hold for any of these interventions. Our criterion becomes:

$P[X_{\bar{s}_{i}} | do(f_{i}(X_{i}) = f_{i}^{*}), do(X_{s} = X_{s}^{*})] = P[X_{\bar{s}_{i}} | do(X_{i} = X_{i}^{*}), do(X_{s} = X_{s}^{*})]$

… for any $s \subset \bar{s}_{i}$, and any intervention values $X_{s}^{*}, f_{i}^{*}, X_{i}^{*}$ for which $f_{i}^{*} = f_{i}(X_{i}^{*})$. Here $do(f_{i}(X_{i}) = f_{i}^{*})$ means setting $X_{i}$ to an arbitrary value $X_{i}^{*}$ such that $f_{i}(X_{i}^{*}) = f_{i}^{*}$, i.e. “we just need to get to the highway by noon, the details don’t matter, we can work them out later”. This requires that the details do not, in fact, matter: $P[X_{\bar{s}_{i}} | do(X_{i} = X_{i}^{*}), do(X_{s} = X_{s}^{*})]$ must have the same value for different $X_{i}^{*}$ so long as $f_{i}(X_{i}^{*})$ remains the same. That’s what the notation $do(f_{i}(X_{i}) = f_{i}^{*})$ is expressing.
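The road-trip version of this requirement can be sketched as a tiny structural model. Everything here is hypothetical: the street-route names, the driving times, the delay distribution, and the helper names. The check asks whether interventions on the Bay Area leg that share the same summary value always produce the same distribution over the LA arrival hour, for every intervention on the LA route.

```python
from itertools import product

# Hypothetical traffic delay in hours, with probabilities.
NOISE = {0: 0.7, 1: 0.3}

def highway_hour(x):
    """Candidate summary: the hour at which we reach the highway."""
    return x[1]

def street_route(x):
    """Candidate summary that keeps the street route and drops the hour."""
    return x[0]

def far_distribution(x_star, la_route_star):
    """Distribution of the LA arrival hour under do(X_i = x_star), do(X_s = la_route_star)."""
    _streets, hour = x_star
    drive = 6  # hypothetical highway driving time in hours
    la_leg = {"surface": 2, "freeway": 1}[la_route_star]
    dist = {}
    for delay, p in NOISE.items():
        arrival = hour + drive + la_leg + delay
        dist[arrival] = dist.get(arrival, 0.0) + p
    return dist

def summary_supports_interventions(f, x_values, la_routes):
    """True if x* values with equal f(x*) always give the same far-away distribution."""
    for la_route in la_routes:
        seen = {}
        for x in x_values:
            dist = far_distribution(x, la_route)
            # Remember the first distribution seen for each summary value.
            if seen.setdefault(f(x), dist) != dist:
                return False
    return True

x_values = list(product(("streets_a", "streets_b", "streets_c"), (11, 12, 13)))
la_routes = ("surface", "freeway")

print(summary_supports_interventions(highway_hour, x_values, la_routes))
print(summary_supports_interventions(street_route, x_values, la_routes))
```

In this toy model the arrival hour at the highway is a valid summary and the street route alone is not, because the details of the streets never reach LA except through the time we hit the highway.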

Finally, we could combine our criteria: require that any interventions on $X_{\bar{s}_{i}}$ be supported, with either information or intervention on $X_{i}$. The criteria:

$P[X_{\bar{s}_{i}} | f_{i}(X_{i}), do(X_{s} = X_{s}^{*})] = P[X_{\bar{s}_{i}} | X_{i}, do(X_{s} = X_{s}^{*})]$

$P[X_{\bar{s}_{i}} | do(f_{i}(X_{i}) = f_{i}^{*}), do(X_{s} = X_{s}^{*})] = P[X_{\bar{s}_{i}} | do(X_{i} = X_{i}^{*}), do(X_{s} = X_{s}^{*})]$

Both of these must hold for any $s \subset \bar{s}_{i}$, and any intervention values $X_{s}^{*}, f_{i}^{*}, X_{i}^{*}$ for which $f_{i}^{*} = f_{i}(X_{i}^{*})$. In that case, we can predict the effects of arbitrary interventions on $X_{i}$ and any of the $X_{\bar{s}_{i}}$ on the other $X_{\bar{s}_{i}}$, using only the summary information present in $f_{i}(X_{i})$.