A Correspondence Theorem in the Maximum Entropy Framework

Classical mechanics didn’t work any less well once we discovered quantum mechanics; Galilean relativity and Newtonian gravity didn’t work any less well once we discovered special and general relativity; etc. This is the correspondence principle, aka Egan’s Law: in general, to the extent that old models match reality, new models must reproduce the old.

This sounds like it should be a Theorem, not just a Law—a “correspondence theorem”.

This post presents such a theorem within one particular framework—maximum entropy. It’s not the theorem I’d eventually like—even my correspondence theorem from last month has more substance to it. But this one is simple, intuitive, and both its strong points and shortcomings help to illustrate what I want out of a correspondence theorem. (In fact, I wrote down this one before the other correspondence theorem, and the shortcomings of this theorem partially inspired that one.)

Background: Principle of Maximum Entropy

By “maximum entropy framework”, I mean roughly the technique used by Jaynes—see e.g. the widget problem in Logic of Science, page 440. If you’ve seen the principle of maximum entropy in the context of statistical mechanics, it’s the same thing. I expect a lot of people who are otherwise familiar with maximum entropy distributions are not familiar with this particular framework (especially outside of a physics context), so a bit of background is in order.

We’ll start with the widget problem: a company produces red, green and blue widgets, and for inventory purposes we wish to predict how many of each color will be sold on a given day. Based on sales aggregated over the past year, the average sales per day are roughly 40 green, 20 red, and 10 blue. Given only this information, what distribution should we assign to tomorrow’s sales?

In a high-school statistics class, the answer would be “insufficient information”—the averages are not enough info to figure out the whole distribution. But this isn’t a high-school statistics class. We’re Bayesians, we’ve received relevant information, we have to update somehow.

Jaynes argues that the canonically-correct way to do this is maximum entropy: we find the model which has the highest possible entropy, subject to the constraints $E[\#\text{green}] = 40$, $E[\#\text{red}] = 20$, and $E[\#\text{blue}] = 10$. A couple ways to interpret this idea:

  • The classical argument from stat mech: the vast majority of distributions consistent with the constraints are very close to the maximum entropy distribution. Equivalently, we implicitly have a (non-normalizable) uniform prior over distribution-space which we’re updating (via Bayes’ rule) based on the expectations.

  • The argument from information: maximizing entropy means minimizing the information assumed in the model. By maximizing entropy subject to the expectation constraint, we’re accounting for the expectation but assuming as little as possible aside from that (in an information-theoretic sense).

Regardless of how we interpret it, the math works out the same. If our variable is $X$ and our constraints are $E[f_j(X)] = \bar{f}_j$, then the maximum entropy distribution is

$$P[X] = \frac{1}{Z} e^{\sum_j \lambda_j (f_j(X) - \bar{f}_j)}$$

where

$$Z(\lambda) = \sum_X e^{\sum_j \lambda_j (f_j(X) - \bar{f}_j)}$$

and $\lambda$ is the minimizing argument of $Z(\lambda)$. You can find derivations and discussions and all that in any stat mech book, or in Jaynes. For the widget problem above, $X$ would be the colors of each widget ordered in a day, $f_1(X)$ would be the number of green widgets ordered, and the constraints would say $E[f_1(X)] = 40$, etc. To compute the $\lambda$’s, we’d evaluate the sum (or integral) analytically and then minimize (see Jaynes).
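To make the machinery concrete, here is a minimal numerical sketch (mine, not Jaynes’s) of the dual minimization. It uses a deliberately simplified version of the widget problem in which the state is just the number of green widgets sold in a day, capped at 100, with the single constraint $E[\#\text{green}] = 40$; the helper name `maxent_fit` and all other details are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def maxent_fit(features, targets):
    """Maximum-entropy distribution over a finite state space.

    features: (n_features, n_states) array; row j holds f_j evaluated at each state.
    targets:  (n_features,) array of target averages, i.e. constraints E[f_j(X)] = targets[j].
    Returns (P, Z): the fitted distribution and the normalizer at the optimal lambda.
    """
    centered = features - targets[:, None]
    log_Z = lambda lam: logsumexp(lam @ centered)        # log Z(lambda), convex in lambda
    lam = minimize(log_Z, x0=np.zeros(len(targets))).x   # lambda = argmin Z(lambda)
    P = np.exp(lam @ centered - log_Z(lam))
    return P, np.exp(log_Z(lam))

# Simplified widget problem: state = number of green widgets sold in a day.
xs = np.arange(101)
P, Z = maxent_fit(np.stack([xs]), np.array([40.0]))
print("E[#green] under the max-ent model:", (P * xs).sum())   # ~40
```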

Anyway, the main point here is that we can now “update” on certain kinds of information by adding constraints to our maximum entropy problem. For instance, if we find out that the daily variance in green widget sales is 10, then we’d add a constraint saying $E[(f_1(X) - 40)^2] = 10$. Our maximum entropy distribution would then have one additional $\lambda$ and an additional term in the exponent. All written out, we’d go from

$$P[X] = \frac{1}{Z} e^{\lambda_1 (f_1(X) - 40) + \lambda_2 (f_2(X) - 20) + \lambda_3 (f_3(X) - 10)}$$

to

$$P[X] = \frac{1}{Z} e^{\lambda_1 (f_1(X) - 40) + \lambda_2 (f_2(X) - 20) + \lambda_3 (f_3(X) - 10) + \lambda_4 ((f_1(X) - 40)^2 - 10)}$$

… and we’d have to solve the modified minimization problem to find the new $\lambda$’s and $Z$.
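Continuing the toy sketch from above (same caveats, same illustrative `maxent_fit` helper): the variance “update” is literally just one more feature row and one more $\lambda$ in the same minimization.

```python
# Continuing the earlier sketch: add E[(#green - 40)^2] = 10 as a second feature
# alongside E[#green] = 40, and re-solve the same minimization problem.
feats = np.stack([xs, (xs - 40.0) ** 2])
targets = np.array([40.0, 10.0])
P_v, Z_v = maxent_fit(feats, targets)
print("mean:", (P_v * xs).sum())                              # ~40
print("E[(#green - 40)^2]:", (P_v * (xs - 40.0) ** 2).sum())  # ~10
```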

Conceptual takeaway: rather than updating on individual data points, in this framework we’re given a sequence of summaries of “features” of the dataset, of the form “the expectation of $f_j$ is $\bar{f}_j$” (found by e.g. computing the average of $f_j$ over a large data set). Each such feature becomes a new constraint in an optimization problem. This turns out to be equivalent to a Bayesian update in situations where a Bayesian update makes sense, but is more general—roughly speaking, it can work directly with virtual evidence updates.

Correspondence Theorem for Maximum Entropy Updates

On to the interesting part.

Let’s imagine that two analysts are (separately) building maximum-entropy models of some data. They each query the giant database for average values of certain features: $\bar{f}^1$ for the first analyst’s features $f^1$, $\bar{f}^2$ for the second analyst’s features $f^2$. They end up with two models:

  • $M_1$ is the maximum-entropy model with features $E[f^1(X)] = \bar{f}^1$

  • $M_2$ is the maximum-entropy model with features $E[f^2(X)] = \bar{f}^2$

We’ll assume that both of these are “correct”, in the sense that the average values $\bar{f}^1$ and $\bar{f}^2$ actually do match the data.

Let’s say that model 2 is “better” than model 1, in the sense that it has better predictive power on the real data-generating process $P^*$: $E_{X \sim P^*}[\ln P[X|M_2]] > E_{X \sim P^*}[\ln P[X|M_1]]$. (Up to sign and a constant factor, this is the average number of bits used by each model to encode a data point from $P^*$.) So, the analysts’ boss plans to just use model 2. But let’s stretch the story to AI-alignment-style concerns: what if model 2 is using some weird ontology? What if the things the company cares about are easy to express in terms of the features $f^1$, but hard to express in terms of the features $f^2$?

Now for the claim.

We have two possibilities. Either:

  • We can construct a third model $M'$ which has strictly better predictive power than $M_2$, OR

  • The features $\bar{f}^1$ are already implied by $M_2$; those features are already “in there” in some sense.

The proof will show the sense in which this is true.

Proof

The obvious thing to do is combine the two models into a single maximum-entropy model $M'$ with both the features $f^1$ and $f^2$. How does the predictive power of this model look?

For maximum-entropy models in general, the predictive power has a simple expression, assuming the $\bar{f}$ values are correct for $P^*$ (i.e. $E_{P^*}[f(X)] = \bar{f}$):

$$E_{P^*}[\ln P[X|M]] = E_{P^*}\left[\sum_j \lambda_j (f_j(X) - \bar{f}_j) - \ln Z\right] = -\ln Z$$

… so it’s just the negative log of the normalizer $Z$. So, $M'$ has higher predictive power than $M_2$ if-and-only-if $Z' < Z_2$.
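As a quick numerical sanity check on this identity, here is a sketch (again reusing the illustrative `maxent_fit` helper from the first code block): pick an arbitrary “true” distribution $P^*$, hand the model the exact feature averages under $P^*$, and compare the model’s predictive power against $-\ln Z$.

```python
# Reusing maxent_fit from the first sketch.
states = np.arange(6)
rng = np.random.default_rng(0)
p_star = rng.dirichlet(np.ones(6))                   # a made-up "true" distribution P*

f = np.stack([states, states ** 2]).astype(float)    # features: X and X^2
f_bar = f @ p_star                                   # their exact averages under P*

P, Z = maxent_fit(f, f_bar)
print("E_{P*}[ln P[X|M]]:", p_star @ np.log(P))
print("-ln Z:            ", -np.log(Z))              # should match
```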

Now, recall that $Z$ comes from a minimization problem. Specifically:

$$Z_2 = \min_{\lambda^2} \sum_X e^{\lambda^2 \cdot (f^2(X) - \bar{f}^2)}$$

$$Z' = \min_{\lambda^1, \lambda^2} \sum_X e^{\lambda^1 \cdot (f^1(X) - \bar{f}^1) + \lambda^2 \cdot (f^2(X) - \bar{f}^2)}$$

Key thing to notice: the objective for $Z_2$ is just the objective for $Z'$ with $\lambda^1$ set to zero. In other words: the space which model $M_2$ searches for minima is a subset of the space which model $M'$ searches for minima. Thus, $Z'$ is always at least as small as $Z_2$; model $M'$ has predictive power at least as high as $M_2$.
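Here is the same point numerically (illustrative, reusing `maxent_fit` and the toy setup style from the previous sketch): adding analyst 1’s features to analyst 2’s model can only shrink the minimized normalizer, and the combined model reproduces analyst 1’s averages by construction.

```python
# Reusing maxent_fit from the first sketch.
states = np.arange(6)
rng = np.random.default_rng(1)
p_star = rng.dirichlet(np.ones(6))               # true data-generating distribution P*

f1 = np.stack([(states % 2).astype(float)])      # analyst 1's feature: parity of X
f2 = np.stack([states.astype(float)])            # analyst 2's feature: X itself
f1_bar, f2_bar = f1 @ p_star, f2 @ p_star        # both sets of averages are "correct"

_, Z_2 = maxent_fit(f2, f2_bar)
P_c, Z_c = maxent_fit(np.vstack([f1, f2]), np.concatenate([f1_bar, f2_bar]))
print("Z' <= Z_2:", Z_c <= Z_2 + 1e-9)           # M' predicts at least as well as M_2
print("E_{M'}[f^1] vs f^1-bar:", f1 @ P_c, f1_bar)
```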

Furthermore, let’s assume that the optima in these problems are unique—that’s not necessarily the case, but it is usually true in practice. (The objectives are convex, so uniqueness of the minimum can only fail in specific ways—I’ll leave the “fun” of working that out as an exercise to the reader.) We know that the objective for $Z'$ reduces to the objective for $Z_2$ when $\lambda^1 = 0$; if the optima are unique and $Z' = Z_2$, then that means the optimal $\lambda^1$ is indeed 0, so $M' = M_2$.

… but $M'$ has to satisfy the constraints $E_{M'}[f^1(X)] = \bar{f}^1$. So if $M' = M_2$, then $M_2$ also satisfies those constraints. That’s the sense in which the features $\bar{f}^1$ are “already present” in $M_2$: $E_{M_2}[f^1(X)] = \bar{f}^1$.

So, we have two cases:

  • $Z' < Z_2$: $M'$ has strictly better predictive power than $M_2$ (i.e. $E_{P^*}[\ln P[X|M']] > E_{P^*}[\ln P[X|M_2]]$)

  • $Z' = Z_2$: the features from model 1 are already implicit in model 2 (i.e. $E_{M_2}[f^1(X)] = \bar{f}^1$)

What’s Really Going On Here?

If we strip away the math, the underlying phenomenon here seems kind of trivial.

The key is that we assume $\bar{f}$ is correct for the true data-generating process $P^*$. We justify this by imagining that we have some very large number of data points, so the law of large numbers kicks in and we can correctly estimate averages of our features. We’re not just collecting noisy data points; we’re directly learning facts about the true distribution, and we’re learning those facts with perfect certainty.

So our theorem is saying something like… two models both contain some facts about the (observable) true distribution. Either:

  • we can combine them into a strictly better model which contains all the facts from both, OR

  • all the facts from one model are already contained in the other, and the combined model makes the same predictions as the “better” original model.

(A warning, however: this intuitive story is not perfect. Even if the combined model makes the same predictions as one of the original models about the data, it can still update differently as we learn new facts.)

Is This Trivial?

Let’s go back to the starting point: it all adds up to normality. New models need to reproduce the old models in all the places where the old models worked—otherwise the new models are strictly suboptimal.

The obvious-but-trivial formalization of this is that new models have to make the same predictions about the data as the old models, in all the places where the old models predicted correctly. Corollary: any features (i.e. functions) of the data correctly predicted by the old models must also be correctly predicted by the new models.

… and that’s basically what we’ve proven. Within the maximum entropy framework, any features of the data (specifically long-run average values) correctly predicted by an “old” model must also be correctly predicted by a “new” model, else the new model is strictly suboptimal. So in that sense, it seems pretty trivial.

However, there are two senses in which it’s nontrivial. First, in the case where the new model incorrectly predicts some feature-values encoded in the old model, we’ve explicitly constructed a new model which outperforms the old. It’s even a pretty simple, clean model—just another maximum entropy model.

Second, even the “trivial” idea that new models must make the same predictions about the data in places where the old model was right can cover some pretty nontrivial cases, because “features” of the data distribution can be pretty nontrivial. For instance, we can have a whole class of features of the form $f_{x_1 x_2}(X) = \mathbb{1}[X_1 = x_1]\mathbb{1}[X_2 = x_2]$, with target values $\bar{f}_{x_1 x_2} = P^*[X_1 = x_1] P^*[X_2 = x_2]$. With infinitely many such constraints (one for each value pair), we can encode independence of $X_1$ and $X_2$. After all, independence of two observed variables is a property of the data distribution, so it’s something we can use as a “feature”. Likewise for conditional independence. (Of course, once we get into infinite-feature territory we do need to be more careful about applying the theory to real, finite data sets...)
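To illustrate in code (with the same caveat about infinite vs. finite feature families, and again reusing the illustrative `maxent_fit` helper): for two small discrete variables, one indicator-product feature per value pair, with target values equal to products of marginals, forces the fitted max-ent joint to factorize.

```python
# Reusing maxent_fit from the first sketch.
import itertools

states = list(itertools.product(range(3), range(2)))   # X = (X1, X2)
p1 = np.array([0.5, 0.3, 0.2])                          # made-up marginal of X1
p2 = np.array([0.6, 0.4])                               # made-up marginal of X2

# One feature per value pair: f_{ij}(X) = 1[X1 = i] * 1[X2 = j],
# with target value P[X1 = i] * P[X2 = j].
feats = np.array([[float(x1 == i and x2 == j) for (x1, x2) in states]
                  for i in range(3) for j in range(2)])
targets = np.array([p1[i] * p2[j] for i in range(3) for j in range(2)])

P_joint, _ = maxent_fit(feats, targets)
print(P_joint.reshape(3, 2))        # ~ the product distribution p1[i] * p2[j]
print(np.outer(p1, p2))             # for comparison
```

Of course, in this tiny example the constraints pin down the whole joint, so there is nothing left for entropy maximization to do; the point is just that “$X_1$ and $X_2$ are independent” is expressible purely as expectation constraints.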