A tractable, interpretable formulation of approximate conditioning for pairwise-specified probability distributions over truth values

János Kramár 3 Jun 2015 19:08 UTC

LW: 3 AF: 2

These results from my conversations with Charlie Steiner at the May 29-31 MIRI Workshop on Logical Uncertainty will primarily be of interest to people who’ve read section 2.4 of Paul Christiano’s Non-Omniscience paper.

If we write a reasoner that keeps track of probabilities of a collection of sentences $φ_{1}, \dots, φ_{n}$ (that grows and shrinks as the reasoner explores), we need some way of tracking known relationships between the sentences. One way of doing this is to store the pairwise probability distributions, ie not only $Pr (φ_{i})$ for all $i$ but also $Pr (φ_{i} \land φ_{j})$ for all $i, j$ .

If we do this, a natural question to ask is: how can we update this data structure if we learn that eg $φ_{1}$ is true?

We’ll refer to the updated probabilities as $Pr (\cdot | φ_{1})$ .

It’s fairly reasonable for us to want to set $Pr (φ_{i} | φ_{1}) := Pr (φ_{i} \land φ_{1}) / Pr (φ_{1})$ ; however, it’s less clear what values to assign to $Pr (φ_{i} \land φ_{j} | φ_{1})$ , because we haven’t stored $Pr (φ_{i} \land φ_{j} \land φ_{1})$ .

One option would be to find the maximum entropy distribution over truth assignments to $φ_{1}, \dots, φ_{n}$ under the constraint that the stored pairwise distributions are correct. This seems intractable for large $n$ ; however, in the spirit of locality, we could restrict our attention to the joint truth value distribution of $φ_{1}, φ_{i}, φ_{j}$ . Maximizing its entropy is simple (it boils down to either convex optimization or solving a cubic), and yields a plausible candidate for $Pr (φ_{i} \land φ_{j} \land φ_{1})$ that we can derive $Pr (φ_{i} \land φ_{j} | φ_{1})$ from. I’m not sure what global properties this has, for example whether it yields a positive semidefinite matrix $(Pr (φ_{i} \land φ_{j}))_{i, j}$ .
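The local maximum-entropy step can be sketched in a few lines. With the three singleton and three pairwise probabilities fixed, the joint distribution over $φ_{1}, φ_{i}, φ_{j}$ has a single free parameter $t = Pr (φ_{i} \land φ_{j} \land φ_{1})$ , and the entropy is concave in $t$ . The sketch below finds the maximum by bisection on the derivative instead of solving the cubic directly; the function name `maxent_triple` and its argument names are hypothetical, chosen only for this illustration.

```python
import math

def maxent_triple(p1, pi, pj, p1i, p1j, pij, iters=200):
    """Return t = Pr(phi_1 & phi_i & phi_j) that maximizes the entropy of the
    joint distribution of three sentences, given their singleton and pairwise
    probabilities. (Illustrative helper, not taken from the paper.)"""
    # Each of the 8 atoms is (offset + t) or (offset - t).
    plus = [0.0,                      # atom 111 = t
            p1 - p1i - p1j,           # atom 100
            pi - p1i - pij,           # atom 010
            pj - p1j - pij]           # atom 001
    minus = [p1i, p1j, pij,           # atoms 110, 101, 011
             1 - p1 - pi - pj + p1i + p1j + pij]  # atom 000
    lo = max(-c for c in plus)        # keeps every "plus" atom nonnegative
    hi = min(minus)                   # keeps every "minus" atom nonnegative
    if lo > hi:
        raise ValueError("no joint distribution matches these marginals")
    if lo == hi:
        return lo                     # the pairwise data already determine t

    def entropy_slope(t):
        # dH/dt = sum(log of minus atoms) - sum(log of plus atoms)
        return (sum(math.log(c - t) for c in minus)
                - sum(math.log(c + t) for c in plus))

    a, b = lo, hi
    for _ in range(iters):
        mid = 0.5 * (a + b)
        if mid <= a or mid >= b:      # interval has shrunk to adjacent floats
            break
        if entropy_slope(mid) > 0:
            a = mid
        else:
            b = mid
    return 0.5 * (a + b)

# Pairwise-independent inputs: the maximizer is the fully independent joint.
t = maxent_triple(0.5, 0.4, 0.3, 0.2, 0.15, 0.12)
print(t, t / 0.5)  # t and the derived Pr(phi_i & phi_j | phi_1)
```

Setting the slope to zero says that the product of the four "minus" atoms equals the product of the four "plus" atoms; the quartic terms cancel, which is where the cubic comes from.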

A different option, as noted in section 2.4.2, is to observe that the matrix $(Pr (φ_{i} \land φ_{j}))_{i, j}$ must be positive semidefinite under any joint distribution for the truth values. This means we can consider a zero-mean multivariate normal distribution with this matrix as its covariance; then there’s a closed-form expression for the Kullback-Leibler divergence of two such distributions, and this can be used to define a sort of conditional distribution, as is done in section 2.4.3.

However, as the paper remarks, this isn’t a very familiar way of defining these updated probabilities. For example, it lacks the desirable property that $Pr (φ_{i} | φ_{1}) = Pr (φ_{i} \land φ_{1}) / Pr (φ_{1})$ .

Fortunately, there is a natural construction that combines these ideas: namely, if we consider the maximum-entropy distribution for the truth assignment vector $(1_{φ_{1}}, \dots, 1_{φ_{n}})$ with the given first moments $E (1_{φ_{i}})$ and second moments $E (1_{φ_{i}} 1_{φ_{j}})$ , but relax the requirement that their values be in $\{0, 1\}$ , then we find a multivariate normal distribution $N \left( {(Pr (φ_{i}))}_{i}, {(Pr (φ_{i} \land φ_{j}) - Pr (φ_{i}) Pr (φ_{j}))}_{i, j} \right) .$ If we wish to update this distribution after observing $φ_{1}$ by finding the candidate distribution $(1_{φ_{1}}, \dots, 1_{φ_{n}} | φ_{1})$ of highest relative entropy with $Pr (1_{φ_{1}} = 1 | φ_{1}) = 1$ , as proposed in the paper, then we will get the multivariate normal conditional distribution $N \left( {\left( \frac{Pr (φ_{1} \land φ_{i})}{Pr (φ_{1})} \right)}_{i}, {\left( Pr (φ_{i} \land φ_{j}) - Pr (φ_{i}) Pr (φ_{j}) - \frac{(Pr (φ_{1} \land φ_{i}) - Pr (φ_{1}) Pr (φ_{i})) (Pr (φ_{1} \land φ_{j}) - Pr (φ_{1}) Pr (φ_{j}))}{Pr (φ_{1}) - Pr (φ_{1})^{2}} \right)}_{i, j} \right) .$
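As a minimal sketch of this update, assuming the stored data are a list `p` of singleton probabilities and a matrix `pair` of pairwise probabilities (both names, and the function name, are hypothetical), the conditional mean and covariance can be computed directly from the standard normal conditioning formulas:

```python
def condition_on_first(p, pair):
    """Relaxed-normal update after observing the first sentence.

    p[i]       = Pr(phi_i)
    pair[i][j] = Pr(phi_i & phi_j), with pair[i][i] = p[i]
    Index 0 plays the role of phi_1 in the text.
    Returns (mean, cov) of the normal conditioned on 1_{phi_1} = 1.
    """
    n = len(p)
    # Prior covariance: Pr(phi_i & phi_j) - Pr(phi_i) Pr(phi_j)
    cov = [[pair[i][j] - p[i] * p[j] for j in range(n)] for i in range(n)]
    v = cov[0][0]  # equals Pr(phi_1) - Pr(phi_1)^2
    if v <= 0:
        raise ValueError("phi_1 must have probability strictly between 0 and 1")
    # Conditional mean: mu_i + cov_i1 / v * (1 - mu_1)
    mean_c = [p[i] + cov[i][0] * (1 - p[0]) / v for i in range(n)]
    # Conditional covariance: Schur complement of the (1, 1) entry
    cov_c = [[cov[i][j] - cov[i][0] * cov[0][j] / v for j in range(n)]
             for i in range(n)]
    return mean_c, cov_c

# Illustrative numbers only.
p = [0.5, 0.4, 0.3]
pair = [[0.5, 0.3, 0.1],
        [0.3, 0.4, 0.1],
        [0.1, 0.1, 0.3]]
mean_c, cov_c = condition_on_first(p, pair)
print(mean_c)        # mean_c[i] agrees with pair[0][i] / p[0]
print(cov_c[1][1])   # compare with mean_c[1] * (1 - mean_c[1])
```

With these numbers the conditional mean of the second sentence is 0.6, but its conditional variance is 0.20 rather than the 0.24 that a $\{0, 1\}$ -valued variable with mean 0.6 would have.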

Note that this generally has $Var (1_{φ_{i}} \mid φ_{1}) \neq E (1_{φ_{i}} \mid φ_{1}) (1 - E (1_{φ_{i}} \mid φ_{1}))$ , which is a mismatch; this is related to the fact that a conditional variance in a multivariate normal is never higher than the marginal variance, which is an undesirable feature for a distribution over truth-values.

This is also related to other undesirable features; for example, if we condition on more than one sentence, we can arrive at conditional probabilities outside of $[0, 1]$ . (For example if 3 sentences have $Pr (φ_{1}) = Pr (φ_{2}) = Pr (φ_{3}) = \frac{1}{3}, Pr (φ_{1} \land φ_{2}) = Pr (φ_{1} \land φ_{3}) = Pr (φ_{2} \land φ_{3}) = ε$ then this yields $Pr (φ_{3} | φ_{1}, φ_{2}) = \frac{- 1 + 15 ε}{1 + 9 ε} \approx - 1$ ; this makes sense because this prior is very confident that $1_{φ_{1}} + 1_{φ_{2}} + 1_{φ_{3}} \approx 1$ , with standard deviation $\sqrt{6 ε}$ .)
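The three-sentence example can be checked with exact rational arithmetic. The sketch below conditions the relaxed normal on the first two coordinates being 1 by solving the 2×2 system with Cramer's rule; the function names are hypothetical and the value $ε = 1/100$ is chosen only for illustration.

```python
from fractions import Fraction

def mean_given_first_two(mu, cov):
    """E[x_3 | x_1 = 1, x_2 = 1] for a trivariate normal (indices 0, 1, 2)."""
    r0, r1 = 1 - mu[0], 1 - mu[1]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    # Solve [[c00, c01], [c10, c11]] z = (r0, r1) by Cramer's rule.
    z0 = (r0 * cov[1][1] - cov[0][1] * r1) / det
    z1 = (cov[0][0] * r1 - cov[1][0] * r0) / det
    return mu[2] + cov[2][0] * z0 + cov[2][1] * z1

def example(eps):
    p = Fraction(1, 3)
    mu = [p, p, p]
    # Diagonal: Pr(phi_i) - Pr(phi_i)^2; off-diagonal: eps - Pr(phi_i) Pr(phi_j)
    cov = [[(p if i == j else eps) - p * p for j in range(3)] for i in range(3)]
    return mu, cov

eps = Fraction(1, 100)
mu, cov = example(eps)
cond = mean_given_first_two(mu, cov)
print(cond, (-1 + 15 * eps) / (1 + 9 * eps))      # both print -85/109
print(sum(cov[i][j] for i in range(3) for j in range(3)), 6 * eps)  # variance of the sum
```

The last line confirms that the prior variance of $1_{φ_{1}} + 1_{φ_{2}} + 1_{φ_{3}}$ is $6 ε$ , matching the stated standard deviation.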

Intermediate relaxations that lack these particular shortcomings are possible, such as the ones that restrict the relaxed $1_{φ_{1}}, \dots, 1_{φ_{n}}$ to the sphere $\sum_{i} (2 x_{i} - 1)^{2} = n$ or ball $\sum_{i} (2 x_{i} - 1)^{2} \leq n$ . Then the maximum entropy distribution, similarly to a multivariate normal distribution, has a quadratic log-density, though the Hessian of the quadratic may have nonnegative eigenvalues (unlike in the normal case, where it must be negative definite). In the spherical case, this is known as a Fisher-Bingham distribution.

Both of these relaxations seem difficult to work with, eg to compute normalizing constants for; furthermore I don’t think the analogous updating process will share the desirable property that $Pr (φ_{i} | φ_{1}) = Pr (φ_{i} \land φ_{1}) / Pr (φ_{1})$ . However, the fact that these distributions allow updating by relaxed conditioning, keep (fully conditioned) truth-values between 0 and 1, and have reasonable (at least, possibly-increasing) behavior for conditional variances, makes them seem potentially appealing.


János Kramár 20 Jun 2015 16:34 UTC
LW: 2 AF: 1
AF
There is a lot more to say about the perspective that isn’t relaxed to continuous random variables. In particular, the problem of finding the maximum entropy joint distribution that agrees with particular pairwise distributions is closely related to Markov Random Fields and the Ising model. (The relaxation to continuous random variables is a Gaussian Markov Random Field.) It is easily seen that this maximum entropy joint distribution must have the form $\log Pr (1_{φ_{1}}, \dots, 1_{φ_{n}}) = \sum_{i < j} θ_{i j} 1_{φ_{i} \land φ_{j}} + \sum_{i} θ_{i} 1_{φ_{i}} - \log Z$ where $Z$ is the normalizing constant, or partition function. This is an appealing distribution to use, and easy to do conditioning on and to add new variables to. Computing relative entropy reduces to finding bivariate marginals and to computing $Z$ , and computing marginals reduces to computing $Z$ , which is intractable in general[^istrail], though easy if the Markov graph (ie the graph with edges $i j$ for $θ_{i j} \neq 0$ ) is a forest.
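To make the cost contrast concrete, here is a small sketch that computes $Z$ and the singleton marginals by brute-force enumeration (exponential in $n$ ), and compares $Z$ with a linear-time recursion for the special case where the Markov graph is a chain, which is one kind of forest. The function names and the parameter layout (`theta1` for the $θ_{i}$ , an upper-triangular `theta2` for the $θ_{i j}$ ) are assumptions made for this illustration.

```python
import itertools
import math

def brute_force(theta1, theta2):
    """Partition function Z and marginals Pr(phi_i) by summing all 2^n states."""
    n = len(theta1)
    states = list(itertools.product((0, 1), repeat=n))
    weights = []
    for x in states:
        s = sum(theta1[i] * x[i] for i in range(n))
        s += sum(theta2[i][j] * x[i] * x[j]
                 for i in range(n) for j in range(i + 1, n))
        weights.append(math.exp(s))
    Z = sum(weights)
    marginals = [sum(w for x, w in zip(states, weights) if x[i]) / Z
                 for i in range(n)]
    return Z, marginals

def chain_partition(theta1, chain):
    """Z for a chain graph, where chain[i] couples sentences i and i + 1."""
    # msg[v] = total weight of partial assignments whose last variable is v
    msg = [1.0, math.exp(theta1[0])]
    for i in range(1, len(theta1)):
        new0 = msg[0] + msg[1]
        new1 = math.exp(theta1[i]) * (msg[0] + msg[1] * math.exp(chain[i - 1]))
        msg = [new0, new1]
    return msg[0] + msg[1]

# Illustrative parameters: a 3-node chain 0 - 1 - 2.
theta1 = [0.2, -0.5, 1.0]
chain = [0.7, -1.2]
theta2 = [[0.0, 0.7, 0.0],
          [0.0, 0.0, -1.2],
          [0.0, 0.0, 0.0]]
Z, marginals = brute_force(theta1, theta2)
print(Z, chain_partition(theta1, chain))  # the two values should agree
print(marginals)
```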

There have been many approaches to this problem (Wainwright and Jordan[^wainwright] is a good survey), but the main ways to extend the applicability from forests have been:
- decompose components of the graph as “junction trees”, ie trees whose nodes are overlapping clusters of nodes from the original graph; this permits exact computation with cost exponential in the cluster-sizes, ie in the treewidth[^pearl]
- make use of clever combinatorial work coming out of statistical mechanics to do exact computation on “outerplanar” graphs, or on general graphs with cost exponential in the (outer-)graph genus[^schraudolph]
- find nodes such that conditioning on those nodes greatly simplifies the graph (eg makes it singly connected), and sum over their possible values explicitly (this has cost exponential in the number of nodes being conditioned on)

A newer class of models, called sum-product networks[^poon], generalizes these and other models by writing the total joint probability as a positive polynomial $1 = \sum_{x_{1}, \dots, x_{n} = 0}^{1} Pr (1_{φ_{1}} = x_{1}, \dots, 1_{φ_{n}} = x_{n}) 1_{φ_{1}}^{x_{1}} 1_{\bar{φ}_{1}}^{1 - x_{1}} \dots 1_{φ_{n}}^{x_{n}} 1_{\bar{φ}_{n}}^{1 - x_{n}}$ in the variables $1_{φ_{1}}, 1_{\bar{φ}_{1}}, \dots, 1_{φ_{n}}, 1_{\bar{φ}_{n}}$ and requiring only that this polynomial be simplifiable to an expression requiring a tractable number of additions and multiplications to evaluate. This allows easy computation of marginals, conditionals, and KL divergence, though it will likely be necessary to do some approximate simplification every so often (otherwise the complexity may accumulate, even with a fixed maximum number of sentences being considered at a time).
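As a toy illustration of how such a polynomial yields marginals and conditionals, consider one very simple network: a weighted sum of products of per-sentence factors (a mixture of independent distributions). This particular structure, and the names `mixture_spn`, `weights` and `q`, are assumptions made for the sketch; real sum-product networks can be much deeper.

```python
def mixture_spn(weights, q, lam, lam_bar):
    """Evaluate sum_k w_k * prod_i (q[k][i] * lam[i] + (1 - q[k][i]) * lam_bar[i]).

    lam[i] and lam_bar[i] are the indicator variables for phi_i being true and
    false. Evaluation costs O(K * n) operations instead of 2^n.
    """
    total = 0.0
    for w, qk in zip(weights, q):
        prod = 1.0
        for qi, l, lb in zip(qk, lam, lam_bar):
            prod *= qi * l + (1 - qi) * lb
        total += w * prod
    return total

# Illustrative parameters: two mixture components over three sentences.
weights = [0.3, 0.7]
q = [[0.9, 0.8, 0.1],
     [0.2, 0.3, 0.6]]
ones = [1, 1, 1]

# All indicators set to 1 sums over every assignment, giving total mass 1.
total = mixture_spn(weights, q, ones, ones)
# Zeroing the "false" indicator of a sentence restricts the sum to it being true.
pr_1 = mixture_spn(weights, q, ones, [0, 1, 1])    # Pr(phi_1)
pr_12 = mixture_spn(weights, q, ones, [0, 0, 1])   # Pr(phi_1 & phi_2)
print(total, pr_1, pr_12 / pr_1)                   # last value is Pr(phi_2 | phi_1)
```

Each marginal or conditional costs one or two evaluations of the network, which is the sense in which these queries are easy once the polynomial has a compact form.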

However, if we want to stay close to the context of the Non-Omniscience paper, we can do approximate calculations of the partition function on the complete graph. In particular, the Bethe partition function[^weller] has been widely used in practice, and while it’s not log-convex like $Z$ is, it’s often a better approximation to the partition function than well-known convex approximations such as the tree-reweighted (TRW) approximation.

[^istrail]: Istrail, Sorin. “Statistical mechanics, three-dimensionality and NP-completeness: I. Universality of intractability for the partition function of the Ising model across non-planar surfaces.” In Proceedings of the thirty-second annual ACM symposium on Theory of computing, pp. 87-96. ACM, 2000.

[^weller]: Weller, Adrian. “Bethe and Related Pairwise Entropy Approximations.”

[^pearl]: Pearl, Judea. “Probabilistic reasoning in intelligent systems: Networks of plausible reasoning.” (1988).

[^schraudolph]: Schraudolph, Nicol N., and Dmitry Kamenetsky. “Efficient Exact Inference in Planar Ising Models.” arXiv preprint arXiv:0810.4401 (2008).

[^wainwright]: Wainwright, Martin J., and Michael I. Jordan. “Graphical models, exponential families, and variational inference.” Foundations and Trends® in Machine Learning 1, no. 1-2 (2008): 1-305.

[^poon]: Poon, Hoifung, and Pedro Domingos. “Sum-product networks: A new deep architecture.” In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pp. 689-690. IEEE, 2011.
János Kramár 11 Jun 2015 17:31 UTC
LW: 1 AF: 1
AF
An easy way to get rid of the probabilities-outside-[0,1] problem in the continuous relaxation is to constrain the “conditional”/updated distribution to have $Var (1_{φ_{i}} \mid \dots) \leq E (1_{φ_{i}} \mid \dots) (1 - E (1_{φ_{i}} \mid \dots))$ (which is a convex constraint; it’s equivalent to $Var (1_{φ_{i}} \mid \dots) + {(E (1_{φ_{i}} \mid \dots) - \frac{1}{2})}^{2} \leq \frac{1}{4}$ ), and then minimize KL-divergence accordingly.

The two obvious flaws are that the result of updating becomes ordering-dependent (though this may not be a problem in practice), and that the updated distribution will sometimes have $Var (1_{φ_{i}} \mid \dots) < E (1_{φ_{i}} \mid \dots) (1 - E (1_{φ_{i}} \mid \dots))$ , and it’s not clear how to interpret that.
János Kramár 12 Jun 2015 1:49 UTC
0 points
AF
Actually, on further thought, I think the best thing to use here is a log-bilinear distribution over the space of truth-assignments. For these, it is easy to efficiently compute exact normalizing constants, conditional distributions, marginal distributions, and KL divergences; there is no impedance mismatch. KL divergence minimization here is still a convex minimization (in the natural parametrization of the exponential family).

The only shortcoming is that 0 is not a probability, so it won’t let you eg say that $Pr (φ_{1} \to φ_{2}) = 1$ ; but this can be remedied using a real or hyperreal approximation.