# A tractable, interpretable formulation of approximate conditioning for pairwise-specified probability distributions over truth values

These results from my conversations with Charlie Steiner at the May 29-31 MIRI Workshop on Logical Uncertainty will primarily be of interest to people who've read section 2.4 of Paul Christiano's Non-Omniscience paper.

If we write a reasoner that keeps track of probabilities of a collection of sentences (that grows and shrinks as the reasoner explores), we need some way of tracking known relationships between the sentences. One way of doing this is to store the pairwise probability distributions, i.e. not only $\Pr(\phi_i)$ for all $i$ but also $\Pr(\phi_i \wedge \phi_j)$ for all $i, j$.

If we do this, a natural question to ask is: how can we update this data structure if we learn that e.g. $\phi_1$ is true?

We'll refer to the updated probabilities as $\Pr'$.

It's fairly reasonable for us to want to set $\Pr'(\phi_i) = \Pr(\phi_i \mid \phi_1)$; however, it's less clear what values to assign to $\Pr'(\phi_i \wedge \phi_j)$, because we haven't stored $\Pr(\phi_1 \wedge \phi_i \wedge \phi_j)$.

One option would be to find the maximum entropy distribution over truth assignments to $\phi_1, \dots, \phi_n$ under the constraint that the stored pairwise distributions are correct. This seems intractable for large $n$; however, in the spirit of locality, we could restrict our attention to the joint truth value distribution of $\phi_1, \phi_i, \phi_j$. Maximizing its entropy is simple (it boils down to either convex optimization or solving a cubic), and yields a plausible candidate for $\Pr(\phi_1 \wedge \phi_i \wedge \phi_j)$ that we can derive $\Pr'(\phi_i \wedge \phi_j)$ from. I'm not sure what global properties this has, for example whether it yields a positive semidefinite matrix $\big(\Pr'(\phi_i \wedge \phi_j)\big)_{i,j}$.
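To make this concrete, here is a minimal numerical sketch (illustrative, not from the paper; all names are mine) of the local maximum-entropy problem, solved by gradient descent on the dual of the entropy maximization over the 8 truth assignments of the triple:

```python
# Sketch: maximum-entropy joint distribution over truth values (t1, ti, tj)
# whose singleton and pairwise moments match the stored probabilities.
# Solved via gradient descent on the convex dual; names are illustrative.
import itertools
import numpy as np

def maxent_triple(p1, pi, pj, p1i, p1j, pij, steps=20000, lr=0.5):
    atoms = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)
    # One feature per constrained moment: E[t_a] and E[t_a * t_b].
    feats = np.column_stack([
        atoms[:, 0], atoms[:, 1], atoms[:, 2],
        atoms[:, 0] * atoms[:, 1],
        atoms[:, 0] * atoms[:, 2],
        atoms[:, 1] * atoms[:, 2],
    ])
    target = np.array([p1, pi, pj, p1i, p1j, pij])
    theta = np.zeros(6)  # natural parameters of the exponential family
    for _ in range(steps):
        logits = feats @ theta
        p = np.exp(logits - logits.max())
        p /= p.sum()
        theta -= lr * (feats.T @ p - target)  # dual gradient: E[f] - target
    return atoms, p

# Example: phi_1 independent of phi_i and phi_j, which are also independent.
atoms, p = maxent_triple(0.6, 0.5, 0.5, 0.3, 0.3, 0.25)
# Derive the updated pairwise probability:
# Pr'(phi_i & phi_j) = Pr(phi_1 & phi_i & phi_j) / Pr(phi_1).
p_all = p[(atoms == 1).all(axis=1)].sum()
p_updated = p_all / 0.6
```

Since the constraints here are realizable by a product distribution, the maximum-entropy solution is that product distribution, and the update leaves $\Pr'(\phi_i \wedge \phi_j)$ at $0.25$.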

A different option, as noted in section 2.4.2, is to observe that the matrix $\big(\Pr(\phi_i \wedge \phi_j)\big)_{i,j}$ must be positive semidefinite under any joint distribution for the truth values, since it is the second-moment matrix of the truth-assignment vector. This means we can consider a zero-mean multivariate normal distribution with this matrix as its covariance; then there's a closed-form expression for the Kullback-Leibler divergence of two such distributions, and this can be used to define a sort of conditional distribution, as is done in section 2.4.3.
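For reference, the closed-form expression in question is the standard KL divergence between two multivariate normals; a sketch with my own naming:

```python
# Sketch: standard closed-form KL divergence between multivariate normals
# N(mu0, S0) and N(mu1, S1) (in the paper's construction both are zero-mean).
import numpy as np

def gaussian_kl(mu0, S0, mu1, S1):
    """KL(N(mu0, S0) || N(mu1, S1)) for nonsingular covariances."""
    n = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0)
                  + diff @ S1_inv @ diff
                  - n
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))
```

As expected, this is zero when the two distributions coincide and positive otherwise.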

However, as the paper remarks, this isn't a very familiar way of defining these updated probabilities. For example, it lacks the desirable property that $\Pr'(\phi_1) = 1$.

Fortunately, there is a natural construction that combines these ideas: namely, if we consider the maximum-entropy distribution for the truth assignment vector $t \in \{0,1\}^n$ with the given first and second moments $\mathbb{E}[t_i] = \Pr(\phi_i)$ and $\mathbb{E}[t_i t_j] = \Pr(\phi_i \wedge \phi_j)$, but relax the requirement that the values $t_i$ be in $\{0,1\}$, then we find a multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$ with $\mu_i = \Pr(\phi_i)$ and $\Sigma_{ij} = \Pr(\phi_i \wedge \phi_j) - \Pr(\phi_i)\Pr(\phi_j)$. If we wish to update this distribution after observing $t_1 = 1$ by finding the candidate distribution of minimal relative entropy (KL divergence) to it, as proposed in the paper, then we will get the multivariate normal conditional distribution
$$t_{2:n} \mid (t_1 = 1) \;\sim\; \mathcal{N}\!\left(\mu_{2:n} + \frac{1 - \mu_1}{\Sigma_{11}}\,\Sigma_{2:n,1},\;\; \Sigma_{2:n,2:n} - \frac{1}{\Sigma_{11}}\,\Sigma_{2:n,1}\Sigma_{1,2:n}\right).$$

Note that this generally has $\mathbb{E}'[t_i^2] \neq \mathbb{E}'[t_i]$, which is a mismatch (for genuine truth values $t_i^2 = t_i$); this is related to the fact that a conditional variance in a multivariate normal is never higher than the marginal variance, which is an undesirable feature for a distribution over truth values.

This is also related to other undesirable features; for example, if we condition on more than one sentence, we can arrive at conditional probabilities outside of $[0,1]$. (For example, if 3 sentences have $\Pr(\phi_i) = \frac{1}{3}$ and $\Pr(\phi_i \wedge \phi_j) = 0$, then conditioning on $\phi_1$ and $\phi_2$ yields $\Pr'(\phi_3) = -1$; this makes sense because this prior is certain that $t_1 + t_2 + t_3 = 1$, the sum having mean 1 and standard deviation 0.)
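A minimal sketch of this failure mode (standard multivariate normal conditioning; the numerical setup is my reconstruction): three pairwise mutually exclusive sentences, each with marginal probability $\frac{1}{3}$, conditioned on the first two being true.

```python
# Sketch: conditioning the Gaussian relaxation can leave [0, 1].
# Setup: Pr(phi_i) = 1/3, Pr(phi_i & phi_j) = 0; condition on t1 = t2 = 1.
import numpy as np

def condition_gaussian(mu, S, idx, values):
    """Condition N(mu, S) on t[idx] == values; returns the conditional
    mean and covariance of the remaining coordinates."""
    rest = [i for i in range(len(mu)) if i not in idx]
    gain = S[np.ix_(rest, idx)] @ np.linalg.inv(S[np.ix_(idx, idx)])
    mu_c = mu[rest] + gain @ (np.array(values) - mu[idx])
    S_c = S[np.ix_(rest, rest)] - gain @ S[np.ix_(idx, rest)]
    return mu_c, S_c

p, pij = 1.0 / 3.0, 0.0
mu = np.full(3, p)
S = np.full((3, 3), pij - p * p)  # off-diagonal covariances: 0 - 1/9
np.fill_diagonal(S, p - p * p)    # variances: 1/3 - 1/9 = 2/9
mu_c, S_c = condition_gaussian(mu, S, [0, 1], [1.0, 1.0])
# mu_c[0] is the updated "probability" of phi_3: it comes out to -1,
# with conditional variance 0.
```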

Intermediate relaxations that lack these particular shortcomings are possible, such as the ones that restrict the relaxed $t$ to the sphere $\|t - \frac{1}{2}\mathbf{1}\| = \frac{\sqrt{n}}{2}$ or the ball $\|t - \frac{1}{2}\mathbf{1}\| \le \frac{\sqrt{n}}{2}$, both of which contain all of the 0-1 truth assignments. Then the maximum entropy distribution, similarly to a multivariate normal distribution, has quadratic log-density, though the Hessian of the quadratic may have nonnegative eigenvalues (unlike in the normal case). In the spherical case, this is known as a Fisher-Bingham distribution.

Both of these relaxations seem difficult to work with, e.g. to compute normalizing constants for; furthermore, I don't think the analogous updating process will share the desirable property that $\Pr'(\phi_1) = 1$. However, the fact that these distributions allow updating by relaxed conditioning, keep (fully conditioned) truth values between 0 and 1, and have reasonable (at least, possibly-increasing) behavior for conditional variances makes them seem potentially appealing.

• There is a lot more to say about the perspective that isn't relaxed to continuous random variables. In particular, the problem of finding the maximum entropy joint distribution that agrees with particular pairwise distributions is closely related to Markov Random Fields and the Ising model. (The relaxation to continuous random variables is a Gaussian Markov Random Field.) It is easily seen that this maximum entropy joint distribution must have the form
$$\Pr(t) = \frac{1}{Z}\exp\Big(\sum_i \theta_i t_i + \sum_{i<j} \theta_{ij} t_i t_j\Big),$$
where $Z$ is the normalizing constant, or partition function. This is an appealing distribution to use, and easy to do conditioning on and to add new variables to. Computing relative entropy reduces to finding bivariate marginals and to computing $\log Z$, and computing marginals reduces to computing $Z$, which is intractable in general[^istrail], though easy if the Markov graph (i.e. the graph with an edge $\{i,j\}$ for each $\theta_{ij} \neq 0$) is a forest.

There have been many approaches to this problem (Wainwright and Jordan[^wainwright] is a good survey), but the main ways to extend the applicability beyond forests have been:

• decompose components of the graph as "junction trees", i.e. trees whose nodes are overlapping clusters of nodes from the original graph; this permits exact computation with cost exponential in the cluster sizes, i.e. in the treewidth[^pearl]

• make use of clever combinatorial work coming out of statistical mechanics to do exact computation on "outerplanar" graphs, or on general graphs with cost exponential in the (outer-)graph genus[^schraudolph]

• find nodes such that conditioning on those nodes greatly simplifies the graph (e.g. makes it singly connected), and sum over their possible values explicitly (this has cost exponential in the number of nodes being conditioned on)

A newer class of models, called sum-product networks[^poon], generalizes these and other models by writing the total joint probability as a positive polynomial in the variables and requiring only that this polynomial be simplifiable to an expression requiring a tractable number of additions and multiplications to evaluate. This allows easy computation of marginals, conditionals, and KL divergence, though it will likely be necessary to do some approximate simplification every so often (otherwise the complexity may accumulate, even with a fixed maximum number of sentences being considered at a time).

However, if we want to stay close to the context of the Non-Omniscience paper, we can do approximate calculations of the partition function on the complete graph. In particular, the Bethe partition function[^weller] has been widely used in practice, and while it's not log-convex like $Z$ is, it's often a better approximation to the partition function than well-known convex approximations such as TRW.
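To make the forest case concrete, here is a sketch (names mine) of the $O(n)$ dynamic-programming computation of $Z$ on a chain, which is a special case of a forest, checked against $O(2^n)$ brute-force enumeration:

```python
# Sketch: partition function of the pairwise form on a chain, where
# w[i] couples t_i and t_{i+1}. Dynamic programming is O(n); the
# brute-force check is O(2^n) and only feasible for small n.
import itertools
import numpy as np

def brute_force_Z(theta, w):
    """Sum exp(sum_i theta_i t_i + sum_i w_i t_i t_{i+1}) over all t."""
    n = len(theta)
    Z = 0.0
    for t in itertools.product([0, 1], repeat=n):
        score = sum(theta[i] * t[i] for i in range(n))
        score += sum(w[i] * t[i] * t[i + 1] for i in range(n - 1))
        Z += np.exp(score)
    return Z

def chain_Z(theta, w):
    """O(n) message passing: msg[v] sums over prefixes ending with t_i = v."""
    msg = np.array([1.0, np.exp(theta[0])])
    for i in range(1, len(theta)):
        msg = np.array([
            msg[0] + msg[1],                                      # t_i = 0
            np.exp(theta[i]) * (msg[0] + msg[1] * np.exp(w[i-1]))  # t_i = 1
        ])
    return msg.sum()

theta = [0.5, -0.3, 0.2, 0.1]
w = [0.8, -0.5, 0.3]
```

The same message-passing idea extends from chains to arbitrary forests by passing messages from the leaves inward.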

[^istrail]: Istrail, Sorin. "Statistical mechanics, three-dimensionality and NP-completeness: I. Universality of intractability for the partition function of the Ising model across non-planar surfaces." In Proceedings of the thirty-second annual ACM symposium on Theory of computing, pp. 87-96. ACM, 2000.

[^weller]: Weller, Adrian. "Bethe and Related Pairwise Entropy Approximations."

[^pearl]: Pearl, Judea. "Probabilistic reasoning in intelligent systems: Networks of plausible inference." (1988).

[^schraudolph]: Schraudolph, Nicol N., and Dmitry Kamenetsky. "Efficient Exact Inference in Planar Ising Models." arXiv preprint arXiv:0810.4401 (2008).

[^wainwright]: Wainwright, Martin J., and Michael I. Jordan. "Graphical models, exponential families, and variational inference." Foundations and Trends® in Machine Learning 1, no. 1-2 (2008): 1-305.

[^poon]: Poon, Hoifung, and Pedro Domingos. "Sum-product networks: A new deep architecture." In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pp. 689-690. IEEE, 2011.

• An easy way to get rid of the probabilities-outside-$[0,1]$ problem in the continuous relaxation is to constrain the "conditional"/updated distribution to have $\mathbb{E}'[t_i] \in [0,1]$ for all $i$ (which is a convex constraint on the distribution's mean vector), and then minimize KL divergence accordingly.

The two obvious flaws are that the result of updating becomes ordering-dependent (though this may not be a problem in practice), and that the updated distribution will sometimes have means exactly at 0 or 1 while retaining nonzero variance, and it's not clear how to interpret that.

• Actually, on further thought, I think the best thing to use here is a log-bilinear distribution over the space of truth assignments. For these, it is easy to efficiently compute exact normalizing constants, conditional distributions, marginal distributions, and KL divergences; there is no impedance mismatch. KL divergence minimization here is still a convex minimization (in the natural parametrization of the exponential family).

The only shortcoming is that 0 is not a probability here (every truth assignment gets positive weight), so it won't let you e.g. say that $\Pr'(\phi_1) = 1$; but this can be remedied using a real or hyperreal approximation.
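One reason conditioning is painless in this family: observing $t_1 = 1$ just absorbs the pairwise weights involving $t_1$ into the singleton weights of the remaining variables. A sketch (my own parametrization, not the paper's notation), with a brute-force check only feasible at small $n$:

```python
# Sketch of the log-bilinear family p(t) ∝ exp(theta·t + ½ t·W·t) over
# t in {0,1}^n, with W symmetric and zero-diagonal so that
# ½ t·W·t = sum_{i<j} W_ij t_i t_j. Names are illustrative.
import itertools
import numpy as np

def brute_force_probs(theta, W):
    """Exact distribution by enumeration (for checking; small n only)."""
    n = len(theta)
    ts = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    w = np.exp(ts @ theta + 0.5 * np.einsum('ai,ij,aj->a', ts, W, ts))
    return ts, w / w.sum()

def condition_on_true(theta, W, k):
    """Condition on t_k = 1: row k of W is absorbed into the singleton
    weights of the remaining variables, and variable k is dropped."""
    keep = [i for i in range(len(theta)) if i != k]
    return theta[keep] + W[keep, k], W[np.ix_(keep, keep)]

theta = np.array([0.3, -0.2, 0.1])
W = np.array([[0.0, 0.7, -0.4],
              [0.7, 0.0, 0.2],
              [-0.4, 0.2, 0.0]])
theta_c, W_c = condition_on_true(theta, W, 0)  # parameters given t_1 = 1
```

The conditional distribution stays inside the same exponential family, which is why normalizing constants, marginals, and KL divergences for the update can reuse whatever machinery handles the prior.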