A tractable, interpretable formulation of approximate conditioning for pairwise-specified probability distributions over truth values

These results from my conversations with Charlie Steiner at the May 29–31 MIRI Workshop on Logical Uncertainty will primarily be of interest to people who've read section 2.4 of Paul Christiano's Non-Omniscience paper.

If we write a reasoner that keeps track of probabilities of a collection of sentences $\phi_1, \ldots, \phi_n$ (one that grows and shrinks as the reasoner explores), we need some way of tracking known relationships between the sentences. One way of doing this is to store the pairwise probability distributions, i.e. not only $\mathbb{P}(\phi_i)$ for all $i$ but also $\mathbb{P}(\phi_i \wedge \phi_j)$ for all $i, j$.

If we do this, a natural question to ask is: how can we update this data structure if we learn that, e.g., $\phi_1$ is true?

We'll refer to the updated probabilities as $\mathbb{P}(\cdot \mid \phi_1)$.

It's fairly reasonable for us to want to set $\mathbb{P}(\phi_i \mid \phi_1) = \mathbb{P}(\phi_i \wedge \phi_1)/\mathbb{P}(\phi_1)$; however, it's less clear what values to assign to $\mathbb{P}(\phi_i \wedge \phi_j \mid \phi_1)$, because we haven't stored $\mathbb{P}(\phi_i \wedge \phi_j \wedge \phi_1)$.

One option would be to find the maximum-entropy distribution over truth assignments to $\phi_1, \ldots, \phi_n$ under the constraint that the stored pairwise distributions are correct. This seems intractable for large $n$; however, in the spirit of locality, we could restrict our attention to the joint truth-value distribution of $(\phi_1, \phi_i, \phi_j)$. Maximizing its entropy is simple (it boils down to either convex optimization or solving a cubic), and yields a plausible candidate for $\mathbb{P}(\phi_i \wedge \phi_j \wedge \phi_1)$, from which we can derive $\mathbb{P}(\phi_i \wedge \phi_j \mid \phi_1)$. I'm not sure what global properties this has, for example whether it yields a positive-semidefinite matrix $\mathbb{P}(\phi_i \wedge \phi_j \mid \phi_1)$.
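
As a concrete sketch of the local computation, here is a Python version using SciPy's generic SLSQP solver rather than the convex-optimization or cubic-solving shortcuts mentioned above; the helper name and the numeric example are mine, not from the paper.

```python
import numpy as np
from itertools import product
from scipy.optimize import minimize

def maxent_triple(p1, pi, pj, p1i, p1j, pij):
    """Maximum-entropy joint distribution over the 8 truth assignments of
    (phi_1, phi_i, phi_j), constrained to match the stored single and
    pairwise probabilities. Hypothetical helper, not from the paper."""
    # The 8 candidate truth assignments (t1, ti, tj).
    outcomes = np.array(list(product([0, 1], repeat=3)), dtype=float)
    # Each row of A picks out the outcomes contributing to one stored probability.
    A = np.vstack([
        np.ones(8),                       # total probability mass
        outcomes[:, 0],                   # P(phi_1)
        outcomes[:, 1],                   # P(phi_i)
        outcomes[:, 2],                   # P(phi_j)
        outcomes[:, 0] * outcomes[:, 1],  # P(phi_1 & phi_i)
        outcomes[:, 0] * outcomes[:, 2],  # P(phi_1 & phi_j)
        outcomes[:, 1] * outcomes[:, 2],  # P(phi_i & phi_j)
    ])
    b = np.array([1.0, p1, pi, pj, p1i, p1j, pij])

    def neg_entropy(p):
        q = np.clip(p, 1e-12, 1.0)
        return float(np.sum(q * np.log(q)))

    res = minimize(neg_entropy, np.full(8, 1 / 8), method='SLSQP',
                   bounds=[(0.0, 1.0)] * 8,
                   constraints={'type': 'eq', 'fun': lambda p: A @ p - b})
    return res.x

# Three pairwise-independent sentences, each with probability 1/2:
p = maxent_triple(0.5, 0.5, 0.5, 0.25, 0.25, 0.25)
p_111 = p[-1]       # candidate P(phi_1 & phi_i & phi_j); here 1/8
cond = p_111 / 0.5  # candidate P(phi_i & phi_j | phi_1)
```

For consistent inputs this is a small convex program; an inconsistent set of stored pairwise probabilities makes the constraint set empty, which the solver reports as a failure.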

A different option, as noted in section 2.4.2, is to observe that the matrix $\mathbb{P}(\phi_i \wedge \phi_j)$ must be positive semidefinite under any joint distribution for the truth values. This means we can consider a zero-mean multivariate normal distribution with this matrix as its covariance; then there's a closed-form expression for the Kullback–Leibler divergence of two such distributions, and this can be used to define a sort of conditional distribution, as is done in section 2.4.3.
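
The closed form referred to here is the standard KL divergence between Gaussians, specialized to zero means; a small Python sketch (the example matrix is mine):

```python
import numpy as np

def kl_zero_mean_gaussians(S0, S1):
    """KL(N(0, S0) || N(0, S1)) in nats, via the standard closed form:
    0.5 * (tr(S1^-1 S0) - n + ln det S1 - ln det S0)."""
    n = S0.shape[0]
    _, logdet0 = np.linalg.slogdet(S0)
    _, logdet1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(np.linalg.solve(S1, S0)) - n + logdet1 - logdet0)

# The pairwise matrix M_ij = P(phi_i & phi_j) plays the role of a covariance;
# its positive semidefiniteness is what makes this legitimate.
M = np.array([[0.5, 0.25],
              [0.25, 0.5]])
assert np.all(np.linalg.eigvalsh(M) >= -1e-12)  # PSD check
```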

However, as the paper remarks, this isn't a very familiar way of defining these updated probabilities. For example, it lacks the desirable property that $\mathbb{P}(\phi_i \mid \phi_1) = \mathbb{P}(\phi_i \wedge \phi_1)/\mathbb{P}(\phi_1)$.

Fortunately, there is a natural construction that combines these ideas: namely, if we consider the maximum-entropy distribution for the truth assignment vector $(\phi_1, \ldots, \phi_n)$ with the given moments $\mathbb{E}[\phi_i] = \mathbb{P}(\phi_i)$ and $\mathbb{E}[\phi_i \phi_j] = \mathbb{P}(\phi_i \wedge \phi_j)$, but relax the requirement that their values be in $\{0, 1\}$, then we find a multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$ with $\mu_i = \mathbb{P}(\phi_i)$ and $\Sigma_{ij} = \mathbb{P}(\phi_i \wedge \phi_j) - \mathbb{P}(\phi_i)\,\mathbb{P}(\phi_j)$. If we wish to update this distribution after observing $\phi_1 = 1$ by finding, among distributions with $\phi_1 = 1$, the one of highest entropy relative to it (i.e. of lowest KL divergence from it), as proposed in the paper, then we will get the multivariate normal conditional distribution $\mathcal{N}(\mu', \Sigma')$, where $\mu'_i = \mu_i + \frac{\Sigma_{i1}}{\Sigma_{11}}(1 - \mu_1)$ and $\Sigma'_{ij} = \Sigma_{ij} - \frac{\Sigma_{i1}\Sigma_{j1}}{\Sigma_{11}}$. In particular $\mu'_i = \mathbb{P}(\phi_i \wedge \phi_1)/\mathbb{P}(\phi_1)$, so this update does have the desirable property above.
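
This relaxed-conditioning update can be sketched in a few lines of Python; the function name and example numbers are mine, and the formulas are just the standard multivariate normal conditional.

```python
import numpy as np

def gaussian_condition(M, k=0, value=1.0):
    """Relaxed conditioning: treat the truth values as jointly Gaussian with
    mean mu_i = M[i, i] (= P(phi_i)) and covariance S = M - mu mu^T, then
    condition the Gaussian on coordinate k taking the given value."""
    mu = np.diag(M).copy()
    S = M - np.outer(mu, mu)
    rest = [i for i in range(len(mu)) if i != k]
    # Standard multivariate normal conditional mean and covariance.
    mu_c = mu[rest] + S[rest, k] / S[k, k] * (value - mu[k])
    S_c = S[np.ix_(rest, rest)] - np.outer(S[rest, k], S[rest, k]) / S[k, k]
    return mu_c, S_c

# Two sentences: P(phi_1) = 0.5, P(phi_2) = 0.4, P(phi_1 & phi_2) = 0.3.
M = np.array([[0.5, 0.3],
              [0.3, 0.4]])
mu_c, S_c = gaussian_condition(M, k=0, value=1.0)
print(mu_c[0])  # 0.6, matching P(phi_2 & phi_1) / P(phi_1) = 0.3 / 0.5
```

In this example the conditional mean recovers the usual conditional probability $0.6$, while the conditional second moment is $0.2 + 0.6^2 = 0.56 \neq 0.6$, illustrating the diagonal mismatch noted next.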

Note that this generally has $\mathbb{E}[\phi_i^2 \mid \phi_1] \neq \mathbb{E}[\phi_i \mid \phi_1]$, which is a mismatch: an honest distribution over truth values would satisfy $\phi_i^2 = \phi_i$. This is related to the fact that a conditional variance in a multivariate normal is never higher than the marginal variance, which is an undesirable feature for a distribution over truth values.

This is also related to other undesirable features; for example, if we condition on more than one sentence, we can arrive at conditional probabilities outside of $[0, 1]$. (For example, if 3 sentences have $\mathbb{P}(\phi_i) = 1/2$ and $\mathbb{P}(\phi_i \wedge \phi_j) = 1/8$ for $i \neq j$, then conditioning on $\phi_1 = \phi_2 = 1$ yields $\mathbb{P}(\phi_3 \mid \phi_1, \phi_2) = -1/2$; this makes sense because this prior is very confident that $\phi_1 + \phi_2 + \phi_3 = 3/2$, with standard deviation $0$.)
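
One way this failure mode can be checked directly: take three sentences with $\mathbb{P}(\phi_i) = 1/2$ and $\mathbb{P}(\phi_i \wedge \phi_j) = 1/8$ for $i \neq j$ (numbers chosen so that the sum of the three truth values is degenerate under the Gaussian relaxation), and apply the standard normal conditional formula in Python:

```python
import numpy as np

# Three sentences with P(phi_i) = 1/2 and P(phi_i & phi_j) = 1/8 for i != j.
M = np.full((3, 3), 0.125) + 0.375 * np.eye(3)
mu = np.diag(M).copy()
S = M - np.outer(mu, mu)

# Under this Gaussian, the sum phi_1 + phi_2 + phi_3 has variance 0,
# so the prior is certain the sum equals 3/2:
print(np.ones(3) @ S @ np.ones(3))  # 0.0

# Condition on phi_1 = phi_2 = 1 via the normal conditional formula.
A, b = [0, 1], [2]
mu_c = mu[b] + S[np.ix_(b, A)] @ np.linalg.solve(S[np.ix_(A, A)],
                                                 np.ones(2) - mu[A])
print(mu_c)  # [-0.5]: a "probability" outside [0, 1]
```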

Intermediate relaxations that lack these particular shortcomings are possible, such as the ones that restrict the relaxed truth values to the sphere $\sum_i (\phi_i - \frac{1}{2})^2 = \frac{n}{4}$ or the ball $\sum_i (\phi_i - \frac{1}{2})^2 \leq \frac{n}{4}$, both of which contain every actual 0/1 truth assignment. Then the maximum-entropy distribution, similarly to a multivariate normal distribution, has quadratic log-density, though the Hessian of the quadratic may have nonnegative eigenvalues (unlike in the normal case). In the spherical case, this is known as a Fisher–Bingham distribution.

Both of these relaxations seem difficult to work with, e.g. to compute normalizing constants for; furthermore, I don't think the analogous updating process will share the desirable property that $\mathbb{P}(\phi_i \mid \phi_1) = \mathbb{P}(\phi_i \wedge \phi_1)/\mathbb{P}(\phi_1)$. However, the fact that these distributions allow updating by relaxed conditioning, keep (fully conditioned) truth values between 0 and 1, and have reasonable (at least, possibly-increasing) behavior for conditional variances makes them seem potentially appealing.
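
To illustrate the normalizing-constant difficulty: even for a quadratic log-density on the sphere, one seems to need numerical integration. A naive Monte Carlo sketch in Python, assuming the spherical relaxation is the sphere through the 0/1 truth assignments (center $(\frac{1}{2}, \ldots, \frac{1}{2})$, radius $\frac{\sqrt{n}}{2}$); the coefficients $a$ and $B$ below are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
# Arbitrary coefficients for an unnormalized density exp(a.x + x.Bx).
a = np.array([0.3, -0.2, 0.1])
B = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, -0.4],
              [0.0, -0.4, 0.0]])

# Uniform samples from the sphere through the 0/1 truth assignments:
# normalize Gaussian draws, then rescale and recenter.
z = rng.normal(size=(100_000, n))
x = 0.5 + (np.sqrt(n) / 2) * z / np.linalg.norm(z, axis=1, keepdims=True)

# Monte Carlo estimate of the normalizing constant, relative to the
# uniform distribution on the sphere.
Z = np.exp(x @ a + np.einsum('ij,jk,ik->i', x, B, x)).mean()
```

This converges slowly and scales poorly with $n$, which is part of why these relaxations look hard to use in practice.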