Our task in this post will be to develop the basic theory and notation for inframeasures and sa-measures. The proofs and concepts require some topology and functional analysis. We assume the reader is familiar with topology and linear algebra but not functional analysis, so we’ll explain the functional analysis concepts in more detail. If you wish to read through these posts, PM me to get a link to MIRIxDiscord; we’ll be doing a group readthrough where I or Vanessa can answer questions. Here’s the previous post and here are the proof sections. Beware, the proof sections are hard.

Notation Reference

Feel free to skip this segment and refer back to it when needed. Duplicate the tab, keep one of them on this section, and you can look up notation here.

X,Y: some compact metric space. Points in this are denoted by x or y.

d: Some distance metric. The usual one is the KR-metric between signed measures, defined as $d(m,m') := \sup_{f_{lip}} \left| \int_X f_{lip}(x)\,dm - \int_X f_{lip}(x)\,dm' \right|$, where $f_{lip}$ is a function X→[−1,1] that has a Lipschitz constant of 1 or less. In other words, what’s the biggest distance between the values that the measures output when you feed in a function that’s limited in its ability to distinguish measures that are close to each other due to having a small Lipschitz constant?

M±(X): The Banach space of finite signed measures over X equipped with the KR-metric/norm, defined as above (the norm is derived from the metric by assessing the distance between the signed measure and the 0 measure). Elements are denoted by m. By Jordan decomposition, we can uniquely split m into $m^+ + m^-$, where the former is all-positive and the latter is all-negative.

C(X),C(X,[0,1]): The Banach space of continuous functions X→R. The latter one is the space of continuous functions bounded in [0,1]. Elements of C(X,[0,1]) are typically denoted by f.

m(f): We can interpret signed measures as continuous linear functionals on C(X). This is given by $\int_X f(x)\,dm$. If m were actually a probability distribution μ, this would just be $E_\mu(f)$. They’re generalized expectations.

b: used to refer to the number component of an a-measure or sa-measure.

Msa(X): The closed convex cone of sa-measures. An sa-measure is a pair (m,b) where $b + m^-(1) \geq 0$. Elements of this (sa-measures) are denoted by M.

f+: A positive functional. A continuous linear functional that’s nonnegative for all sa-measures.

B: A set of sa-measures.

EB(f): The expectation of a function f relative to a set of sa-measures. Defined as inf(m,b)∈B(m(f)+b).

Bmin,Buc: The set of minimal points of B (points that can’t be written as a different point in B plus a nonzero sa-measure), and the upper completion of B (B+Msa(X)), respectively.

Ma(X): The closed convex cone of a-measures. An a-measure is a pair (m,b) where m is a measure (no negative component) and b≥0. It can also be written as (λμ,b) where λ≥0, and μ is a probability distribution.

λ: Given an a-measure, its λ value is the λ from writing the a-measure as (λμ,b). At the end, it’s used for lambda-notation to describe complicated functions. Context distinguishes.

λ⊙: Either the minimal upper bound on the λ values of the minimal points of a set B, or the Lipschitz constant of a function h (there’s a close link between the two).

$\overline{B}$, c.h(B): The closure and convex hull of a set, respectively.

H: An infradistribution. A set of sa-measures fulfilling the properties of nonemptiness, closure, convexity, upper-completeness, positive-minimals, (weak)-bounded minimals, and normalization.

□X: The set of infradistributions over X. □bX is the set of bounded infradistributions over X.

ΔX: The space of probability distributions over some set X.

h: The function induced by an H that goes f↦EH(f). Or, just a function C(X,[0,1])→[0,1] that’s concave, monotone, uniformly continuous, and normalized, there’s a link with infradistributions.

g,g∗: If you’re seeing it in the context of a pushforward, g is a continuous function X→Y, and g∗ is the induced function □X→□Y. If you’re seeing g in the context of updating, it’s a continuous function in C(X,[0,1]).

ζ: Used to denote probability distributions used for mixing stuff together, a probability distribution over the natural numbers or a finite subset of them in all cases. $\zeta_i$ is the probability on the number i.

i: The index variable for mixing, like indexing infradistributions or points or functions.

EζHi: A mix of sets, defined as the set of every point that can be constructed by selecting a point from each Hi and mixing according to ζ.

L: A function in C(X,[0,1]), thought of as the indicator function for a fuzzy set, that we use for updating.

f★Lg: The function made by gluing f and g together via L, defined as (f★Lg)(x):=L(x)f(x)+(1−L(x))g(x).

supp(f): The support of a function f, the set of x where f(x)>0.

1E: The indicator function that’s 1 on set E, 0 everywhere else.

PgH(L): The probability of L relative to the function g according to the infradistribution H, it’s defined as EH(1★Lg)−EH(0★Lg)

H|gL: The infradistribution H updated on the fuzzy set L relative to the function g.

K: An infrakernel, a function fulfilling some continuity properties, of type signature X→□Y.

Basic concepts

Time to start laying our mathematical groundwork, like the spaces we’ll be working in.

X is some compact metric space, equipped with the Borel σ-algebra (which, in this case, is the same as the Baire σ-algebra). We could probably generalize this further to complete metric spaces, but things get significantly trickier (one of many directions for future research), and compact metric spaces are quite well-behaved.

Concrete examples of compact metric spaces include the space of infinite bitstrings, the color space for a mantis shrimp, the surface of a sphere, a set of finitely many points, the unit interval, the space of probability distributions over a compact metric space, and countable products of compact metric spaces.

Let’s recap some functional analysis terminology for those seeing it for the first time, along with a bit of the notation we’re using. You can skip this part if you already know it. Vector spaces in full generality may lack much of the nice structure present in Rn that’s used in linear algebra. Going from the strongest and most useful structure to the weakest, there’s a chain of implication that goes inner product, norm, metric, topology. If you have an inner product, you can get a norm. If you have a norm, you can get a metric from that via d(x,y)=||x−y||, and if you have a metric, you can get a topology from that (with a basis of open balls centered at points). The structure must be imposed on the vector space, and there may be some freedom in doing so, like how Rn can have the L1, L2, or L∞ norm.

A Banach space is a vector space equipped with a norm (a notion of size for vectors), that’s also closed under taking limits, just like R is.

The term “functional” is used for a function to R. So, a continuous linear functional on a vector space V is a function that’s: linear, continuous, and has type signature V→R.

The term “dual space of V” is used for the vector space of continuous linear functionals V→R.

The space C(X) is the Banach space of continuous functions X→R. We’ll also use C(X,[0,1]) to denote the subset that just consists of continuous functions bounded in [0,1]. The dual space of C(X) is M±(X), the Banach space of finite signed measures over X.

Moving on from the functional analysis terminology, let’s consider finite signed measures m, elements of M±(X). A signed measure m can be uniquely split into a positive part and a negative part, $m^+ + m^-$, by the Jordan Decomposition Theorem. The “finite” part just means that $m^+$ doesn’t assign any set ∞ measure and $m^-$ doesn’t assign any set −∞ measure.

Now, we said that the space of finite signed measures was the dual space of C(X) (continuous functions X→R). So… how does a signed measure correspond to a continuous linear functional over C(X)? Well, that corresponds to $m(f) := \int_X f(x)\,dm$.

If m was a probability distribution μ, then μ(f) would be Eμ(f), so this is just like taking the expected value, but generalizing to negative regions in m. We’ll be using this notation m(f) a whole lot. Because finite signed measures perfectly match up with continuous linear functionals C(X)→R, we can toggle back and forth between whichever view is the most convenient in the moment, viewing continuous linear functionals on C(X) as finite signed measures, and vice-versa.

The ambient vector space we work in is M±(X)⊕R, whose elements are pairs of a signed measure and a number. ⊕ is the direct sum, which is basically Cartesian product for vector spaces. The direct sum of Banach spaces is a Banach space, with the norm defined in the obvious way as ||(m,b)||=||m||+|b|.

We should take a moment to talk about what norm/metric we’re using. The norm/metric we’re using on M±(X) is the KR(Kantorovich-Rubinstein)-norm/metric.

Definition 1: KR-metric

The metric defined by $d(m,m') := \sup_{f_{lip}} |m(f_{lip}) - m'(f_{lip})|$, where $f_{lip}$ is a continuous function X→[−1,1] with a Lipschitz constant of 1 or less. The KR-norm is ||m||=d(m,0).

But why are we using this unfamiliar KR-metric, instead of the trusty old total variation distance? Well, the KR-metric, since it can only query the measure with functions that aren’t too steep or big, says that two distributions are close if, y’know, they’re close in the intuitive sense. If we have a bunch of Dirac-delta distributions at 0.9, 0.99, 0.999..., then according to the KR-metric, they’d limit to a Dirac-delta distribution at 1. According to total variation distance, all these distributions are at distance 2 from each other and don’t converge at all. Similarly, if we’ve got two probability distributions over histories for an environment we’re in, and they behave very similarly and then start diverging after the gazillionth timestep, the KR-metric would go “hm, those two distributions are very close to each other”, while total variation distance says they’re very different. Also, if we’ve got finitely many points at distance 1 from each other, the KR-metric and total variation distance match up with each other (up to a constant). But total variation distance is a bit too restrictive for the continuous case.
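The Dirac-delta example can be checked by hand. Here is a minimal sketch, assuming we only compare Dirac deltas on the real line, where the supremum in the KR-metric has the closed form min(|x−y|, 2). The function names are made up for illustration.

```python
def kr_distance_diracs(x: float, y: float) -> float:
    """KR distance between Dirac deltas at x and y on the real line.

    The supremum of |f(x) - f(y)| over 1-Lipschitz f with values in [-1, 1]
    is min(|x - y|, 2): the Lipschitz bound caps it at |x - y| and the
    range bound caps it at 2.
    """
    return min(abs(x - y), 2.0)


def tv_distance_diracs(x: float, y: float) -> float:
    """Total variation distance, in the convention where two probability
    distributions with disjoint support are at distance 2."""
    return 0.0 if x == y else 2.0


points = [0.9, 0.99, 0.999, 0.9999]
kr_to_limit = [kr_distance_diracs(p, 1.0) for p in points]
tv_to_limit = [tv_distance_diracs(p, 1.0) for p in points]

print(kr_to_limit)  # shrinks toward 0 as the points approach 1
print(tv_to_limit)  # stays at 2 no matter how close the points get
```

The KR distances to the delta at 1 shrink along the sequence, while the total variation distances never move.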

There’s a sense in which convergence in total variation distance is too strict because it requires a “perfect match” to exist in your hypothesis space, while convergence in KR-distance is just right for nonrealizability because, instead of requiring a “perfect match”, it just requires that you get sufficiently close. Instead of getting accurate predictions for the rest of forever, it’s a requirement that’s something more like “you’ll be accurate for the next zillion timesteps, and the time horizon where you start being inaccurate gets further and further away over time”. You can draw an analogy to how utility functions with the time discount don’t care that much about what happens at very late times.

Going with the KR-metric means we’ve got very nice dual spaces and compactness properties, while with total variation distance, Wikipedia doesn’t even know what the dual space is.

So, tl;dr, the KR-metric is a much better choice for our setting, and we’re working in M±(X)⊕R as our vector space, which is equipped with the KR-norm and is closed under limits.

Definition 2: Sa-Measure

A point M∈M±(X)⊕R, which, when written as a pair of a signed measure and a number (m,b), has $b + m^-(1) \geq 0$. The set of sa-measures is denoted by Msa(X).

Definition 3: A-Measure

A point M∈M±(X)⊕R, which, when written as a pair of a signed measure and a number (m,b), has m as a measure, and b≥0. The set of a-measures is denoted by Ma(X).

Note that Ma(X) and Msa(X) are both closed convex cones. A closed convex cone is a subset of a vector space that: Is closed under multiplication by any a≥0, is closed under addition, and is closed under limits. For visual intuition, imagine a literal cone with its point at 0 in R3 that’s heading off in some direction, and see how it fulfills those 3 properties.
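To make Definitions 2 and 3 concrete, here is a small sketch that assumes X is a finite set, so a signed measure is just a list of weights, one per point. The representation and helper names are assumptions for illustration only.

```python
from typing import List, Tuple

# Hypothetical finite-X representation: (weights of the signed measure, b).
Point = Tuple[List[float], float]


def negative_mass(m: List[float]) -> float:
    """m^-(1): the total mass of the negative part (a number <= 0)."""
    return sum(w for w in m if w < 0)


def is_sa_measure(M: Point, tol: float = 1e-12) -> bool:
    """Definition 2: b + m^-(1) >= 0."""
    m, b = M
    return b + negative_mass(m) >= -tol


def is_a_measure(M: Point, tol: float = 1e-12) -> bool:
    """Definition 3: m has no negative part and b >= 0."""
    m, b = M
    return all(w >= -tol for w in m) and b >= -tol


def add(M1: Point, M2: Point) -> Point:
    return ([x + y for x, y in zip(M1[0], M2[0])], M1[1] + M2[1])


def scale(a: float, M: Point) -> Point:
    return ([a * w for w in M[0]], a * M[1])


M1 = ([0.5, -0.2], 0.3)   # sa-measure, but not an a-measure
M2 = ([-0.1, 0.4], 0.1)   # sa-measure sitting on the boundary of the cone
print(is_sa_measure(M1), is_a_measure(M1))
print(is_sa_measure(add(M1, M2)), is_sa_measure(scale(2.0, M1)))
```

Sums and nonnegative multiples of the example points stay inside the cone, which is the closure-under-addition-and-scaling part of being a convex cone.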

Basic Inframeasure Conditions

Before proceeding further, we should mention that Theorems 1, 2, and 3 are fairly elementary and have probably been proved in more generality in some different paper on convex analysis. We just call them theorems because they’re important, not necessarily original. Past that, things get more novel. Sets of distributions instead of distributions have been considered before under the name “Imprecise Probability”, as have nonlinear expectations and some analogues to probability theory; Shige Peng wrote a book on the latter. We found out about this after coming up with it independently. The main innovations that have not been found elsewhere are augmenting the sets of probability distributions with extra data (i.e., our sa-measures) to get a dynamically consistent update rule, and how to deal with environments/link the setting to reinforcement learning. Let’s move on.

Let B be some arbitrary set of sa-measures. We’re obviously nowhere near calling it an infradistribution, because we haven’t imposed any properties on it. And different B may have the same behavior, we’re nowhere near our second desideratum of collapsing equivalence classes of B’s with the same behavior. Well, nonemptiness should be a fairly obvious property to add.

Condition 1: Nonemptiness:B≠∅

From here, let’s see how we can enlarge B without affecting its behavior. But hang on, what do we even mean by “behavior”??

Definition 4: Expectation w.r.t. a Set of Sa-Measures

$E_B(f) := \inf_{(m,b)\in B}(m(f)+b)$, where f∈C(X,[0,1]) and B is nonempty.

This is what we mean by “behavior”, all these values should be left unchanged, regardless of f.
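As a sanity check on Definition 4, here is a minimal sketch for a finite X and a finite set B, where the integral becomes a weighted sum and the inf becomes a min. The list-based representation is an assumption made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical finite-X representation: a signed measure is a list of
# weights (one per point of X) and an sa-measure is a pair (m, b).
SaMeasure = Tuple[List[float], float]


def pair(m: Sequence[float], f: Sequence[float]) -> float:
    """m(f): the integral of f against m, a weighted sum when X is finite."""
    return sum(w * v for w, v in zip(m, f))


def expectation(B: Sequence[SaMeasure], f: Sequence[float]) -> float:
    """E_B(f) = inf over (m, b) in B of m(f) + b (a min, since B is finite)."""
    if not B:
        raise ValueError("B must be nonempty")
    return min(pair(m, f) + b for m, b in B)


B = [([0.5, 0.5], 0.0), ([0.2, 0.3], 0.1)]
print(expectation(B, [0.0, 0.0]))  # worst case for the constant-0 function
print(expectation(B, [1.0, 1.0]))  # worst case for the constant-1 function
print(expectation(B, [1.0, 0.0]))  # worst case for an indicator function
```

Different functions can be minimized by different points of B, which is the sense in which the set is queried by worst case.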

Proposition 1:If f∈C(X,[0,1]) then f+:(m,b)↦m(f)+b is a positive functional for Msa(X).

A positive functional for Msa(X) is a continuous linear function M±(X)⊕R→R that is nonnegative everywhere on Msa(X).

This suggests two more conditions besides Nonemptiness.

Condition 2: Closure: $B=\overline{B}$

Condition 3: Convexity:B=c.h(B)

Why can we impose closure and convexity? Taking the closure of B wouldn’t affect any expectation values, it’d only swap inf for min in some cases because (m,b)↦m(f)+b is continuous. Also, since everything we’re querying our set B with is inf of a linear functional by Proposition 1, we can take the convex hull without changing anything. So, swapping B out for its closed convex hull, no expectation values change at all.

But wait, we aren’t querying B with all linear functionals, we’re only querying it with positive functionals that are constructed from a f in C(X,[0,1]). Or does this class of positive functionals go further than we think? Yes, it does, actually.

Theorem 1, Functionals To Functions:Every positive functional on Msa(X) can be written as (m,b)↦c(m(f)+b), where c≥0, and f∈C(X,[0,1])

Nice! We actually are querying our set with all positive functionals, because we’ve pretty much got everything with just f∈C(X,[0,1]), and everything else is just a scalar multiple of that.

Upper Completion and Minimal Points

If you have a point M∈B, and some other point M∗ that’s an sa-measure, we might as well add M+M∗ to B. Why? Well, given some positive functional f+ (and everything we’re querying our set B with is a positive functional by Proposition 1), we have:

f+(M+M∗)=f+(M)+f+(M∗)≥f+(M)

By linearity and positive functionals being nonnegative on sa-measures, your new point M+M∗ has equal or greater value than M, so when we do infM∈Bf+(M), the addition of the new point didn’t change anything at all, regardless of which positive functional/continuous function (by Theorem 1) we’re using. So then, let’s add in all the points like this! It’s free. This would be done via Minkowski sum.

B+Msa(X)={M|M=MB+M∗,MB∈B,M∗∈Msa(X)}

Definition 5: Upper Completion

The upper completion of a set B, Buc, is Buc:=B+Msa(X)

Condition 4: Upper Completeness:B=B+Msa(X)
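The “it’s free” claim can be checked numerically. This is a rough sketch on a two-point X, with a single hand-picked sa-measure standing in for the whole cone and a coarse grid standing in for C(X,[0,1]); all names and numbers are illustrative assumptions.

```python
from itertools import product


def pair(m, f):
    """m(f) for a finite X: a weighted sum."""
    return sum(w * v for w, v in zip(m, f))


def expectation(B, f):
    """E_B(f) = min over (m, b) in B of m(f) + b."""
    return min(pair(m, f) + b for m, b in B)


def add(M1, M2):
    return ([x + y for x, y in zip(M1[0], M2[0])], M1[1] + M2[1])


B = [([0.5, 0.5], 0.0), ([0.2, 0.3], 0.1)]
M_star = ([-0.1, 0.3], 0.1)  # an sa-measure: b + m^-(1) = 0.1 - 0.1 = 0

# Add M + M_star for every M in B (one slice of the upper completion).
extended = B + [add(M, M_star) for M in B]

# Compare expectations on a grid of functions f: X -> [0, 1].
grid = [i / 4 for i in range(5)]
max_change = max(
    abs(expectation(extended, f) - expectation(B, f))
    for f in product(grid, repeat=2)
)
print(max_change)
```

On this grid the extra points never undercut the original ones, in line with the inequality f+(M+M∗)≥f+(M).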

Ok, so we add in all those points. Since we’re adding two nonempty convex sets, the result is also convex. As for closure...

Lemma 2:The upper completion of a closed set of sa-measures is closed.

However, B+Msa(X) isn’t quite what we wanted. Maybe there’s more points we could add! We want to add in every sa-measure we possibly can to our set as long as it doesn’t affect the essential “behavior”/worst-case values. So, we should be able to add in a point if every positive functional/continuous function (Proposition 1 and Theorem 1) goes “the value of the point you’re looking at is undershot by this preexisting point over here”. This more inclusive notion of adding points to make B as big as possible (adding any more points would start affecting the “behavior” of our set) would be:

Add a point (m′,b′) to B if, for all f in C(X,[0,1]), there’s a (m,b) in B where m′(f)+b′≥m(f)+b

Actually, this gets us nothing over just taking the upper completion/adding Msa(X)! Check out the next result.

Proposition 2: For closed convex nonempty B,

B+Msa(X)={M|∀f+∃M′∈B:f+(M)≥f+(M′)}

Combining Proposition 2 and Theorem 1, our notion of upper completion is exactly the same as “add all the points you possibly can that don’t affect the inf(m,b)∈B(m(f)+b) value for any f”.

Along with the notion of the upper completion comes the notion of a minimal point.

Definition 6: Minimal Point

A minimal point of a closed nonempty set of sa-measures B is a point M∈B where, if M=MB+M∗, and MB∈B, and M∗∈Msa(X) then M∗=0. The set of minimal points is denoted Bmin

So, minimal points can’t be written as a different point in the same set plus a nonzero sa-measure. It’s something that can’t possibly have been added by the upper-completion if it wasn’t there originally. We’ll show a picture of the upper completion and minimal points (these two notions generalize to any closed subset of any closed cone), to make things more concrete.

Theorem 2, Minimal Decomposition:Given a nonempty closed set of sa-measures B, the set of minimal points Bmin is nonempty and all points in B are above a minimal point.

This means that we can take any point M∈B and decompose it into Mmin+M∗, where Mmin∈Bmin, and M∗ is an sa-measure. We use this a whole lot in proofs. The proof of this uses the axiom of choice in the form of Zorn’s Lemma, but separability may let us find some way to dodge the use of the full axiom of choice.

Proposition 3:Given a f∈C(X,[0,1]), and a B that is nonempty closed, inf(m,b)∈B(m(f)+b)=inf(m,b)∈Bmin(m(f)+b)

So when evaluating EB(f), we can just minimize within the set of minimal points. Minimal points are the only thing that matters for the “behavior” of B.
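Here is a small sketch of Definition 6 and Proposition 3 for a finite set B on a two-point X. It assumes the same list-based representation of sa-measures as a pair (weights, b); the helper names are hypothetical.

```python
from itertools import product


def pair(m, f):
    return sum(w * v for w, v in zip(m, f))


def expectation(B, f):
    return min(pair(m, f) + b for m, b in B)


def is_sa_measure(M, tol=1e-12):
    """b + m^-(1) >= 0, where m^-(1) sums the negative weights."""
    m, b = M
    return b + sum(w for w in m if w < 0) >= -tol


def is_zero(M, tol=1e-12):
    m, b = M
    return all(abs(w) <= tol for w in m) and abs(b) <= tol


def subtract(M1, M2):
    return ([x - y for x, y in zip(M1[0], M2[0])], M1[1] - M2[1])


def minimal_points(B):
    """Points of B that are not (another point of B) + (nonzero sa-measure)."""
    result = []
    for M in B:
        diffs = (subtract(M, other) for other in B)
        dominated = any(is_sa_measure(d) and not is_zero(d) for d in diffs)
        if not dominated:
            result.append(M)
    return result


B = [
    ([0.5, 0.5], 0.0),
    ([0.2, 0.3], 0.1),
    ([0.6, 0.5], 0.2),  # first point plus the sa-measure ([0.1, 0], 0.2)
    ([0.1, 0.6], 0.2),  # second point plus the sa-measure ([-0.1, 0.3], 0.1)
]
B_min = minimal_points(B)
print(B_min)

grid = [i / 4 for i in range(5)]
max_gap = max(
    abs(expectation(B, f) - expectation(B_min, f))
    for f in product(grid, repeat=2)
)
print(max_gap)
```

Only the first two points survive as minimal, and minimizing over them alone gives the same expectations on the grid.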

Proposition 4: Given a nonempty closed convex B, Bmin=(Buc)min and (Bmin)uc=Buc

The set of minimal points is left unchanged when you take the upper completion, and taking the upper completion of the set of minimal points equals taking the upper completion of B. This is fairly intuitive from the picture.

Theorem 3, Full Characterization:If the nonempty closed convex sets A and B have Amin≠Bmin, then there is some f∈C(X,[0,1]) where EA(f)≠EB(f)

Corollary 1:If two nonempty closed convex upper-complete sets A and B are different, then there is some f∈C(X,[0,1]) where EA(f)≠EB(f)

Looking back at our second desideratum, it says “Our notion of a hypothesis in this setting should collapse “secretly equivalent” sets, such that any two distinct hypotheses behave differently in some relevant aspect. This will require formalizing what it means for two sets to be “meaningfully different”, finding a canonical form for an equivalence class of sets that “behave the same in all relevant ways”, and then proving some theorem that says we got everything.”

And we did exactly that. Also, Theorem 3 and the other results justify the view of the minimal points as the “unique identifier” of a closed convex set. If two closed convex sets have the same minimal-point ID, then taking the upper completion gets you the same set, and they behave the same w.r.t all the queries we can throw at them. If two sets have a different minimal-point ID, then when we take the upper completion, they’re different, and there’s some query that distinguishes them.

Minimal Point Conditions

Well, this is the basics. But we can impose some more conditions. We don’t really like these signed measures, it’d be nice to work exclusively with positive measures, if possible. The minimal points are all we really need to care about by Proposition 3, so let’s require that they’re all in the smaller cone Ma(X), which has no negative-measure shenanigans afoot. Renormalization may fail if there’s minimal points that have negative parts, which is analogous to how you can renormalize a positive measure back to a probability distribution, but a signed measure may not be able to be renormalized back to 1.

Condition 5: Minimal-positivity:Bmin⊆Ma(X)

Further, things can get a bit tricky in various places if the minimal points don’t lie in some compact set. Compactness arguments let you show that you don’t have to close your set after updating, and they show up a lot in our proofs of properties of belief functions. However, this next condition isn’t essential, just convenient, and it’s worthwhile looking at what happens when it’s dropped, for future research.

Condition 6a: Minimal-boundedness:There is a compact set C s.t. Bmin⊆C.

Proposition 5:Let μ denote an arbitrary probability distribution. If Bmin⊆Ma(X), then the condition “there is a λ⊙ where, ∀(λμ,b)∈Bmin:λ≤λ⊙” is equivalent to “there is a compact C s.t. Bmin⊆C”

We mostly use this formulation of Minimal-boundedness instead of the compact set requirement. We only have to bound the scale-terms on the minimal points and we have this property. Again, it’s not essential, but very convenient.

Is there a weakening of bounded-minimals? Yes there is. I haven’t figured out what it means for the set of minimal points (EDIT 8/28: got a clean iff result about what it means for minimal points while trying to prove something else), but it’s more mathematically essential, and I don’t think it can be dropped if some other post wants to go further than we did. It can’t be motivated at this point, we’ll have to wait until we get to Legendre-Fenchel Duality.

Condition 6b: Weak minimal-boundedness:f↦EB(f) is uniformly continuous.

Normalization

So, we have almost everything. Nonemptiness, closure, convexity, upper-completion, minimal-positivity, and minimal-boundedness/weak minimal-boundedness are our conditions so far.

However, there’s one more condition. What’s the analogue of renormalizing a measure back to 1 in this situation? Well, for standard probability distributions μ, Eμ(0)=0, and Eμ(1)=1. This can be cleanly ported over. We shall require that EB(0)=0, EB(1)=1, that’s our analogue of normalization. Unpacking what the expectation means, this corresponds to: There’s minimal points (λμ,b) where b is arbitrarily close to 0, and there’s some minimal point (λμ,b) where λ+b=1, and there’s no points with a lower λ+b value.

Condition 7: Normalization:EB(1)=1,EB(0)=0

Let’s recap all the conditions. We’ll be using H for something fulfilling all of the following conditions except maybe 6a.

1: Nonemptiness:H≠∅

2: Closure: $H=\overline{H}$

3: Convexity:H=c.h(H)

4: Upper Completeness:H=H+Msa(X)

5: Positive-Minimals:Hmin⊆Ma(X)

6a: Bounded-Minimals:∃λ⊙:(λμ,b)∈Hmin→λ≤λ⊙

6b: Weak Bounded-Minimals: The function f↦EH(f) is uniformly continuous.

7: Normalization: EH(1)=1, EH(0)=0

An inframeasure is a set of sa-measures that fulfills conditions 1-5 and 6b. An infradistribution H is a set of sa-measures that fulfills conditions 1-5, 6b, and 7. The “bounded” prefix refers to fulfilling condition 6a. The set of infradistributions is denoted as □X, the set of bounded infradistributions is denoted as □bX.

Now, how do we get normalization if it isn’t already present? Closure, Convexity, and Upper Completeness can be introduced by closure, convex hull, and upper completion, respectively. How do we turn an inframeasure into an infradistribution?

Well, just take every (m,b) in your set, and map it to: $\frac{1}{E_B(1)-E_B(0)}\,(m,\ b-E_B(0))$

This may seem a bit mysterious. This normalization can be thought of as analogous to rescaling a utility function to be in [0,1] via scale-and-shift.

What we’re doing first, is shifting everything down by EB(0), which is as much as we possibly can manage without making b go negative anywhere. The utility function analogue would be, if your expected utilities are (0.4,0.5,0.6), this is like shifting them down to (0,0.1,0.2).

The second thing we do is go “ok, what’s the lowest value at 1? Let’s scale that back up to 1”. Well, it’d be EB(1)−EB(0) (remember, we shifted first), so we multiply everything by $(E_B(1)-E_B(0))^{-1}$ to get our set of interest.

Hang on, what if there’s a divide-by-zero error? Well... yes, that can happen. In order for it to happen, EB(0)=EB(1). Does this correspond to any sensible condition which hopefully doesn’t happen often?

Proposition 6:EB(0)=EB(1) occurs iff there’s only one minimal a-measure, of the form (0,b).

Put another way, divide by zero errors occur exactly when Murphy is like “oh cool, no matter what function they pick, the worst thing I can do is give them b value, and then nothing happens so they can’t pick up any more value than that”, so nothing matters at all. This is exactly like Bayesian renormalization failing when you condition on a probability-0 event (note that the measure component is 0 before rescaling). You give up and cry because, in the worst case, nothing you do matters.

Proposition 7: Renormalizing a (bounded) inframeasure produces a (bounded) infradistribution, if renormalization doesn’t fail.
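The shift-then-scale recipe can be sketched for a finite set on a finite X. The code below is an illustration under that assumption, with a finite list of points standing in for the full set, and it raises an error in the EB(0)=EB(1) case.

```python
def pair(m, f):
    return sum(w * v for w, v in zip(m, f))


def expectation(B, f):
    return min(pair(m, f) + b for m, b in B)


def renormalize(B, tol=1e-12):
    """Map each (m, b) to (m, b - E_B(0)) / (E_B(1) - E_B(0))."""
    n = len(B[0][0])
    e0 = expectation(B, [0.0] * n)
    e1 = expectation(B, [1.0] * n)
    if abs(e1 - e0) <= tol:
        # The failure case: E_B(0) = E_B(1), so there is nothing to rescale.
        raise ZeroDivisionError("E_B(0) = E_B(1): renormalization fails")
    scale = 1.0 / (e1 - e0)
    return [([scale * w for w in m], scale * (b - e0)) for m, b in B]


B = [([0.4, 0.4], 0.2), ([0.2, 0.2], 0.5)]
H = renormalize(B)
print(expectation(H, [0.0, 0.0]))  # close to 0 after the shift
print(expectation(H, [1.0, 1.0]))  # close to 1 after the rescale
```

Here EB(0)=0.2 and EB(1)=0.9, so every point is shifted down by 0.2 and scaled by 1/0.7. A set like [([0, 0], 0.3)] triggers the error instead.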

And we’re done for now, we’ve made it up to infradistributions. Now, how can we analyze them?

Legendre-Fenchel Duality

There’s a powerful way of transforming an infradistribution to another form, to look at the same thing in two completely different mathematical contexts, and we can build up a sort of dictionary of what various different concepts are in two completely different settings, or develop concepts in one setting and figure out what they correspond to in the other setting.

An example of this sort of thing is Stone Duality, where you can represent a topological space as its lattice of open sets, to translate a huge array of concepts back and forth between topology and lattice theory and work in whichever setting is more convenient. And, working with special lattices that can’t always translate to topological spaces, you get locales and pointless topology! Duality theorems are highly fruitful.

This result will be inspired by the well-known fact that signed measures correspond to continuous linear functionals on C(X) via f↦m(f), and probability distributions, when translated over, have extra properties like monotonicity, being 1-Lipschitz, and μ(1)=1.

Theorem 4, LF-duality, Sets to Functionals:If H is an infradistribution/bounded infradistribution, then h:f↦EH(f) is concave, monotone, uniformly continuous/Lipschitz over C(X,[0,1]), h(0)=0,h(1)=1, and range(f)⊈[0,1]→h(f)=−∞

So, expectation w.r.t an infradistribution is concave (not linear, as probability distributions are) and if f≥f′ then EH(f)≥EH(f′) (monotonicity). Paired with normalization, this means every appropriate f has EH(f)∈[0,1].
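A quick numerical illustration of concavity and monotonicity, assuming a two-point X and a set H whose two listed a-measures stand in for its minimal points (so that h(f) is the smaller of the two values of f). The grid check is only a spot check, not a proof.

```python
from itertools import product


def pair(m, f):
    return sum(w * v for w, v in zip(m, f))


def expectation(B, f):
    return min(pair(m, f) + b for m, b in B)


# Two Dirac-delta a-measures with b = 0, so h(f) = min(f[0], f[1]).
H = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]


def h(f):
    return expectation(H, f)


def mix(p, f, g):
    return [p * a + (1 - p) * c for a, c in zip(f, g)]


grid = [i / 4 for i in range(5)]
functions = list(product(grid, repeat=2))

# Concavity: h(0.5 f + 0.5 g) >= 0.5 h(f) + 0.5 h(g).
concave = all(
    h(mix(0.5, f, g)) >= 0.5 * h(f) + 0.5 * h(g) - 1e-12
    for f in functions for g in functions
)
# Monotonicity: f >= g pointwise implies h(f) >= h(g).
monotone = all(
    h(f) >= h(g) - 1e-12
    for f in functions for g in functions
    if all(a >= c for a, c in zip(f, g))
)
# The inequality can be strict, so h is not linear.
gap = h(mix(0.5, [1.0, 0.0], [0.0, 1.0])) - (0.5 * h([1.0, 0.0]) + 0.5 * h([0.0, 1.0]))
print(concave, monotone, gap)
```

The strictly positive gap for the two indicator functions shows where this h departs from an ordinary linear expectation.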

You get concavity from convexity, the −∞ thing from upper-completeness, monotonicity matches up with “all minimal points are a-measures”, Lipschitz corresponds to “all minimal points have λ≤λ⊙”, uniform continuity corresponds to the weak-minimal-bound condition, and h(0)=0,h(1)=1 obviously corresponds to normalization. This is moderately suggestive, our conditions we’re imposing on the set side are manifesting as natural conditions on the concave functional we get from H.

Is there a reverse direction? How do we start with a h:C(X,[0,1])→[0,1] that fulfills the conditions, and get an infradistribution from that?

Theorem 5, LF-Duality, Functionals to Sets: If h is a function C(X)→R that is concave, monotone, uniformly-continuous/Lipschitz, h(0)=0,h(1)=1, and range(f)⊈[0,1]→h(f)=−∞, then it specifies an infradistribution/bounded infradistribution by: {(m,b)|b≥(h′)∗(m)}

Where h′ is the function given by h′(−f)=−h(f), and (h′)∗ is the convex conjugate of h′. Also, going from an infradistribution to an h and back recovers exactly the infradistribution, and going from an h to an infradistribution and back recovers exactly h.
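To see what this set looks like more concretely, the convex conjugate can be unpacked. This is a sketch, assuming the conjugate is taken with respect to the usual pairing (m,g)↦m(g) between M±(X) and C(X):

```latex
\begin{align*}
(h')^*(m) &= \sup_{g \in C(X)} \big( m(g) - h'(g) \big) \\
          &= \sup_{f \in C(X)} \big( m(-f) - h'(-f) \big) && \text{substituting } g = -f \\
          &= \sup_{f \in C(X)} \big( h(f) - m(f) \big) && \text{since } h'(-f) = -h(f)
\end{align*}
```

Under that assumption, b≥(h′)∗(m) holds exactly when m(f)+b≥h(f) for every f. Because h(f)=−∞ whenever the range of f isn’t in [0,1], only f∈C(X,[0,1]) impose a real constraint, so the set consists of the (m,b) whose functional f↦m(f)+b lies on or above h.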

Another name for the convex conjugate is the Legendre-Fenchel transformation, so that’s where we get the term Legendre-Fenchel Duality from. If you want, you can shorten it as LF-duality.

So, (bounded)-infradistributions are isomorphic to concave monotone normalized Lipschitz/uniformly continuous functions h:C(X,[0,1])→R. This is the LF-Duality. We can freely translate concepts back and forth between “sets of sa-measures that fulfill some conditions” and “concave functionals on C(X,[0,1]) that fulfill some conditions”.

In particular, actual probability distributions correspond to a: linear, monotone, 1-Lipschitz, normalized functional C(X,[0,1])→R. So, in the other half of the LF-duality, probability distributions and infradistributions are very nearly the same sort of thing, the only difference between them is that the former is linear and the latter is concave! This is the essential reason why so many probability theory concepts have analogues for infradistributions.

But… what does the LF-transformation actually do? Well, continuous linear functionals C(X)→R are equivalent to signed measures, and continuous linear functionals M±(X)→R (with our KR-norm) correspond to continuous functions over X. A hyperplane in M±(X)⊕R corresponds to a point in C(X)⊕R and hyperplanes in the latter correspond to points in the former. Points in H turn into hyperplanes above h, points on-or-below the graph of h turn into hyperplanes below H.

What this transform basically does, is take each suitable f, check its value w.r.t h, convert the (f,h(f)) pair into a hyperplane in M±(X)⊕R, and go “alright, whichever set h came from must be above this hyperplane”. Eventually all the hyperplanes are drawn and you’ve recovered your infradistribution as the region above the graph of the hyperplanes.

Viewing □X as suitable concave functionals on C(X,[0,1]), we now have a natural notion of distance between infradistributions analogous to total variation distance:

$d(H_1,H_2) := \sup_{f\in C(X,[0,1])} |E_{H_1}(f)-E_{H_2}(f)|$

At this point, the use of our uniform-continuity condition is clearer. A uniform limit of Lipschitz functions may not be Lipschitz. However, a uniform limit of uniformly continuous functions is uniformly continuous, so the space □X is complete (has all its limit points).
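For a feel of this distance, here is a rough sketch on a two-point X that replaces the supremum over all of C(X,[0,1]) with a maximum over a coarse grid, so in general it only gives a lower bound. The two example sets are listed by their minimal points and are assumptions for illustration.

```python
from itertools import product


def pair(m, f):
    return sum(w * v for w, v in zip(m, f))


def expectation(B, f):
    return min(pair(m, f) + b for m, b in B)


def grid_distance(H1, H2, steps=4):
    """Max of |E_H1(f) - E_H2(f)| over a grid of f: X -> [0, 1].

    A grid maximum can only underestimate the true supremum.
    """
    n = len(H1[0][0])
    grid = [i / steps for i in range(steps + 1)]
    return max(
        abs(expectation(H1, f) - expectation(H2, f))
        for f in product(grid, repeat=n)
    )


H1 = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]  # worst case over the two points
H2 = [([0.5, 0.5], 0.0)]                     # the uniform distribution
d = grid_distance(H1, H2)
print(d)
```

In this example the gap is |min(f1,f2) − (f1+f2)/2|, which is largest at the indicator functions.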

A lot of probability-theory concepts carry over to infradistributions. We have analogues of products, Markov kernels, pushforwards, semidirect products, expectations, probabilities, updating, and mixtures. These are most naturally defined in the concave-functional side of the duality first, and then you can conjecture how they work on sets, and you know you got the right set (modulo closure, convex hull, and upper completion) if the expectations w.r.t your set match up with the defining property in the concave functional picture. The post was getting long enough as is, so we won’t cover most of them in detail, and leave developing inframeasure theory more fully till later posts.

Pushforwards and Mixing

Let’s look at the first one, pushforwards of a (bounded) infradistribution via a continuous g:X→Y. The standard probability theory analogue of a pushforward is starting with a probability distribution over X, and going “what probability distribution over Y is generated by selecting a point from X according to its probability distribution and applying g?”

On the concave functional level, this is defined by: (g∗(h))(f):=h(f∘g)

Let’s take a guess as to what this is on the set level. The obvious candidate is: Take the (m,b) in H, do the usual pushforward of the measure via g to get a signed measure over Y, and keep the b term the same, getting a function g∗:Msa(X)→Msa(Y). If g is something that maps everything to the same point, our resulting measures will only be supported on one point, so we’d have to take upper-completion again in that case. But maybe if g is surjective we don’t need to do that?

Let g∗(H) be the set produced by applying g∗ to H and taking the upper completion.

Proposition 8: If f∈C(Y,[0,1]) and g:X→Y is continuous, then Eg∗(H)(f)=EH(f∘g)

Proposition 9: g∗(H) is a (bounded) inframeasure if H is, and the image of H doesn’t require upper completion if g is surjective.

Proposition 8 certifies that our unique defining (g∗(h))(f)=h(f∘g) property is fulfilled by doing this to the set of sa-measures, so we have the proper set-analogue of the pushforward. And proposition 9 says we were right, the only thing that may go wrong is upper completion. Generally, doing operations like this may not exactly make an inframeasure, but you’re only a closed convex hull and upper completion away from your canonical form as long as you check that the expectations match up with what you’d expect on the concave functional side.
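For finite spaces, Proposition 8 can be checked numerically. The following is a minimal sketch under stated assumptions, not the construction used in the proofs: X and Y are assumed to be finite index sets, an sa-measure is represented as a `(weights, b)` pair, and H is stood in for by a finite list of generating points (all helper names here are hypothetical). Taking the minimum over the generators is enough for expectations, since closed convex hull and upper completion leave them unchanged.

```python
def expectation(B, f):
    """E_B(f) = min over (m, b) in the finite list B of m(f) + b."""
    return min(sum(mx * fx for mx, fx in zip(m, f)) + b for m, b in B)

def pushforward_measure(m, g, size_y):
    """Push a measure on X = {0..len(m)-1} through g: X -> Y = {0..size_y-1}."""
    out = [0.0] * size_y
    for x, mass in enumerate(m):
        out[g[x]] += mass
    return out

def pushforward_set(B, g, size_y):
    """Apply g_* pointwise: push the measure forward, keep the b term the same."""
    return [(pushforward_measure(m, g, size_y), b) for m, b in B]

# X has 3 points, Y has 2 points; g sends 0 and 1 to 0, and 2 to 1.
g = [0, 0, 1]
H = [([0.5, 0.3, 0.2], 0.0), ([0.1, 0.1, 0.8], 0.0), ([0.2, 0.2, 0.2], 0.1)]

f = [0.25, 1.0]                          # a function on Y
f_after_g = [f[g[x]] for x in range(3)]  # the composite f∘g, a function on X

lhs = expectation(pushforward_set(H, g, 2), f)  # E_{g*(H)}(f)
rhs = expectation(H, f_after_g)                 # E_H(f∘g)
print(lhs, rhs)
```

The two printed numbers agree up to floating-point rounding, which is the set-level statement that the pushforward satisfies its defining property.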

Alright now, what about mixing? Like, mixing hypotheses to make a prior, according to some distribution ζ on the numbers. The concave functional analogue is (Eζhi)(f):=Eζ(hi(f))

This works just fine with no extra conditions if we have uniform continuity, but for Lipschitzness, we need an extra condition. Letting λ⊙i be the Lipschitz constant for the functional hi, we need ∑iζiλ⊙i<∞ in order for the result to be Lipschitz. The set version should just be mixing of sets.

Definition 8: Mixing Sets

Given a countable family of inframeasures Hi where i∈I⊆N, and a probability distribution ζ∈ΔI, then EζHi:={M|∃c∈∏iHi:∑iζic(i)=M}

Ie, EζHi is the set of sa-measures that can be made by choosing one sa-measure from each Hi and mixing them together w.r.t. ζ.

Try sketching out two sets on a piece of paper and figure out what the 50/50 mix of them would be. This corresponds to “Murphy can pick whatever they want from each set, but they’re constrained to play the points they selected according to the probability assigned to each Hi”.

We should note that for bounded inframeasures, letting λ⊙i be the bound on the λ value of the minimal points of Hi by minimal-boundedness, we want ∑iζiλ⊙i<∞ to preserve our minimal-bounded condition for the mix.

Proposition 10: $E_{E_\zeta H_i}(f) = E_\zeta(E_{H_i}(f))$

Proposition 11: A mixture of infradistributions is an infradistribution. If it’s a mixture of bounded infradistributions with minimal point λ bounds of λ⊙i, and ∑iζiλ⊙i<∞, then the mixture is a bounded infradistribution.
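Here is a small numerical sketch of Definition 8 and Proposition 10, assuming a two-point space, two sets, and finite lists of generating points standing in for the full sets (the helper names are hypothetical). It builds the mix by choosing one point from each set in every possible way, exactly as in “Murphy picks a point from each set”.

```python
from itertools import product

def expectation(B, f):
    """E_B(f) = min over (m, b) in the finite list B of m(f) + b."""
    return min(sum(mx * fx for mx, fx in zip(m, f)) + b for m, b in B)

def mix_sets(sets, zeta):
    """Mix finite sets: pick one point from each set, combine with weights zeta."""
    mixed = []
    for choice in product(*sets):
        n = len(choice[0][0])
        m = [sum(z * pt[0][x] for z, pt in zip(zeta, choice)) for x in range(n)]
        b = sum(z * pt[1] for z, pt in zip(zeta, choice))
        mixed.append((m, b))
    return mixed

H1 = [([1.0, 0.0], 0.0), ([0.5, 0.5], 0.0)]
H2 = [([0.0, 1.0], 0.0), ([0.2, 0.2], 0.6)]
zeta = [0.5, 0.5]
f = [0.3, 0.9]

lhs = expectation(mix_sets([H1, H2], zeta), f)                    # E_{E_zeta H_i}(f)
rhs = sum(z * expectation(H, f) for z, H in zip(zeta, [H1, H2]))  # E_zeta(E_{H_i}(f))
print(lhs, rhs)
```

The expectation of the mix matches the mix of the expectations because Murphy's worst choice in the mixed set is just the worst choice in each component set, weighted by ζ.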

Again, Proposition 10 certifies that we got the right notion on the set level, and Proposition 11 certifies that we don’t need to do any closure or upper completion cleanup. Now, how do mixtures interact with pushforwards?

Proposition 12: g∗(Eζ(Hi))=Eζ(g∗(Hi))

So, it doesn’t matter whether you mix before or after pushing your infradistribution through a continuous function. The proof of this is quite nontrivial if we were to do it on the set level because of exhaustive verification of each of the conditions, but now that we’ve shown that we have the right set-level analogue of mixing and pushforwards, we can just work entirely in the concave-functional picture and knock the proof out in a couple of lines, so our duality is doing useful work.

Updating

Let’s move on to updates. We should mention that we’ll be taking a two-step view of updating. First, there is chopping down the measure so that only the measure within the set we’re updating on remains, and then there’s renormalizing the measure back up to 1. Thinking of updating as a unitary process instead of separated into these two phases will confuse you.

First, what sorts of events can we update on? Well, we only have expectations of continuous functions, and in a classical setting, we can get probabilities from expectations (and then updates from that) by considering the expectation of an indicator function that’s 1 on a set and 0 everywhere else. Sadly, in this setting, we can only take the expectation of a continuous function. A sharp indicator function for an event will be discontinuous unless the set we’re updating on is clopen. Fortunately, the specific application of “conditioning on a finite history” (with discrete action and observation spaces) only involves conditioning on clopen sets, because the set of histories which have a certain finite history as a prefix are clopen. Similarly, if you’ve got a finite space of observations, observing “the true thing is in this subset of observations” is a clopen update.

This seems like a very harsh restriction, but we can generalize further to get something adequate for most applications. I had to rewrite this section, though, because the original motivation was too complicated and had too many moving parts, so I figured it’d be more sensible to motivate things from an entirely different direction.

Let’s call our clopen set that we’re updating on Z. Our task for the post-update infradistribution is to specify the expectation values of functions that are only defined on Z. And we only have expectation values for functions that are defined on all of X. So, we need some way to extend a function f that’s only defined on Z to be defined on all of X. The obvious way is to specify some single function g that tells you what happens outside of Z, and look at the expectation of 1Zf+(1−1Z)g , ie, “look at the function that’s f on Z, and g outside of Z”. Admittedly, the obvious way to do this is to have g=0, and that produces the best analogies to probability theory for infradistributions, and is what standard updates do. However, it’s worth looking at things in more generality and seeing what comes out.

So, our first stab (on the positive functional level) would be something like:

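$$h^g_Z(f) := h(1_Z f + (1-1_Z)g)$$

This is just “extend your function on Z to all of X by extending it with g”. But, lamentably, this isn’t normalized. This is more like a “raw update” without normalizing afterwards. But, we know how to renormalize: subtract off the value of the 0 function, and rescale by the gap between the values of the 1 function and the 0 function. So, attempt 2:

$$(h|^g_Z)(f) := \frac{h^g_Z(f) - h^g_Z(0)}{h^g_Z(1) - h^g_Z(0)}$$

That’s the renormalized form. As a sanity check, we should make sure that it recovers standard Bayesian updates. So, let’s say our infradistribution is actually just μ, a standard probability distribution with μ(Z)>0. Then, by linearity of expectation,

$$(\mu|^g_Z)(f) = \frac{\mu(1_Z f + (1-1_Z)g) - \mu((1-1_Z)g)}{\mu(1_Z + (1-1_Z)g) - \mu((1-1_Z)g)} = \frac{\mu(1_Z f)}{\mu(1_Z)} = E_{\mu|Z}(f)$$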

Ok, so this recovers standard updating for actual probability distributions. But why did we introduce that g? Why does it matter how we extend our function to outside the area we’re updating on? Just use g=0, we don’t care about what’s happening outside the area we’re updating on. Well, that’s exactly the problem. We actually do care about what’s happening outside of what we’re updating on. The whole reason dynamic inconsistency comes about for standard Bayesian updating on problems like counterfactual mugging is that past-you cares about what happens down both branches of the coinflip, but once Omega tells you what the coin comes up, you stop caring about what happens down the other branch of the coinflip and so you don’t pay up. And that’s why having g≠0 can permit you to get dynamic consistency, past-you and future-you don’t have to have mismatches in what you care about.

For a single probability distribution, the reason why g doesn’t really matter is because, for probability distributions, we can split into “expected value coming from Z” and “expected value coming from outside Z” and cleanly add them up. So, specifying how well you do outside Z just adds the same constant to everything, and then the shift part of scale-and-shift deletes it.

However, for infradistributions in general, you can’t split things up like that! Your choice of g affects how different f’s score relative to each other, so it matters quite a bit how you extend functions from Z to all of X.
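To see this concretely, here is a toy sketch, assuming a three-point space, Z = {0, 1}, and an infradistribution stood in for by two probability distributions with b = 0 (the numbers and helper names are made up for illustration). It uses the renormalized clopen update: subtract the value of the 0 function, divide by the gap between the 1 function and the 0 function.

```python
def expectation(B, f):
    """E_B(f) = min over (m, b) in the finite list B of m(f) + b."""
    return min(sum(mx * fx for mx, fx in zip(m, f)) + b for m, b in B)

def glue(f, L, g):
    """f on the region picked out by L, g outside of it: L*f + (1 - L)*g."""
    return [l * fx + (1 - l) * gx for fx, l, gx in zip(f, L, g)]

def updated_expectation(B, L, g, f):
    """Renormalized update of the functional f -> E_B(f) on L relative to g."""
    n = len(L)
    low = expectation(B, glue([0.0] * n, L, g))
    high = expectation(B, glue([1.0] * n, L, g))
    return (expectation(B, glue(f, L, g)) - low) / (high - low)

Z = [1.0, 1.0, 0.0]                       # indicator of the clopen set {0, 1}
g0, g1 = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]
f_a, f_b = [0.5, 1.0, 0.0], [1.0, 0.5, 0.0]

# A single probability distribution: g makes no difference after updating.
mu = [([0.4, 0.1, 0.5], 0.0)]
mu_a0, mu_a1 = (updated_expectation(mu, Z, g, f_a) for g in (g0, g1))

# Two distributions that put different amounts of mass outside Z: g matters.
H = [([0.8, 0.0, 0.2], 0.0), ([0.0, 0.2, 0.8], 0.0)]
a0, b0 = updated_expectation(H, Z, g0, f_a), updated_expectation(H, Z, g0, f_b)
a1, b1 = updated_expectation(H, Z, g1, f_a), updated_expectation(H, Z, g1, f_b)
print(mu_a0, mu_a1)    # same value for both choices of g
print(a0, b0, a1, b1)  # the ranking of f_a and f_b flips between g0 and g1
```

In this example f_a beats f_b when g=0, and f_b beats f_a when g=1, even though f_a and f_b themselves never changed. For the single distribution, the choice of g washes out entirely.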

So, there’s the motivation for updates. Really, it’s a question of “how do we extend expectations of functions in Z to a function over all of X so we can take expectations”, and it turns out that different choices of g (to extend to all of X) produce different results, while this isn’t the case with conventional probability distributions. And, while g=0 produces the best analogies to conventional probability theory and updating and has the most nice properties, g=0 pretty much corresponds to “I don’t care about what happens outside of Z”, and dynamic consistency mandates that you do care about what happens outside of the events you update on.

But still, only being able to update on clopen sets really sucks. Can we generalize it? Well, using a theorem not shown here (because this section is a later-edited addition), we can go:

“Let’s say we’ve got an infradistribution h on the space X, and a continuous function L:X→[0,1] that gives the probability of a particular point producing the observation we got. We use h and L to get a joint infradistribution over X×{0,1} (ie, did we see the observation or not). Then we update on 1 (ie, we saw the observation, this is a clopen update), and our function g tells us how good things would be for a given x conditional on not getting the observation. Our new infradistribution is over X×{1}, which is isomorphic to X. This new infradistribution has thus-and-such form.”

With that, we can update on observations where different points in X have different odds of producing your observation of interest. L is your function telling you how likely it is that a point would produce the observation you saw, g is your function telling you how well things do for a given point x if you didn’t see that observation. This is basically updating on any fuzzy set with a continuous indicator function (L can be considered the indicator function for a fuzzy set), instead of just updating on clopen sets. Much more general! Also note how the logical induction paper had to use continuous indicator functions instead of sharp discontinuous ones, this is analogous. How do updates work when we’re updating on fuzzy sets?

Well, a nifty feature of a continuous likelihood function L is that it lets us glue two continuous functions f,g∈C(X,[0,1]) into another continuous function, in a similar way to our original attempt of “f in the region we’re restricting to, g outside it”

Definition 9: L-Gluing

Given three continuous functions L,f,g:X→[0,1], f glued to g via L, f★Lg, is the function defined as: f★Lg:=Lf+(1−L)g

So, f★Lg behaves like f inside the region we’re updating on, and like g outside of it, and the two get mixed in regions we’re unsure of.

Going back to the positive functional picture with h (our analogue of an expectation), we can now define the raw update relative to L (likelihood function) and g (off-observation value). The following form is derived from our theorem about reducing updates on fuzzy sets to sharp updates on “did I get this observation”.

$$h^g_L(f) := h(f \star_L g), \quad \text{where } f \in C(\overline{\mathrm{supp}(L)}, [0,1])$$

And then the full version of updating on fuzzy sets in all its generality (on the concave functional side of the duality) is:

$$(h|^g_L)(f) := \frac{h(f \star_L g) - h(0 \star_L g)}{h(1 \star_L g) - h(0 \star_L g)}$$

Squinting at this, this is: “do a raw update. Then subtract the excess value off (first step of renormalizing), and then rescale according to h(1★Lg)−h(0★Lg) which… well, ignoring the g part, looks kinda like “expectation of 1 - expectation of 0”… oh, it’s like using the gap between the measure on X and the measure on ∅ to define how much you need to scale up a measure to be a probability distribution! That term on the bottom is the analogue of the probability of the event you’re updating on, relative to the function g!”

Let’s flesh this out a bit more in preparation for defining updates on the set side of the duality.

Backing up to our definition of renormalization, we required that EH(1)=1, and EH(0)=0. If we pivot to viewing the function inside the expectation as our L that gives an indicator function for something, the normalization condition says something like “probability of X is 1, probability of ∅ is 0”.

Let’s generalize this and define probabilities of fuzzy events relative to a function g.

Definition 10: Probabilities of Fuzzy Sets

Given two functions g,L∈C(X,[0,1]), and a nonempty set of sa-measures B, the probability of L (interpreted as a fuzzy set) w.r.t B and g is: PgB(L):=EB(1★Lg)−EB(0★Lg)

If g is 0, then for an infradistribution H, by normalization, and unpacking what that ★ means, we get P0H(L)=EH(L). This is much like how the probability of a set is the expectation of the indicator function for the set, just interpret the L as the indicator function for a fuzzy set on the right, and as a fuzzy set on the left. So, for g=0, it behaves like an actual probability.

However, when g≠0, then this is better interpreted as caring-measure. PgB(L) is the difference between the best score you can get vs Murphy and the worst score you can get vs Murphy if you know how things go outside of L (interpreted as fuzzy set). This g-dependent “probability” is actually “how much value is at stake here/how much do I care about what happens inside set L given that I know what happens outside of it.”
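Definitions 9 and 10 are short enough to write out directly. This is a sketch assuming a three-point space, a fuzzy set L that is fully in at point 0, half in at point 1, and fully out at point 2, and a finite list of generating points standing in for H (names and numbers are illustrative only).

```python
def expectation(B, f):
    """E_B(f) = min over (m, b) in the finite list B of m(f) + b."""
    return min(sum(mx * fx for mx, fx in zip(m, f)) + b for m, b in B)

def glue(f, L, g):
    """L-gluing: (f *_L g)(x) = L(x) f(x) + (1 - L(x)) g(x)."""
    return [l * fx + (1 - l) * gx for fx, l, gx in zip(f, L, g)]

def fuzzy_probability(B, L, g):
    """P^g_B(L) = E_B(1 *_L g) - E_B(0 *_L g)."""
    n = len(L)
    return expectation(B, glue([1.0] * n, L, g)) - expectation(B, glue([0.0] * n, L, g))

H = [([0.8, 0.0, 0.2], 0.0), ([0.0, 0.2, 0.8], 0.0)]
L = [1.0, 0.5, 0.0]

p_zero = fuzzy_probability(H, L, [0.0, 0.0, 0.0])  # g = 0: an ordinary-looking probability
p_one = fuzzy_probability(H, L, [1.0, 1.0, 1.0])   # g = 1: how much value is at stake in L
print(p_zero, expectation(H, L), p_one)
```

With g=0 the result coincides with EH(L), as claimed. With g=1 the same fuzzy set gets a much larger “probability”, because Murphy can no longer hurt you outside of L and so everything at stake sits inside it.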

And, further, our scaling term for renormalizing an inframeasure B that isn’t an infradistribution yet was (EB(1)−EB(0))⁻¹. So, using this reformulation, our rescaling term turns into (PgB(1X))⁻¹ regardless of g. So, our renormalization term is “rescale according to the probability you assign to any event at all occurring”.

Alright, we have enough tools to define updating on the set level. For the raw update (no rescaling yet), we need to chop the measure down according to L. We should also fold in the off-L value (requires specifying g) to the b term, by our dynamic consistency example. And then we do appropriate scale-and-shift terms to subtract as much as we can off the b term, and rescale according to the probability of L relative to the g we’re also updating on.

Let’s use m⋅L to denote the original finite signed measure but scaled down by the function L. If we view L as an indicator function, this is just slicing out all of the measure outside of L, ie, usual conditioning.

Definition 11: Updating

H|gL, H updated on g and L, is the set made by mapping H through the following function and taking closure and upper-completion.

$$(m,b) \mapsto \frac{1}{P^g_H(L)}\left(m \cdot L,\; b + m(0 \star_L g) - E_H(0 \star_L g)\right)$$

Closure is unnecessary if H is a bounded inframeasure, and upper-completion is unnecessary if your L is the indicator function for a clopen set.

Roughly, this is: Chop down the measure according to your fuzzy set, m(0★Lg) is the fragment of expected value you get outside of your fuzzy set so you fold that into the b term. For rescaling, when we unpack EH(0★Lg), it’s just inf(m′,b′)∈H(m′(0★Lg)+b′), so that’s the maximum amount of value we can take away from the second vector component without ever making it go negative. And then rescale “how much do I care about this situation” (PgH(L)) back up to 1.

Note that this updating process lands us in a different vector space. Our new vector space is $M^{\pm}(\overline{\mathrm{supp}(L)})\oplus\mathbb{R}$. It still fulfills the nice properties we expect, because a closed subset of a compact space is compact, so every nice property carries over. And it still has a closed convex cone of sa-measures w.r.t. the new space, abbreviate that as Msa(L).

What properties can we show about updates?

Proposition 13: When updating a bounded infradistribution over Msa(X), if the renormalization doesn’t fail, you get a bounded infradistribution over the set Msa(L). (for infradistributions in general, you may have to take the closure)

Proposition 14: EH(f★Lg)=EH(0★Lg)+PgH(L)EH|gL(f)
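Definition 11 and Proposition 14 can be checked on a toy example. The sketch below assumes a three-point space and a finite list of generating points for H, and it keeps the updated measures as vectors over all of X (with zero weight wherever L is 0) instead of restricting to the closure of the support of L. Closure and upper completion are skipped, since they don’t affect expectations. Helper names are hypothetical.

```python
def integrate(m, f):
    """m(f) for a finitely supported signed measure m."""
    return sum(mx * fx for mx, fx in zip(m, f))

def expectation(B, f):
    """E_B(f) = min over (m, b) in the finite list B of m(f) + b."""
    return min(integrate(m, f) + b for m, b in B)

def glue(f, L, g):
    return [l * fx + (1 - l) * gx for fx, l, gx in zip(f, L, g)]

def update_set(B, L, g):
    """Map each (m, b) to (1/P) * (m.L, b + m(0 *_L g) - E_B(0 *_L g))."""
    n = len(L)
    zero_glued = glue([0.0] * n, L, g)   # this is (1 - L) * g
    off_value = expectation(B, zero_glued)
    prob = expectation(B, glue([1.0] * n, L, g)) - off_value
    if prob <= 0:
        raise ValueError("renormalization fails: P^g_B(L) is not positive")
    updated = [([mx * lx / prob for mx, lx in zip(m, L)],
                (b + integrate(m, zero_glued) - off_value) / prob)
               for m, b in B]
    return updated, prob, off_value

H = [([0.8, 0.0, 0.2], 0.0), ([0.0, 0.2, 0.8], 0.0)]
L = [1.0, 0.5, 0.0]
g = [0.0, 0.4, 1.0]
f = [0.5, 1.0, 0.3]

H_updated, prob, off_value = update_set(H, L, g)
lhs = expectation(H, glue(f, L, g))                    # E_H(f *_L g)
rhs = off_value + prob * expectation(H_updated, f)     # E_H(0 *_L g) + P * E_{H|}(f)
print(lhs, rhs)
```

The updated set is also normalized: its expectation of the constant 1 function is 1 and its expectation of the constant 0 function is 0, which is what the scale-and-shift terms were chosen to achieve.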

Ok, so it’s sensible on the set level. And for proposition 14, it means we can break down the expectation of two functions glued together by L into the expectation of g outside of L, and the probability of L relative to g times the updated expectation of f. We get something interesting when we reshuffle this. It rearranges to:

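$$E_{H|^g_L}(f) = \frac{1}{P^g_H(L)}\left(E_H(f \star_L g) - E_H(0 \star_L g)\right)$$

Further unpacking the probability, we get

$$E_{H|^g_L}(f) = \frac{E_H(f \star_L g) - E_H(0 \star_L g)}{E_H(1 \star_L g) - E_H(0 \star_L g)}$$

And then, translating to concave functionals via LF-duality, we get:

$$(h|^g_L)(f) = \frac{h(f \star_L g) - h(0 \star_L g)}{h(1 \star_L g) - h(0 \star_L g)}$$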

So, this shows that we got the right update! What happens when we update twice?

Proposition 15: $$(H|^g_L)|^{g'}_{L'} = H|^{g \star_{\frac{1-L}{1-LL'}} g'}_{LL'}$$

So, updating twice in a row produces the same effect as one big modified update. It may be a bit clearer if we express it as:

Corollary 2: Regardless of L and L′ and g, (H|gL)|gL′=H|g(LL′)

Corollary 3: If Y and Z are clopen sets, then, glossing over the difference between indicator functions and sets, (H|gY)|gZ=H|g(Y∩Z)
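Corollary 2 is easy to sanity-check in the concave functional picture. This sketch assumes a three-point space and treats an update as a function that takes a functional and returns a new functional; for simplicity it keeps everything defined on all of X rather than restricting to the support of L. The numbers are arbitrary.

```python
def make_functional(B):
    """The functional f -> E_B(f) for a finite list B of (m, b) pairs."""
    return lambda f: min(sum(mx * fx for mx, fx in zip(m, f)) + b for m, b in B)

def glue(f, L, g):
    return [l * fx + (1 - l) * gx for fx, l, gx in zip(f, L, g)]

def update(h, L, g):
    """Return h|^g_L: f -> (h(f *_L g) - h(0 *_L g)) / (h(1 *_L g) - h(0 *_L g))."""
    n = len(L)
    low = h(glue([0.0] * n, L, g))
    high = h(glue([1.0] * n, L, g))
    return lambda f: (h(glue(f, L, g)) - low) / (high - low)

h = make_functional([([0.8, 0.0, 0.2], 0.0), ([0.0, 0.2, 0.8], 0.0)])
L1 = [1.0, 0.5, 0.0]
L2 = [0.5, 1.0, 1.0]
L_both = [a * b for a, b in zip(L1, L2)]   # the pointwise product L L'
g = [0.2, 0.4, 1.0]
f = [0.9, 0.1, 0.6]

twice = update(update(h, L1, g), L2, g)(f)   # (h|^g_L)|^g_L'
once = update(h, L_both, g)(f)               # h|^g_(LL')
print(twice, once)
```

The reason this works is that gluing f to g via L′ and then gluing the result to g via L is the same function as gluing f to g via LL′, so the two renormalizations collapse into one.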

Now, what happens when you update a prior/mixture of infradistributions? We get something that recovers Bayesian updating.

Theorem 6, InfraBayes:

$$(E_\zeta H_i)|^g_L = \frac{E_\zeta\left(P^g_{H_i}(L) \cdot (H_i|^g_L)\right)}{E_\zeta\left(P^g_{H_i}(L)\right)}$$ if the update doesn’t fail.

This means that when we update a prior, it’s the same as updating everything individually, and then mixing those with probabilities weighted by the probability the infradistribution assigned to L, just like standard Bayes!
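Here is a numerical sketch of the InfraBayes theorem in the concave functional picture, assuming a three-point space, a 50/50 prior over two hypotheses, and finite generating lists for each hypothesis (all names and numbers are illustrative). The posterior weights are the prior weights times each hypothesis’ g-relative probability of L, renormalized.

```python
def make_functional(B):
    """The functional f -> E_B(f) for a finite list B of (m, b) pairs."""
    return lambda f: min(sum(mx * fx for mx, fx in zip(m, f)) + b for m, b in B)

def glue(f, L, g):
    return [l * fx + (1 - l) * gx for fx, l, gx in zip(f, L, g)]

def fuzzy_probability(h, L, g):
    n = len(L)
    return h(glue([1.0] * n, L, g)) - h(glue([0.0] * n, L, g))

def update(h, L, g):
    n = len(L)
    low = h(glue([0.0] * n, L, g))
    return lambda f: (h(glue(f, L, g)) - low) / fuzzy_probability(h, L, g)

hyps = [make_functional([([0.8, 0.0, 0.2], 0.0), ([0.0, 0.2, 0.8], 0.0)]),
        make_functional([([0.5, 0.4, 0.1], 0.0)])]
zeta = [0.5, 0.5]
L = [1.0, 0.5, 0.0]
g = [0.0, 0.0, 0.5]
f = [0.9, 0.1, 0.6]

prior = lambda func: sum(z * h(func) for z, h in zip(zeta, hyps))
probs = [fuzzy_probability(h, L, g) for h in hyps]
total = sum(z * p for z, p in zip(zeta, probs))
posterior = [z * p / total for z, p in zip(zeta, probs)]

lhs = update(prior, L, g)(f)                                     # update the mixture
rhs = sum(w * update(h, L, g)(f) for w, h in zip(posterior, hyps))  # mix the updates
print(posterior, lhs, rhs)
```

In this example the second hypothesis assigns L a higher probability relative to g, so it ends up with more posterior weight than the first, just like a likelihood ratio would do in ordinary Bayes.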

In particular, if some hypothesis goes “nothing matters anymore” and gives up and cries after seeing L, then its probability term is 0, so it will drop out of the updated prior entirely, and now you’re only listening to hypotheses that think what you do matters. Thus, with a fairly broad prior, we don’t have to worry about the agent giving up on life because nothing matters post-update, just as long as some component in its prior gives it the will to continue living/says different policies have different values. Well, actually, we need to show an analogue of this for belief functions, but it pretty much works there too.

Additional Constructions

There are more probability theory analogues, but they are primarily material for a future post. We’ll just give their forms in the concave functional view. If they look unmotivated, just note that they match up with the standard probability-theory notions if we use infradistributions corresponding to actual probability distributions. We’ll be using lambda-notation for functions. If you haven’t seen it before, λx.f(x,2) is the function that takes in an x, and returns f(x,2). λz.(λa.a+z) is the function that maps z to the function that maps a to a+z.

Definition 12: Product

If h1∈□X and h2∈□Y, the product h1×h2∈□(X×Y) is given by: (h1×h2)(f):=h1(λx.h2(λy.f(x,y)))

These products are noncommutative! The product of bounded infradistributions is a bounded infradistribution.
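A two-point example makes the noncommutativity visible. This sketch assumes X and Y are both {0, 1}, takes h1 to be total ignorance (Murphy picks the point) and h2 to be the uniform distribution, and represents a function on the product space as a nested list `f[x][y]`. The helper names are hypothetical.

```python
def product(h_outer, h_inner):
    """(h_outer x h_inner)(f) = h_outer(lambda x. h_inner(lambda y. f(x, y)))."""
    return lambda f: h_outer([h_inner(row) for row in f])

def transpose(f):
    """Turn f[x][y] into f[y][x], so the same function can be read on Y x X."""
    return [list(col) for col in zip(*f)]

h1 = lambda u: min(u)            # total ignorance over two points
h2 = lambda v: sum(v) / len(v)   # the uniform probability distribution

# f(x, y) = 1 if x == y, else 0
f = [[1.0, 0.0],
     [0.0, 1.0]]

forward = product(h1, h2)(f)               # Murphy picks x, then y is a coin flip
backward = product(h2, h1)(transpose(f))   # y is a coin flip, then Murphy picks x
print(forward, backward)  # 0.5 0.0
```

When Murphy moves first, the later coin flip still matches half the time. When Murphy moves second, they see the coin and always pick the mismatching point. So the order of the factors is the order of play.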

There are also infrakernels, the infra-analogue of Markov kernels. A Markov kernel is a function X→ΔY that maps each x to some probability distribution over Y. Concrete example: The function mapping income to a probability distribution over house size.

Definition 13: Infrakernel

An infrakernel is a function K:X→□Y that is:

1: Pointwise convergent. For all sequences xn limiting to x and all f∈C(Y,[0,1]), limn→∞K(xn)(f)=K(x)(f)

2: Uniformly equicontinuous. For all ϵ>0, there is a δ>0 where if |f−f′|<δ, then ∀x:|K(x)(f)−K(x)(f′)|<ϵ

If there is some λ⊙>0 where property 2 works with δ=ϵ/λ⊙, then it is a Lipschitz infrakernel.

The first two conditions let you preserve uniform continuity, and the strengthening of the second condition lets you preserve Lipschitzness/being a bounded inframeasure.

Now, we can define the semidirect product, h⋉K. The semidirect product in the probability-theory case is… Consider the aforementioned Markov kernel mapping income to a probability distribution over house size. Given a starting probability distribution over income, the semidirect product of (income distribution) with (house size kernel) would be the joint distribution over income and house size. It’s a critically important concept to know that isn’t discussed much.

Definition 14: Semidirect Product

If h∈□X, and K is an infrakernel X→□Y, the semidirect product h⋉K∈□(X×Y) is given by (h⋉K)(f):=h(λx.K(x)(λy.f(x,y)))

Products are just a special case of this, where K(x)=h2, regardless of x. If h is a bounded infradistribution and K is a Lipschitz infrakernel, then the semidirect product is a bounded infradistribution.

The pushforward of a probability distribution is… given the house size Markov kernel, and a distribution over income, the pushforward is the induced distribution over house size. We earlier gave pushforwards for a continuous function g:X→Y. What’s the analogue of the pushforward for an infrakernel? Or, can you do a pushforward w.r.t. a Markov kernel?

Definition 15: Pushforward

If h∈□X, and K is an infrakernel X→□Y, the pushforward K∗(h)∈□Y is given by K∗(h)(f):=h(λx.K(x)(f)). And if k is a continuous (in the KR-metric) Markov kernel X→ΔY, the pushforward k∗(h)∈□Y is given by k∗(h)(f):=h(λx.Ek(x)(f))

An interesting note about this is, if h is a bounded infradistribution, then we need Lipschitz infrakernels to preserve that property for pushforwards, but we do not need any additional condition on a Markov kernel to preserve boundedness besides continuity. Exercise: Try to figure out why.

Really, everything originates from the semidirect product. The product is the special case of a semidirect product for a constant infrakernel, the pushforward is a semidirect product that’s projected down to the Y coordinate, the pushforward w.r.t a Markov kernel is a special case of pushforward w.r.t. an infrakernel, and the pushforward w.r.t. a continuous function is a special case of pushforward w.r.t. a Markov kernel.
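That chain of special cases can be spelled out in a few lines. The sketch below assumes finite X and Y, represents an infrakernel as a list of functionals indexed by x, and represents a function on X×Y as a nested list `f[x][y]`; every helper name is made up for illustration. On finite spaces the continuity conditions of Definition 13 hold automatically, so they aren’t checked.

```python
def make_functional(B):
    """The functional f -> E_B(f) for a finite list B of (m, b) pairs."""
    return lambda f: min(sum(mx * fx for mx, fx in zip(m, f)) + b for m, b in B)

def semidirect(h, K):
    """(h ⋉ K)(f) = h(lambda x. K(x)(lambda y. f(x, y)))."""
    return lambda f: h([K[x](f[x]) for x in range(len(K))])

def product(h1, h2, size_x):
    """The product is the semidirect product with a constant infrakernel."""
    return semidirect(h1, [h2] * size_x)

def pushforward(h, K):
    """Semidirect product projected to Y: lift f(y) to (x, y) -> f(y)."""
    return lambda f: semidirect(h, K)([f for _ in K])

def markov_to_infrakernel(k):
    """A Markov kernel (rows are probability vectors) is a special infrakernel."""
    return [make_functional([(row, 0.0)]) for row in k]

def function_to_markov(g, size_y):
    """A function g: X -> Y is the Markov kernel x -> Dirac delta at g(x)."""
    return [[1.0 if y == gx else 0.0 for y in range(size_y)] for gx in g]

h = make_functional([([0.5, 0.3, 0.2], 0.0), ([0.1, 0.1, 0.8], 0.0)])
g = [0, 0, 1]
f = [0.25, 1.0]

via_kernel = pushforward(h, markov_to_infrakernel(function_to_markov(g, 2)))(f)
direct = h([f[gx] for gx in g])   # the earlier definition, h(f∘g)
print(via_kernel, direct)
```

Going down the chain from semidirect product to pushforward along a plain function recovers the (g∗(h))(f)=h(f∘g) definition from the start of the section.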

And that’s about it for now, see you in the next post!

## Basic Inframeasure Theory

Our task in this post will be to develop the basic theory and notation for inframeasures and sa-measures. The proofs and concepts require some topology and functional analysis. We assume the reader is familiar with topology and linear algebra but not functional analysis, and will explain the functional analysis concepts more. If you wish to read through these posts, PM me to get a link to MIRIxDiscord, we’ll be doing a group readthrough where I or Vanessa can answer questions. Here’s the previous post and here are the proof sections. Beware, the proof sections are hard.

Notation Reference

Feel free to skip this segment and refer back to it when needed. Duplicate the tab, keep one of them on this section, and you can look up notation here.

X,Y: some compact metric space. Points in this are denoted by x or y.

d: Some distance metric, the usual one is the KR-metric between signed measures, defined as d(m,m′):=sup_flip |∫X flip(x)dm−∫X flip(x)dm′| where flip is a function X→[−1,1] that has a Lipschitz constant of 1 or less. In other words, what’s the biggest distance between the values that the measures output when you feed in a function that’s limited in its ability to distinguish measures that are close to each other due to having a small Lipschitz constant?

M±(X): The Banach space of finite signed measures over X equipped with the KR-metric/norm, defined as above (the norm is derived from the metric by assessing the distance between the signed measure and the 0 measure). Elements are denoted by m. By Jordan decomposition, we can uniquely split it into m++m− where the former is all-positive and the latter is all-negative.

C(X),C(X,[0,1]): The Banach space of continuous functions X→R. The latter one is the space of continuous functions bounded in [0,1]. Elements of C(X,[0,1]) are typically denoted by f.

m(f): We can interpret signed measures as continuous linear functionals on C(X). This is given by ∫Xf(x)dm. If m was actually a probability distribution, this would just be Eμ(f). They’re generalized expectations.

b: used to refer to the number component of an a-measure or sa-measure.

Msa(X): The closed convex cone of sa-measures. An sa-measure is a pair (m,b) where b+m−(1)≥0. Elements of this (sa-measures) are denoted by M.

f+: A positive functional. A continuous linear functional that’s nonnegative for all sa-measures.

B: A set of sa-measures.

EB(f): The expectation of a function f relative to a set of sa-measures. Defined as inf(m,b)∈B(m(f)+b).

Bmin,Buc: The set of minimal points of B (points that can’t be written as a different point in B plus a nonzero sa-measure), and the upper completion of B (B+Msa(X)), respectively.

Ma(X): The closed convex cone of a-measures. An a-measure is a pair (m,b) where m is a measure (no negative component) and b≥0. It can also be written as (λμ,b) where λ≥0, and μ is a probability distribution.

λ: Given an a-measure, its λ value is the λ from writing the a-measure as (λμ,b). At the end, it’s used for lambda-notation to describe complicated functions. Context distinguishes.

λ⊙: Either the minimal upper bound on the λ values of the minimal points of a set B, or the Lipschitz constant of a function h (there’s a close link between the two).

$\overline{B}$, c.h(B): The closure and convex hull of a set, respectively.

H: An infradistribution. A set of sa-measures fulfilling the properties of nonemptiness, closure, convexity, upper-completeness, positive-minimals, (weak)-bounded minimals, and normalization.

□X: The set of infradistributions over X. □bX is the set of bounded infradistributions over X.

ΔX: The space of probability distributions over some set X.

h: The function induced by an H that goes f↦EH(f). Or, just a function C(X,[0,1])→[0,1] that’s concave, monotone, uniformly continuous, and normalized, there’s a link with infradistributions.

g,g∗: If you’re seeing it in the context of a pushforward, g is a continuous function X→Y, and g∗ is the induced function □X→□Y. If you’re seeing g in the context of updating, it’s a continuous function in C(X,[0,1]).

ζ: Used to denote probability distributions used for mixing stuff together, a probability distribution over the natural numbers or a finite subset of them in all cases. ζi is the probability on the number i.

i: The index variable for mixing, like indexing infradistributions or points or functions.

EζHi: A mix of sets, defined as the set of every point that can be constructed by selecting a point from each Hi and mixing according to ζ.

L: A function in C(X,[0,1]), thought of as the indicator function for a fuzzy set, that we use for updating.

f★Lg: The function made by gluing f and g together via L, defined as (f★Lg)(x):=L(x)f(x)+(1−L(x))g(x).

supp(f): The support of a function f, the set of x where f(x)>0.

1E: The indicator function that’s 1 on set E, 0 everywhere else.

PgH(L): The probability of L relative to the function g according to the infradistribution H, it’s defined as EH(1★Lg)−EH(0★Lg)

H|gL: The infradistribution H updated on the fuzzy set L relative to the function g.

K: An infrakernel, a function fulfilling some continuity properties, of type signature X→□Y.

Basic concepts

Time to start laying our mathematical groundwork, like the spaces we’ll be working in.

X is some compact metric space, equipped with the Borel σ-algebra (which, in this case, is the same as the Baire σ-algebra). We could probably generalize this further to complete metric spaces, but things get significantly trickier (one of many directions for future research), and compact metric spaces are quite well-behaved.

Concrete examples of compact metric spaces include the space of infinite bitstrings, the color space for a mantis shrimp, the surface of a sphere, a set of finitely many points, the unit interval, the space of probability distributions over a compact metric space, and countable products of compact metric spaces.

Let’s recap some functional analysis terminology for those seeing it for the first time, and a bit of the notation we’re using, you can skip this part if you already know it. Vector spaces in full generality may lack much of the nice structure present in Rn that’s used in Linear Algebra. Going from the strongest and most useful structure to the weakest, there’s a chain of implication that goes inner product, norm, metric, topology. If you have an inner product, you can get a norm. If you have a norm, you can get a metric from that via d(x,y)=||x−y||, and if you have a metric, you can get a topology from that (with a basis of open balls centered at points). The structure must be imposed on the vector space, and there may be some freedom in doing so, like how Rn can have the L1, L2, or L∞ norm.

A Banach space is a vector space equipped with a norm (a notion of size for vectors) that’s also complete: every Cauchy sequence converges to a limit inside the space, just like in R.

The term “functional” is used for a function to R. So, a continuous linear functional on a vector space V is a function that’s: linear, continuous, and has type signature V→R.

The term “dual space of V” is used for the vector space of continuous linear functionals V→R.

The space C(X) is the Banach space of continuous functions X→R. We’ll also use C(X,[0,1]) to denote the subset that just consists of continuous functions bounded in [0,1]. The dual space of C(X) is M±(X), the Banach space of finite signed measures over X.

Moving on from the functional analysis terminology, let’s consider finite signed measures m, an element of M±(X). A signed measure m can be uniquely split into a positive part and a negative part m++m−, by the Jordan Decomposition Theorem. The “finite” part just means that m+ doesn’t assign any set ∞ measure and m− doesn’t assign any set −∞ measure.

Now, we said that the space of finite signed measures was the dual space of C(X) (continuous functions X→R). So… how does a signed measure correspond to a continuous linear functional over C(X)? Well, that corresponds to m(f):=∫Xf(x)dm

If m was a probability distribution μ, then μ(f) would be Eμ(f), so this is just like taking the expected value, but generalizing to negative regions in m. We’ll be using this notation m(f) a whole lot. Because finite signed measures perfectly match up with continuous linear functionals C(X)→R, we can toggle back and forth between whichever view is the most convenient in the moment, viewing continuous linear functionals on C(X) as finite signed measures, and vice-versa.

Our ambient vector space we work in is M±(X)⊕R, a pair of a signed measure and a number. ⊕ is the direct sum, which is basically Cartesian product for vector spaces. The direct sum of Banach spaces is a Banach space, with the norm defined in the obvious way as ||(m,b)||=||m||+|b|.

We should take a moment to talk about what norm/metric we’re using. The norm/metric we’re using on M±(X) is the KR(Kantorovich-Rubinstein)-norm/metric.

Definition 1: KR-metricThe metric defined byd(m,m′):=supflip|m(flip)−m′(flip)|whereflipis a a continuous functionX→[−1,1]with a Lipschitz constant of 1 or less. The KR-norm is||m||=d(m,0)But why are we using this unfamiliar KR-metric, instead of the trusty old total variation distance? Well, the KR-metric, since it can only query the measure with functions that aren’t too steep or big, says that two distributions are close if, y’know, they’re close in the intuitive sense. If we have a bunch of Dirac-delta distributions at 0.9, 0.99, 0.999..., then according to the KR-metric, they’d limit to a Dirac-delta distribution at 1. According to total variation distance, all these distributions at distance 2 from each other and don’t converge at all. Similarly, if we’ve got two probability distributions over histories for an environment we’re in, and they behave

verysimilarly and then start diverging after the gazillionth timestep, the KR-metric would go “hm, those two distributions are very close to each other”, while total variation distance says they’re very different. Also, if we’ve got finitely many points at distance 1 from each other, the KR-metric and total variation distance match up with each other (up to a constant). But total variation distance is a bit too restrictive for the continuous case.There’s a sense in which convergence in total variation distance too strict because it requires a “perfect match” to exist in your hypothesis space, while convergence in KR-distance is just right for nonrealizability because, instead of requiring a “perfect match”, it just requires that you get sufficiently close. Instead of getting accurate predictions for the rest of forever, it’s a requirement that’s something more like “you’ll be accurate for the next zillion timesteps, and the time horizon where you start being inaccurate gets further and further away over time”. You can draw an analogy to how utility functions with the time discount don’t care that much about what happens at very late times.

Going with the KR-metric means we’ve got very nice dual spaces and compactness properties, while with total variation distance, Wikipedia doesn’t even know what the dual space is.

So, tl;dr, the KR-metric is a much better choice for our setting, and we’re working in M±(X)⊕R as our vector space, which is equipped with the KR-norm and is closed under limits.

Definition 2: Sa-Measure

A point M∈M±(X)⊕R which, when written as a pair of a signed measure and a number (m,b), has b+m−(1)≥0. The set of sa-measures is denoted by Msa(X).

Definition 3: A-Measure

A point M∈M±(X)⊕R which, when written as a pair of a signed measure and a number (m,b), has m as a measure, and b≥0. The set of a-measures is denoted by Ma(X).

Note that Ma(X) and Msa(X) are both closed convex cones. A closed convex cone is a subset of a vector space that is closed under multiplication by any a≥0, is closed under addition, and is closed under limits. For visual intuition, imagine a literal cone with its point at 0 in R3 that’s heading off in some direction, and see how it fulfills those 3 properties.
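As a rough illustration of Definitions 2 and 3, assume X is a finite set, so that a signed measure is just a list of signed masses and m−(1) is the sum of its negative entries. The helper names below are hypothetical.

```python
from typing import List

Measure = List[float]  # m[i] is the signed mass on the i-th point of a finite X


def negative_mass(m: Measure) -> float:
    """m^-(1): total mass of the negative part (a number <= 0)."""
    return sum(v for v in m if v < 0)


def is_sa_measure(m: Measure, b: float) -> bool:
    """Definition 2: the b term must make up for all the negative mass."""
    return b + negative_mass(m) >= 0


def is_a_measure(m: Measure, b: float) -> bool:
    """Definition 3: m is an honest (nonnegative) measure and b >= 0."""
    return all(v >= 0 for v in m) and b >= 0
```

For instance, ([0.5, −0.2], 0.3) passes the sa-measure check but is not an a-measure, while ([0.5, −0.2], 0.1) fails both.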

Basic Inframeasure Conditions

Before proceeding further, we should mention that Theorems 1, 2, and 3 are fairly elementary and have probably been proved in more generality in some different paper on convex analysis. We just call them theorems because they’re important, not necessarily original. Past that, things get more novel. Sets of distributions instead of distributions have been considered before under the name “Imprecise Probability”, as have nonlinear expectations and some analogues to probability theory; Shige Peng wrote a book on the latter. We found out about this after coming up with it independently. The main innovations that have not been found elsewhere are augmenting the sets of probability distributions with extra data (ie, our sa-measures) to get a dynamically consistent update rule, and how to deal with environments/link the setting to reinforcement learning. Let’s move on.

Let B be some arbitrary set of sa-measures. We’re obviously nowhere near calling it an infradistribution, because we haven’t imposed any properties on it. And different B may have the same behavior, we’re nowhere near our second desideratum of collapsing equivalence classes of B’s with the same behavior. Well, nonemptiness should be a fairly obvious property to add.

Condition 1: Nonemptiness: B≠∅

From here, let’s see how we can enlarge B without affecting its behavior. But hang on, what do we even mean by “behavior”?

Definition 4: Expectation w.r.t. a Set of Sa-Measures

$E_B(f) := \inf_{(m,b)\in B}(m(f)+b)$

where f∈C(X,[0,1]) and B is nonempty.

This is what we mean by “behavior”: all these values should be left unchanged, regardless of f.
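Here is a small sketch of Definition 4, assuming a finite X (so m(f) is a dot product) and a finite set B (so the inf is a min). The set B below is made up purely for illustration.

```python
from typing import List, Tuple

SaMeasure = Tuple[List[float], float]  # the pair (m, b)


def apply_measure(m: List[float], f: List[float]) -> float:
    """m(f): the integral of f against m, a dot product on a finite X."""
    return sum(mi * fi for mi, fi in zip(m, f))


def expectation(B: List[SaMeasure], f: List[float]) -> float:
    """E_B(f): the worst case of m(f) + b over all (m, b) in B."""
    return min(apply_measure(m, f) + b for m, b in B)


# Two hypothetical points on a two-point space X.
B = [([0.5, 0.5], 0.0), ([0.2, 0.2], 0.5)]
e_first_coord = expectation(B, [1.0, 0.0])  # minimized by the first point
e_zero = expectation(B, [0.0, 0.0])         # picks out the smallest b
e_one = expectation(B, [1.0, 1.0])          # minimized by the second point
```

Different points of B can be the minimizer for different f, which is why the whole set matters and not just one of its elements.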

Proposition 1: If f∈C(X,[0,1]) then f+:(m,b)↦m(f)+b is a positive functional for Msa(X).

A positive functional for Msa(X) is a continuous linear function M±(X)⊕R→R that is nonnegative everywhere on Msa(X).

This suggests two more conditions besides Nonemptiness.

Condition 2: Closure: $B=\overline{B}$

Condition 3: Convexity: B=c.h(B)

Why can we impose closure and convexity? Taking the closure of B wouldn’t affect any expectation values, it’d only swap inf for min in some cases, because (m,b)↦m(f)+b is continuous. Also, since everything we’re querying our set B with is the inf of a linear functional by Proposition 1, we can take the convex hull without changing anything. So, swapping B out for its closed convex hull, no expectation values change at all.

But wait, we aren’t querying B with all linear functionals, we’re only querying it with positive functionals that are constructed from an f in C(X,[0,1]). Or does this class of positive functionals go further than we think? Yes, it does, actually.

Theorem 1, Functionals To Functions: Every positive functional on Msa(X) can be written as (m,b)↦c(m(f)+b), where c≥0, and f∈C(X,[0,1]).

Nice! We actually are querying our set with all positive functionals, because we’ve pretty much got everything with just f∈C(X,[0,1]), and everything else is just a scalar multiple of that.

Upper Completion and Minimal Points

If you have a point M∈B, and some other point M∗ that’s an sa-measure, we might as well add M+M∗ to B. Why? Well, given some positive functional f+ (and everything we’re querying our set B with is a positive functional by Proposition 1),

f+(M+M∗)=f+(M)+f+(M∗)≥f+(M)

By linearity and positive functionals being nonnegative on sa-measures, your new point M+M∗ has equal or greater value than M, so when we do infM∈Bf+(M), the addition of the new point didn’t change anything at all, regardless of which positive functional/continuous function (by Theorem 1) we’re using. So then, let’s add in all the points like this! It’s free. This would be done via Minkowski sum.

B+Msa(X) = {M | M=MB+M∗, MB∈B, M∗∈Msa(X)}
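A quick numerical sanity check of this argument, under the assumption of a finite two-point X and a finite B: adding points of the form M+M∗ for a single sa-measure M∗ should leave every expectation unchanged. The specific B and M∗ are hypothetical, and a coarse grid stands in for "all f in C(X,[0,1])".

```python
from itertools import product


def apply_measure(m, f):
    """m(f) on a finite X."""
    return sum(mi * fi for mi, fi in zip(m, f))


def expectation(B, f):
    """E_B(f) for a finite set B of (m, b) pairs."""
    return min(apply_measure(m, f) + b for m, b in B)


def add_points(M, N):
    """Componentwise sum of two (m, b) pairs."""
    return ([a + c for a, c in zip(M[0], N[0])], M[1] + N[1])


B = [([0.5, 0.5], 0.0), ([0.2, 0.2], 0.5)]
M_star = ([0.3, -0.1], 0.2)  # an sa-measure: 0.2 + (-0.1) >= 0
enlarged = B + [add_points(M, M_star) for M in B]

# Compare expectations on a grid of functions f: X -> [0, 1].
grid = [i / 4 for i in range(5)]
max_change = max(
    abs(expectation(enlarged, [f0, f1]) - expectation(B, [f0, f1]))
    for f0, f1 in product(grid, repeat=2)
)
```

Here `max_change` comes out as zero on the grid, matching the claim that the new points are never the worst case.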

Definition 5: Upper Completion

The upper completion of a set B, Buc, is Buc := B+Msa(X).

Condition 4: Upper Completeness: B=B+Msa(X)

Ok, so we add in all those points. Since we’re adding two nonempty convex sets, the result is also convex. As for closure...

Lemma 2: The upper completion of a closed set of sa-measures is closed.

However, B+Msa(X) isn’t quite what we wanted. Maybe there’s more points we could add! We want to add in every sa-measure we possibly can to our set as long as it doesn’t affect the essential “behavior”/worst-case values. So, we should be able to add in a point if every positive functional/continuous function (Proposition 1 and Theorem 1) goes “the value of the point you’re looking at is undershot by this preexisting point over here”. This more inclusive notion of adding points to make B as big as possible (adding any more points would start affecting the “behavior” of our set) would be:

Add a point (m′,b′) to B if, for all f in C(X,[0,1]), there’s a (m,b) in B where m′(f)+b′≥m(f)+b

Actually, this gets us nothing over just taking the upper completion/adding Msa(X)! Check out the next result.

Proposition 2: For closed convex nonempty B,

$B+M_{sa}(X)=\{M \mid \forall f^+\, \exists M'\in B: f^+(M)\geq f^+(M')\}$

Combining Proposition 2 and Theorem 1, our notion of upper completion is exactly the same as “add all the points you possibly can that don’t affect the inf(m,b)∈B(m(f)+b) value for any f”.

Along with the notion of the upper completion comes the notion of a minimal point.

Definition 6: Minimal Point

A minimal point of a closed nonempty set of sa-measures B is a point M∈B where, if M=MB+M∗, and MB∈B, and M∗∈Msa(X), then M∗=0. The set of minimal points is denoted Bmin.

So, minimal points can’t be written as a different point in the same set plus a nonzero sa-measure. It’s something that can’t possibly have been added by the upper-completion if it wasn’t there originally. We’ll show a picture of the upper completion and minimal points (these two notions generalize to any closed subset of any closed cone), to make things more concrete.

Theorem 2, Minimal Decomposition: Given a nonempty closed set of sa-measures B, the set of minimal points Bmin is nonempty and all points in B are above a minimal point.

This means that we can take any point M∈B and decompose it into Mmin+M∗, where Mmin∈Bmin, and M∗ is an sa-measure. We use this a whole lot in proofs. The proof of this uses the axiom of choice in the form of Zorn’s Lemma, but separability may let us find some way to dodge the use of the full axiom of choice.

Proposition 3: Given an f∈C(X,[0,1]), and a B that is nonempty closed, inf(m,b)∈B(m(f)+b) = inf(m,b)∈Bmin(m(f)+b).

So when evaluating EB(f), we can just minimize within the set of minimal points. Minimal points are the only thing that matters for the “behavior” of B.

Proposition 4: Given a nonempty closed convex B, Bmin=(Buc)min and (Bmin)uc=Buc.

The set of minimal points is left unchanged when you take the upper completion, and taking the upper completion of the set of minimal points equals taking the upper completion of B. This is fairly intuitive from the picture.

Theorem 3, Full Characterization: If the nonempty closed convex sets A and B have Amin≠Bmin, then there is some f∈C(X,[0,1]) where EA(f)≠EB(f).

Corollary 1: If two nonempty closed convex upper-complete sets A and B are different, then there is some f∈C(X,[0,1]) where EA(f)≠EB(f).

Looking back at our second desideratum, it says “Our notion of a hypothesis in this setting should collapse “secretly equivalent” sets, such that any two distinct hypotheses behave differently in some relevant aspect. This will require formalizing what it means for two sets to be “meaningfully different”, finding a canonical form for an equivalence class of sets that “behave the same in all relevant ways”, and then proving some theorem that says we got everything.”

And we did exactly that. Also, Theorem 3 and the other results justify the view of the minimal points as the “unique identifier” of a closed convex set. If two closed convex sets have the same minimal-point ID, then taking the upper completion gets you the same set, and they behave the same w.r.t all the queries we can throw at them. If two sets have a different minimal-point ID, then when we take the upper completion, they’re different, and there’s some query that distinguishes them.

Minimal Point Conditions

Well, this is the basics. But we can impose some more conditions. We don’t really like these signed measures, it’d be nice to work exclusively with positive measures, if possible. The minimal points are all we really need to care about by Proposition 3, so let’s require that they’re all in the smaller cone Ma(X), which has no negative-measure shenanigans afoot. Renormalization may fail if there’s minimal points that have negative parts, which is analogous to how you can renormalize a positive measure back to a probability distribution, but a signed measure may not be able to be renormalized back to 1.

Condition 5: Minimal-positivity: Bmin⊆Ma(X)

Further, things can get a bit tricky in various places if the minimal points don’t lie in some compact set. Compactness arguments let you show that you don’t have to close your set after updating, and they show up a lot in our proofs of properties of belief functions. However, this next condition isn’t essential, just convenient, and it’s worthwhile looking at what happens when it’s dropped, for future research.

Condition 6a: Minimal-boundedness: There is a compact set C s.t. Bmin⊆C.

Proposition 5: Let μ denote an arbitrary probability distribution. If Bmin⊆Ma(X), then the condition “there is a λ⊙ where, ∀(λμ,b)∈Bmin: λ≤λ⊙” is equivalent to “there is a compact C s.t. Bmin⊆C”.

We mostly use this formulation of Minimal-boundedness instead of the compact set requirement. We only have to bound the scale-terms on the minimal points and we have this property. Again, it’s not essential, but very convenient.

Is there a weakening of bounded-minimals? Yes there is. I haven’t figured out what it means for the set of minimal points (EDIT 8/28: got a clean iff result about what it means for minimal points while trying to prove something else), but it’s more mathematically essential, and I don’t think it can be dropped if some other post wants to go further than we did. It can’t be motivated at this point, we’ll have to wait until we get to Legendre-Fenchel Duality.

Condition 6b: Weak minimal-boundedness: The function f↦EB(f) is uniformly continuous.

Normalization

So, we have almost everything. Nonemptiness, closure, convexity, upper-completion, minimal-positivity, and minimal-boundedness/weak minimal-boundedness are our conditions so far.

However, there’s one more condition. What’s the analogue of renormalizing a measure back to 1 in this situation? Well, for standard probability distributions μ, Eμ(0)=0, and Eμ(1)=1. This can be cleanly ported over. We shall require that EB(0)=0, EB(1)=1, that’s our analogue of normalization. Unpacking what the expectation means, this corresponds to: There’s minimal points (λμ,b) where b is arbitrarily close to 0, and there’s some minimal point (λμ,b) where λ+b=1, and there’s no points with a lower λ+b value.

Condition 7: Normalization: EB(1)=1, EB(0)=0

Let’s recap all the conditions. We’ll be using H for something fulfilling all of the following conditions except maybe 6a.

1: Nonemptiness: H≠∅

2: Closure: $H=\overline{H}$

3: Convexity: H=c.h(H)

4: Upper Completeness: H=H+Msa(X)

5: Positive-Minimals: Hmin⊆Ma(X)

6a: Bounded-Minimals: ∃λ⊙: (λμ,b)∈Hmin → λ≤λ⊙

6b: Weak Bounded-Minimals: The function f↦EH(f) is uniformly continuous.

7: Normalization: EH(0)=0, EH(1)=1

Definition 7: (Bounded)-Infradistribution/Inframeasure

An inframeasure is a set of sa-measures that fulfills conditions 1-5 and 6b. An infradistribution H is a set of sa-measures that fulfills conditions 1-5, 6b, and 7. The “bounded” prefix refers to fulfilling condition 6a. The set of infradistributions is denoted as □X, the set of bounded infradistributions is denoted as □bX.

Now, how do we get normalization if it isn’t already present? Closure, Convexity, and Upper Completeness can be introduced by closure, convex hull, and upper completion, respectively. How do we turn an inframeasure into an infradistribution?

Well, just take every (m,b) in your set, and map it to:

$\frac{1}{E_B(1)-E_B(0)}\,\big(m,\ b-E_B(0)\big)$

This may seem a bit mysterious. This normalization can be thought of as analogous to rescaling a utility function to be in [0,1] via scale-and-shift.

What we’re doing first, is shifting everything down by EB(0), which is as much as we possibly can manage without making b go negative anywhere. The utility function analogue would be, if your expected utilities are (0.4,0.5,0.6), this is like shifting them down to (0,0.1,0.2).

The second thing we do is go “ok, what’s the lowest value at 1? Let’s scale that back up to 1”. Well, it’d be EB(1)−EB(0) (remember, we shifted first), so we multiply everything by $(E_B(1)-E_B(0))^{-1}$ to get our set of interest.

Hang on, what if there’s a divide-by-zero error? Well.. yes, that can happen. In order for it to happen, EB(0)=EB(1). Does this correspond to any sensible condition which hopefully doesn’t happen often?

Proposition 6: EB(0)=EB(1) occurs iff there’s only one minimal a-measure, of the form (0,b).

Put another way, divide by zero errors occur exactly when Murphy is like “oh cool, no matter what function they pick, the worst thing I can do is give them b value, and then nothing happens so they can’t pick up any more value than that”, so nothing matters at all. This is exactly like Bayesian renormalization failing when you condition on a probability-0 event (note that the measure component is 0 before rescaling). You give up and cry because, in the worst case, nothing you do matters.
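The shift-then-scale recipe can be sketched as follows, assuming a finite X and a finite generating set B (so this only illustrates the arithmetic, not the closure or upper-completion bookkeeping). The set B is hypothetical, and the failure case EB(0)=EB(1) is reported with an exception.

```python
def apply_measure(m, f):
    """m(f) on a finite X."""
    return sum(mi * fi for mi, fi in zip(m, f))


def expectation(B, f):
    """E_B(f) for a finite set B of (m, b) pairs."""
    return min(apply_measure(m, f) + b for m, b in B)


def renormalize(B):
    """Map each (m, b) to (m, b - E_B(0)) / (E_B(1) - E_B(0))."""
    n = len(B[0][0])
    low = expectation(B, [0.0] * n)   # E_B(0)
    high = expectation(B, [1.0] * n)  # E_B(1)
    if high == low:
        raise ZeroDivisionError("E_B(0) = E_B(1): renormalization fails")
    scale = 1.0 / (high - low)
    return [([scale * v for v in m], scale * (b - low)) for m, b in B]


B = [([0.4, 0.4], 0.2), ([0.1, 0.3], 0.3)]  # E_B(0) = 0.2, E_B(1) = 0.7
H = renormalize(B)
h_zero = expectation(H, [0.0, 0.0])
h_one = expectation(H, [1.0, 1.0])
```

After renormalizing, the expectation of the constant 0 function is 0 and the expectation of the constant 1 function is 1, up to floating-point error.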

Proposition 7: Renormalizing a (bounded) inframeasure produces a (bounded) infradistribution, if renormalization doesn’t fail.

And we’re done for now, we’ve made it up to infradistributions. Now, how can we analyze them?

Legendre-Fenchel Duality

There’s a powerful way of transforming an infradistribution to another form, to look at the same thing in two completely different mathematical contexts, and we can build up a sort of dictionary of what various different concepts are in two completely different settings, or develop concepts in one setting and figure out what they correspond to in the other setting.

An example of this sort of thing is Stone Duality, where you can represent a topological space as its lattice of open sets, to translate a huge array of concepts back and forth between topology and lattice theory and work in whichever setting is more convenient. And, working with special lattices that can’t always translate to topological spaces, you get locales and pointless topology! Duality theorems are highly fruitful.

This result will be inspired by the well-known fact that signed measures correspond to continuous linear functionals on C(X) via f↦m(f), and probability distributions, when translated over, have extra properties like monotonicity, being 1-Lipschitz, and μ(1)=1.

Theorem 4, LF-duality, Sets to Functionals: If H is an infradistribution/bounded infradistribution, then h:f↦EH(f) is concave, monotone, uniformly continuous/Lipschitz over C(X,[0,1]), h(0)=0, h(1)=1, and range(f)⊈[0,1] → h(f)=−∞.

So, expectation w.r.t an infradistribution is concave (not linear, as probability distributions are) and if f≥f′ then EH(f)≥EH(f′) (monotonicity). Paired with normalization, this means every appropriate f has EH(f)∈[0,1].

You get concavity from convexity, the −∞ thing from upper-completeness, monotonicity matches up with “all minimal points are a-measures”, Lipschitz corresponds to “all minimal points have λ≤λ⊙”, uniform continuity corresponds to the weak-minimal-bound condition, and h(0)=0,h(1)=1 obviously corresponds to normalization. This is moderately suggestive, our conditions we’re imposing on the set side are manifesting as natural conditions on the concave functional we get from H.

Is there a reverse direction? How do we start with a h:C(X,[0,1])→[0,1] that fulfills the conditions, and get an infradistribution from that?

Theorem 5, LF-Duality, Functionals to Sets: If h is a function C(X)→R that is concave, monotone, uniformly-continuous/Lipschitz, h(0)=0, h(1)=1, and range(f)⊈[0,1] → h(f)=−∞, then it specifies an infradistribution/bounded infradistribution by:

$\{(m,b) \mid b\geq (h')^{*}(m)\}$

where h′ is the function given by h′(−f)=−h(f), and (h′)∗ is the convex conjugate of h′. Also, going from an infradistribution to an h and back recovers exactly the infradistribution, and going from an h to an infradistribution and back recovers exactly h.

Another name for the convex conjugate is the Legendre-Fenchel transformation, so that’s where we get the term Legendre-Fenchel Duality from. If you want, you can shorten it as LF-duality.

So, (bounded)-infradistributions are isomorphic to concave monotone normalized Lipschitz/uniformly continuous functions h:C(X,[0,1])→R. This is the LF-Duality. We can freely translate concepts back and forth between “sets of sa-measures that fulfill some conditions” and “concave functionals on C(X,[0,1]) that fulfill some conditions”.

In particular, actual probability distributions correspond to a: linear, monotone, 1-Lipschitz, normalized functional C(X,[0,1])→R. So, in the other half of the LF-duality, probability distributions and infradistributions are very nearly the same sort of thing, the only difference between them is that the former is linear and the latter is concave! This is the essential reason why so many probability theory concepts have analogues for infradistributions.

But… what does the LF-transformation actually do? Well, continuous linear functionals C(X)→R are equivalent to signed measures, and continuous linear functionals M±(X)→R (with our KR-norm) correspond to continuous functions over X. A hyperplane in M±(X)⊕R corresponds to a point in C(X)⊕R and hyperplanes in the latter correspond to points in the former. Points in H turn into hyperplanes above h, points on-or-below the graph of h turn into hyperplanes below H.

What this transform basically does, is take each suitable f, check its value w.r.t h, convert the (f,h(f)) pair into a hyperplane in M±(X)⊕R, and go “alright, whichever set h came from must be above this hyperplane”. Eventually all the hyperplanes are drawn and you’ve recovered your infradistribution as the region above the graph of the hyperplanes.

Viewing □X as suitable concave functionals on C(X,[0,1]), we now have a natural notion of distance between infradistributions analogous to total variation distance

$d(H_1,H_2) := \sup_{f\in C(X,[0,1])} |E_{H_1}(f)-E_{H_2}(f)|$

At this point, the use of our uniform-continuity condition is clearer. A uniform limit of Lipschitz functions may not be Lipschitz. However, a uniform limit of uniformly continuous functions is uniformly continuous, so the space □X is complete (has all its limit points).

A lot of probability-theory concepts carry over to infradistributions. We have analogues of products, markov kernels, pushforwards, semidirect products, expectations, probabilities, updating, and mixtures. These are most naturally defined in the concave-functional side of the duality first, and then you can conjecture how they work on sets, and you know you got the right set (modulo closure, convex hull, and upper completion) if the expectations w.r.t your set match up with the defining property in the concave functional picture. The post was getting long-enough as is, so we won’t cover most of them in detail, and leave developing inframeasure theory more fully till later posts.

Pushforwards and Mixing

Let’s look at the first one, pushforwards of a (bounded) infradistribution via a continuous g:X→Y. The standard probability theory analogue of a pushforward is starting with a probability distribution over X, and going “what probability distribution over Y is generated by selecting a point from X according to its probability distribution and applying g?”

On the positive functional level, this is defined by: (g∗(h))(f):=h(f∘g)

Let’s take a guess as to what this is on the set level. The obvious candidate is: Take the (m,b) in H, do the usual pushforward of the measure via g to get a signed measure over Y, and keep the b term the same, getting a function g∗:Msa(X)→Msa(Y). If g is something that maps everything to the same point, our resulting measures will only be supported on one point, so we’d have to take upper-completion again in that case. But maybe if g is surjective we don’t need to do that?

Let g∗(H) be the set produced by applying g∗ to H and taking the upper completion.

Proposition 8: If f∈C(Y,[0,1]) and g:X→Y is continuous, then Eg∗(H)(f)=EH(f∘g).

Proposition 9: g∗(H) is a (bounded) inframeasure if H is, and the image of H doesn’t require upper completion if g is surjective.

Proposition 8 certifies that our unique defining (g∗(h))(f)=h(f∘g) property is fulfilled by doing this to the set of sa-measures, so we have the proper set-analogue of the pushforward. And Proposition 9 says we were right, the only thing that may go wrong is upper completion. Generally, doing operations like this may not exactly make an inframeasure, but you’re only a closed convex hull and upper completion away from your canonical form as long as you check that the expectations match up with what you’d expect on the concave functional side.
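The identity in Proposition 8 can be checked numerically in a toy case. The sketch below assumes finite X and Y, represents g as a list of indices into Y, and skips the upper completion (which doesn’t affect expectations). The set H, the map g, and the function f are all hypothetical.

```python
def apply_measure(m, f):
    """m(f) on a finite space."""
    return sum(mi * fi for mi, fi in zip(m, f))


def expectation(H, f):
    """E_H(f) for a finite set H of (m, b) pairs."""
    return min(apply_measure(m, f) + b for m, b in H)


def pushforward_measure(m, g, size_y):
    """Move the mass at each x in X onto the point g[x] in Y."""
    out = [0.0] * size_y
    for x, mass in enumerate(m):
        out[g[x]] += mass
    return out


def pushforward_set(H, g, size_y):
    """Apply the measure pushforward to each (m, b), keeping b unchanged."""
    return [(pushforward_measure(m, g, size_y), b) for m, b in H]


g = [0, 0, 1]                      # X has 3 points, Y has 2 points
H = [([0.2, 0.3, 0.5], 0.0), ([0.1, 0.1, 0.2], 0.4)]
f = [1.0, 0.25]                    # a function on Y
f_after_g = [f[g[x]] for x in range(len(g))]  # f composed with g, on X

lhs = expectation(pushforward_set(H, g, 2), f)
rhs = expectation(H, f_after_g)
```

Both sides agree, since integrating f against the pushed-forward measure is the same sum as integrating f∘g against the original one.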

Alright now, what about mixing? Like, mixing hypotheses to make a prior, according to some distribution ζ on the numbers. The concave functional analogue is (Eζhi)(f):=Eζ(hi(f))

This works just fine with no extra conditions if we have uniform continuity, but for Lipschitzness, we need an extra condition. Letting λ⊙i be the Lipschitz constant for the functional hi, we need ∑iζiλ⊙i<∞ in order for the result to be Lipschitz. The set version should just be mixing of sets.

Definition 8: Mixing Sets

Given a countable family of inframeasures Hi where i∈I⊆N, and a probability distribution ζ∈ΔI, then

$E_\zeta H_i := \{M \mid \exists c\in\prod_i H_i : \sum_i \zeta_i c(i) = M\}$

Ie, EζHi is the set of sa-measures that can be made by choosing one sa-measure from each Hi and mixing them together w.r.t. ζ.

Try sketching out two sets on a piece of paper and figure out what the 50/50 mix of them would be. This corresponds to “Murphy can pick whatever they want from each set, but they’re constrained to play the points they selected according to the probability assigned to each Hi”.

We should note that for bounded inframeasures, letting λ⊙i be the bound on the λ value of the minimal points of Hi by minimal-boundedness, we want ∑iζiλ⊙i<∞ to preserve our minimal-bounded condition for the mix.
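For a finite family of finite sets, Definition 8 can be written out directly, and we can check that the expectation of the mixed set is the ζ-average of the individual expectations (the property this post states as Proposition 10). This is only a sketch under those finiteness assumptions; the sets H1, H2 and the weights are made up.

```python
from itertools import product


def apply_measure(m, f):
    """m(f) on a finite X."""
    return sum(mi * fi for mi, fi in zip(m, f))


def expectation(H, f):
    """E_H(f) for a finite set H of (m, b) pairs."""
    return min(apply_measure(m, f) + b for m, b in H)


def mix(sets, zeta):
    """All zeta-weighted combinations of one (m, b) chosen from each set."""
    mixed = []
    for choice in product(*sets):
        n = len(choice[0][0])
        m = [sum(z * c[0][i] for z, c in zip(zeta, choice)) for i in range(n)]
        b = sum(z * c[1] for z, c in zip(zeta, choice))
        mixed.append((m, b))
    return mixed


H1 = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)]
H2 = [([0.5, 0.5], 0.0), ([0.2, 0.2], 0.5)]
zeta = [0.5, 0.5]
f = [0.8, 0.2]

mixed_value = expectation(mix([H1, H2], zeta), f)
averaged_value = sum(z * expectation(H, f) for z, H in zip(zeta, [H1, H2]))
```

The two values coincide because Murphy’s worst choice in the mix is just the worst choice in each component, weighted by ζ.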

Proposition 10: $E_{E_\zeta H_i}(f) = E_\zeta(E_{H_i}(f))$

Proposition 11: A mixture of infradistributions is an infradistribution. If it’s a mixture of bounded infradistributions with minimal point λ bounds of λ⊙i, and ∑iζiλ⊙i<∞, then the mixture is a bounded infradistribution.

Again, Proposition 10 certifies that we got the right notion on the set level, and Proposition 11 certifies that we don’t need to do any closure or upper completion cleanup. Now, how do mixtures interact with pushforwards?

Proposition 12: g∗(Eζ(Hi))=Eζ(g∗(Hi))

So, it doesn’t matter whether you mix before or after pushing your infradistribution through a continuous function. The proof of this is quite nontrivial if we were to do it on the set level because of exhaustive verification of each of the conditions, but now that we’ve shown that we have the right set-level analogue of mixing and pushforwards, we can just work entirely in the concave-functional picture and knock the proof out in a couple of lines, so our duality is doing useful work.

Updating

Let’s move on to updates. We should mention that we’ll be taking a two-step view of updating. First, there is chopping down the measure so that only the measure within the set we’re updating on remains, and then there’s renormalizing the measure back up to 1. Thinking of updating as a unitary process instead of separated into these two phases will confuse you.

First, what sorts of events can we update on? Well, we only have expectations of continuous functions, and in a classical setting, we can get probabilities from expectations (and then updates from that) by considering the expectation of an indicator function that’s 1 on a set and 0 everywhere else. Sadly, in this setting, we can only take the expectation of a continuous function. A sharp indicator function for an event will be discontinuous unless the set we’re updating on is clopen. Fortunately, the specific application of “conditioning on a finite history” (with discrete action and observation spaces) only involves conditioning on clopen sets, because the set of histories which have a certain finite history as a prefix are clopen. Similarly, if you’ve got a finite space of observations, observing “the true thing is in this subset of observations” is a clopen update.

This seems like a very harsh restriction, but we can generalize further to get something adequate for most applications. I had to rewrite this section, though, because the original motivation was too complicated and had too many moving parts, so I figured it’d be more sensible to motivate things from an entirely different direction.

Let’s call our clopen set that we’re updating on Z. Our task for the post-update infradistribution is to specify the expectation values of functions that are only defined on Z. And we only have expectation values for functions that are defined on all of X. So, we need some way to extend a function f that’s only defined on Z to be defined on all of X. The obvious way is to specify some single function g that tells you what happens outside of Z, and look at the expectation of 1Zf+(1−1Z)g , ie, “look at the function that’s f on Z, and g outside of Z”. Admittedly, the obvious way to do this is to have g=0, and that produces the best analogies to probability theory for infradistributions, and is what standard updates do. However, it’s worth looking at things in more generality and seeing what comes out.

So, our first stab (on the positive functional level) would be something like:

$h^g_Z(f) := h(1_Z f+(1-1_Z)g)$

This is just “extend your function on Z to all of X by extending it with g”. But, lamentably, this isn’t normalized. This is more like a “raw update” without normalizing afterwards. But, we know how to renormalize. So, attempt 2:

$(h|^g_Z)(f) := \frac{h^g_Z(f)-h^g_Z(0)}{h^g_Z(1)-h^g_Z(0)} = \frac{h(1_Z f+(1-1_Z)g)-h((1-1_Z)g)}{h(1_Z+(1-1_Z)g)-h((1-1_Z)g)}$

That’s the renormalized form. As a sanity check, we should make sure that it recovers standard Bayesian updates. So, let’s say our infradistribution is actually just μ, a standard probability distribution. Then,

$\frac{\mu(1_Z f+(1-1_Z)g)-\mu((1-1_Z)g)}{\mu(1_Z+(1-1_Z)g)-\mu((1-1_Z)g)} = \frac{\mu(1_Z f)}{\mu(1_Z)} = (\mu|Z)(f)$
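This sanity check is easy to reproduce numerically. The sketch below assumes a finite X, represents Z by a list of booleans, and compares the renormalized g-update of a single probability distribution with the ordinary conditional expectation. The distribution and functions are hypothetical.

```python
def expect(mu, f):
    """Ordinary expectation of f under a probability distribution mu."""
    return sum(p * v for p, v in zip(mu, f))


def update_on_set(mu, in_z, f, g):
    """Renormalized update of mu on Z, extending functions by g outside Z."""
    glued = [fi if z else gi for fi, gi, z in zip(f, g, in_z)]
    off_only = [0.0 if z else gi for gi, z in zip(g, in_z)]
    one_glued = [1.0 if z else gi for gi, z in zip(g, in_z)]
    top = expect(mu, glued) - expect(mu, off_only)
    bottom = expect(mu, one_glued) - expect(mu, off_only)
    return top / bottom


def bayes_conditional(mu, in_z, f):
    """mu(1_Z f) / mu(1_Z), the usual conditional expectation."""
    mass = sum(p for p, z in zip(mu, in_z) if z)
    return sum(p * fi for p, fi, z in zip(mu, f, in_z) if z) / mass


mu = [0.2, 0.3, 0.5]
in_z = [True, True, False]
f = [1.0, 0.5, 0.0]

with_zero_g = update_on_set(mu, in_z, f, [0.0, 0.0, 0.0])
with_other_g = update_on_set(mu, in_z, f, [0.3, 0.9, 0.7])
classical = bayes_conditional(mu, in_z, f)
```

All three numbers agree: for a single probability distribution, the choice of g cancels out of the update.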

Ok, so this recovers standard updating for actual probability distributions. But why did we introduce that g? Why does it matter how we extend our function to outside the area we’re updating on? Just use g=0, we don’t care about what’s happening outside the area we’re updating on. Well, that’s exactly the problem. We actually do care about what’s happening outside of what we’re updating on. The whole reason dynamic inconsistency comes about for standard Bayesian updating on problems like counterfactual mugging is that past-you cares about what happens down both branches of the coinflip, but once Omega tells you what the coin comes up, you stop caring about what happens down the other branch of the coinflip and so you don’t pay up. And that’s why having g≠0 can permit you to get dynamic consistency: past-you and future-you don’t have to have mismatches in what you care about.

For a single probability distribution, the reason why g doesn’t really matter is because, for probability distributions, we can split into “expected value coming from Z” and “expected value coming from outside Z” and cleanly add them up. So, specifying how well you do outside Z just adds the same constant to everything, and then the shift part of scale-and-shift deletes it.

However, for infradistributions in general, you can’t split things up like that! Your choice of g affects how different f’s score relative to each other, so it matters quite a bit how you extend functions from Z to all of X.
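A toy example makes this concrete. Assume a three-point X, Z consisting of the first two points, and a set of two probability distributions (with b=0) standing in for an infradistribution; all of these are hypothetical. Changing g outside Z flips which of two functions on Z gets the higher raw-update value, and the renormalized update is a positive rescaling and shift of the raw one, so the flip survives renormalization.

```python
def apply_measure(m, f):
    """m(f) on a finite X."""
    return sum(mi * fi for mi, fi in zip(m, f))


def expectation(H, f):
    """E_H(f) for a finite set H of (m, b) pairs."""
    return min(apply_measure(m, f) + b for m, b in H)


def raw_update(H, in_z, f, g):
    """h(1_Z f + (1 - 1_Z) g): use f inside Z and g outside Z."""
    glued = [fi if z else gi for fi, gi, z in zip(f, g, in_z)]
    return expectation(H, glued)


H = [([0.5, 0.0, 0.5], 0.0), ([0.0, 0.9, 0.1], 0.0)]
in_z = [True, True, False]

f_a = [1.0, 0.2, 0.0]  # values outside Z are ignored by raw_update
f_b = [0.3, 1.0, 0.0]
g_zero = [0.0, 0.0, 0.0]
g_one = [0.0, 0.0, 1.0]

a_beats_b_with_g_zero = raw_update(H, in_z, f_a, g_zero) > raw_update(H, in_z, f_b, g_zero)
b_beats_a_with_g_one = raw_update(H, in_z, f_b, g_one) > raw_update(H, in_z, f_a, g_one)
```

With g=0 outside Z, f_a scores higher; with g=1 outside Z, f_b does. No single probability distribution can produce this kind of reversal.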

So, there’s the motivation for updates. Really, it’s a question of “how do we extend expectations of functions in Z to a function over all of X so we can take expectations”, and it turns out that different choices of g (to extend to all of X) produce different results, while this isn’t the case with conventional probability distributions. And, while g=0 produces the best analogies to conventional probability theory and updating and has the most nice properties, g=0 pretty much corresponds to “I don’t care about what happens outside of Z”, and dynamic consistency mandates that you do care about what happens outside of the events you update on.

But still, only being able to update on clopen sets really sucks. Can we generalize it? Well, using a theorem not shown here (because this section is a later-edited addition), we can go:

“Let’s say we’ve got an infradistribution h on the space X, and a continuous function L:X→[0,1] that gives the probability of a particular point producing the observation we got. We use h and L to get a joint infradistribution over X×{0,1} (ie, did we see the observation or not). Then we update on 1 (ie, we saw the observation, this is a clopen update), and our function g tells us how good things would be for a given x conditional on not getting the observation. Our new infradistribution is over X×{1}, which is isomorphic to X. This new infradistribution has thus-and-such form.”

With that, we can update on observations where different points in X have different odds of producing your observation of interest. L is your function telling you how likely it is that a point would produce the observation you saw, g is your function telling you how well things do for a given point x if you didn’t see the observation you saw. This is basically updating on any fuzzy set with a continuous indicator function (L can be considered the indicator function for a fuzzy set), instead of just updating on clopen sets. Much more general! Also note how the logical induction paper had to use continuous indicator functions instead of sharp discontinuous ones, this is analogous. How do updates work when we’re updating on fuzzy sets?

Well, a nifty feature of a continuous likelihood function L is that it lets us glue two continuous functions f,g∈C(X,[0,1]) into another continuous function, in a similar way to our original attempt of “f in the region we’re restricting to, g outside it”

Definition 9: L-Gluing

Given three continuous functions L,f,g:X→[0,1], f glued to g via L, f★Lg, is the function defined as:

$f\star_L g := Lf+(1-L)g$

So, f★Lg behaves like f inside the region we’re updating on, and like g outside of the region we’re updating on, and the two get mixed in regions we’re unsure of.

Going back to the positive functional picture with h (our analogue of an expectation), we can now define the raw update relative to L (likelihood function) and g (off-observation value). The following form is derived from our theorem about reducing updates on fuzzy sets to sharp updates on “did I get this observation”.

$h^g_L(f) := h(f\star_L g)$, where $f\in C(\overline{\mathrm{supp}(L)},[0,1])$

And then the full version of updating on fuzzy sets in all its generality (on the concave functional side of the duality) is:

$$(h|^g_L)(f):=\frac{1}{h^g_L(1)-h^g_L(0)}\left(h(f\star_L g)-h^g_L(0)\right)=\frac{h(f\star_L g)-h(0\star_L g)}{h(1\star_L g)-h(0\star_L g)}$$

Squinting at this, this is: “do a raw update. Then subtract the excess value off (first step of renormalizing), and then rescale according to $h(1\star_L g)-h(0\star_L g)$ which… well, ignoring the g part, looks kinda like ‘expectation of 1 - expectation of 0’… oh, it’s like using the gap between the measure on X and the measure on ∅ to define how much you need to scale up a measure to be a probability distribution! That term on the bottom is the analogue of the probability of the event you’re updating on, relative to the function g!”
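Here is a rough sketch of that update formula on a finite space. It assumes (this is my simplification, not something the formula requires) that h is given as a minimum over a finite list of (m, b) pairs, with m a list of point masses and b the extra constant term. The sanity check at the bottom uses a single probability distribution and a sharp L, where the formula should reduce to ordinary conditioning.

```python
def glue(f, g, L):
    """Pointwise f ★_L g = L*f + (1 - L)*g."""
    return [l * fx + (1.0 - l) * gx for fx, gx, l in zip(f, g, L)]

def expect(H, f):
    """h(f) = min over (m, b) in the finite list H of m(f) + b."""
    return min(sum(mi * fi for mi, fi in zip(m, f)) + b for m, b in H)

def updated_value(H, L, g, f):
    """(h|^g_L)(f) = (h(f★g) - h(0★g)) / (h(1★g) - h(0★g))."""
    n = len(f)
    low = expect(H, glue([0.0] * n, g, L))
    high = expect(H, glue([1.0] * n, g, L))
    if high - low <= 0.0:
        raise ValueError("update fails: nothing is at stake on L")
    return (expect(H, glue(f, g, L)) - low) / (high - low)

# Hypothetical example: one probability distribution, L is the sharp
# indicator of {0, 1}, and f is the indicator of point 0.
H = [([0.2, 0.3, 0.5], 0.0)]
L = [1.0, 1.0, 0.0]
g = [0.7, 0.7, 0.7]
f = [1.0, 0.0, 0.0]

posterior = updated_value(H, L, g, f)
print(posterior)  # ordinary conditioning would give 0.2 / (0.2 + 0.3)
```

With only one distribution in play, the g terms cancel between numerator and denominator, which is why the choice of g does not affect this particular answer.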

Let’s flesh this out a bit more in preparation for defining updates on the set side of the duality.

Backing up to our definition of renormalization, we required that EH(1)=1, and EH(0)=0. If we pivot to viewing the function inside the expectation as our L that gives an indicator function for something, the normalization condition says something like “probability of X is 1, probability of ∅ is 0”.

Let’s generalize this and define probabilities of fuzzy events relative to a function g.

Definition 10: Probabilities of Fuzzy Sets

Given two functions g,L∈C(X,[0,1]), and a nonempty set of sa-measures B, the probability of L (interpreted as a fuzzy set) w.r.t. B and g is:

$$P^g_B(L):=E_B(1\star_L g)-E_B(0\star_L g)$$

If g is 0, then for an infradistribution H, by normalization, and unpacking what that ★ means, we get $P^0_H(L)=E_H(L)$. This is much like how the probability of a set is the expectation of the indicator function for the set, just interpret the L as the indicator function for a fuzzy set on the right, and as a fuzzy set on the left. So, for g=0, it behaves like an actual probability.

However, when g≠0, then this is better interpreted as caring-measure. $P^g_B(L)$ is the difference between the best score you can get vs Murphy and the worst score you can get vs Murphy if you know how things go outside of L (interpreted as fuzzy set). This g-dependent “probability” is actually “how much value is at stake here/how much do I care about what happens inside set L given that I know what happens outside of it.”
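A small numerical sketch of this g-dependence, again assuming a finite space and an infradistribution given by a finite list of (m, b) pairs (both are assumptions of the sketch, and the numbers are made up):

```python
def glue(f, g, L):
    """Pointwise f ★_L g = L*f + (1 - L)*g."""
    return [l * fx + (1.0 - l) * gx for fx, gx, l in zip(f, g, L)]

def expect(H, f):
    """E_H(f) = min over (m, b) in the finite list H of m(f) + b."""
    return min(sum(mi * fi for mi, fi in zip(m, f)) + b for m, b in H)

def fuzzy_prob(H, L, g):
    """P^g_H(L) = E_H(1 ★_L g) - E_H(0 ★_L g)."""
    n = len(L)
    return expect(H, glue([1.0] * n, g, L)) - expect(H, glue([0.0] * n, g, L))

# Hypothetical infradistribution built from two probability distributions.
H = [([0.5, 0.5, 0.0], 0.0), ([0.1, 0.1, 0.8], 0.0)]
L = [1.0, 0.5, 0.0]
zero = [0.0, 0.0, 0.0]
g = [0.0, 0.0, 1.0]     # things go well at point 2, which is outside L

p_zero = fuzzy_prob(H, L, zero)   # coincides with E_H(L)
p_g = fuzzy_prob(H, L, g)         # "how much is at stake on L, given g"
p_everything = fuzzy_prob(H, [1.0, 1.0, 1.0], g)  # L = 1 ignores g
print(p_zero, p_g, p_everything)
```

In this example the second distribution puts most of its mass outside L. With g=0 that makes it the worst case for L, but with a g that rewards the outside, it stops being the worst case, so the amount at stake on L changes.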

And, further, our scaling term for renormalizing an inframeasure B that isn’t an infradistribution yet was $(E_B(1)-E_B(0))^{-1}$. So, using this reformulation, our rescaling term turns into $(P^g_B(1_X))^{-1}$ regardless of g, because gluing with L=1 everywhere ignores g entirely. So, our renormalization term is “rescale according to the probability you assign to any event at all occurring”.

Alright, we have enough tools to define updating on the set level. For the raw update (no rescaling yet), we need to chop the measure down according to L. We should also fold in the off-L value (requires specifying g) to the b term, by our dynamic consistency example. And then we do appropriate scale-and-shift terms to subtract as much as we can off the b term, and rescale according to the probability of L relative to the g we’re also updating on.

Let’s use m⋅L to denote the original finite signed measure but scaled down by the function L. If we view L as an indicator function, this is just slicing out all of the measure outside of L, ie, usual conditioning.

Definition 11: Updating

$H|^g_L$, H updated on g and L, is the set made by mapping H through the following function and taking closure and upper-completion:

$$(m,b)\mapsto\frac{1}{P^g_H(L)}\left(m\cdot L,\ b+m(0\star_L g)-E_H(0\star_L g)\right)$$

Closure is unnecessary if H is a bounded inframeasure, and upper-completion is unnecessary if your L is the indicator function for a clopen set.

Roughly, this is: chop down the measure according to your fuzzy set; $m(0\star_L g)$ is the fragment of expected value you get outside of your fuzzy set, so you fold that into the b term. For rescaling, when we unpack $E_H(0\star_L g)$, it’s just $\inf_{(m',b')\in H}(m'(0\star_L g)+b')$, so that’s the maximum amount of value we can take away from the second vector component without ever making it go negative. And then rescale “how much do I care about this situation” ($P^g_H(L)$) back up to 1.
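The set-level map can be sketched on a finite space as follows. The assumptions here are mine: H is represented by a finite list of (m, b) generators, closure and upper-completion are skipped because they do not change expectations, and the chopped measure is kept indexed by all of X instead of being restricted to the support of L.

```python
def glue(f, g, L):
    """Pointwise f ★_L g = L*f + (1 - L)*g."""
    return [l * fx + (1.0 - l) * gx for fx, gx, l in zip(f, g, L)]

def expect(H, f):
    """E_H(f) = min over (m, b) in the finite list H of m(f) + b."""
    return min(sum(mi * fi for mi, fi in zip(m, f)) + b for m, b in H)

def update_set(H, L, g):
    """Map each (m, b) to (m·L, b + m(0★g) - E_H(0★g)) / P^g_H(L)."""
    n = len(L)
    off = glue([0.0] * n, g, L)                       # 0 ★_L g = (1 - L) g
    low = expect(H, off)                              # E_H(0 ★_L g)
    prob = expect(H, glue([1.0] * n, g, L)) - low     # P^g_H(L)
    if prob <= 0.0:
        raise ValueError("update fails: P^g_H(L) is not positive")
    new_H = []
    for m, b in H:
        m_off = sum(mi * oi for mi, oi in zip(m, off))       # m(0 ★_L g)
        new_m = [mi * li / prob for mi, li in zip(m, L)]     # chopped, rescaled
        new_b = (b + m_off - low) / prob                     # folded-in off-L value
        new_H.append((new_m, new_b))
    return new_H

# Hypothetical example data.
H = [([0.5, 0.5, 0.0], 0.0), ([0.1, 0.1, 0.8], 0.0)]
L = [1.0, 0.5, 0.0]
g = [0.0, 0.0, 1.0]
posterior = update_set(H, L, g)
for m, b in posterior:
    print(m, b)
```

Note that the second generator ends up with a large b term: it put most of its mass outside L, where g says things go well, and that off-L value has been folded into b.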

Note that this updating process lands us in a different vector space. Our new vector space is $M^{\pm}(\overline{\mathrm{supp}(L)})\oplus\mathbb{R}$. It still fulfills the nice properties we expect, because a closed subset of a compact space is compact so every nice property still carries over. And it still has a closed convex cone of sa-measures w.r.t. the new space, abbreviate that as $M^{sa}(L)$.

What properties can we show about updates?

Proposition 13: When updating a bounded infradistribution over $M^{sa}(X)$, if the renormalization doesn’t fail, you get a bounded infradistribution over the set $M^{sa}(L)$. (For infradistributions in general, you may have to take the closure.)

Proposition 14: $E_H(f\star_L g)=E_H(0\star_L g)+P^g_H(L)\,E_{H|^g_L}(f)$

Ok, so it’s sensible on the set level. And for proposition 14, it means we can break down the expectation of two functions glued together by L into the expectation of g outside of L, and the probability of L relative to g times the updated expectation of f. We get something interesting when we reshuffle this. It rearranges to:

$$E_{H|^g_L}(f)=\frac{1}{P^g_H(L)}\left(E_H(f\star_L g)-E_H(0\star_L g)\right)$$

Further unpacking the probability, we get

$$E_{H|^g_L}(f)=\frac{E_H(f\star_L g)-E_H(0\star_L g)}{E_H(1\star_L g)-E_H(0\star_L g)}$$

And then, translating to concave functionals via LF-duality, we get:

$$(h|^g_L)(f)=\frac{h(f\star_L g)-h(0\star_L g)}{h(1\star_L g)-h(0\star_L g)}$$

So, this shows that we got the right update! What happens when we update twice?

Proposition 15: $$(H|^g_L)|^{g'}_{L'}=H|^{\,g\star_{\frac{1-L}{1-LL'}}g'}_{LL'}$$

So, updating twice in a row produces the same effect as one big modified update. It may be a bit clearer if we express it as:

Corollary 2: Regardless of L and L′ and g, $(H|^g_L)|^g_{L'}=H|^g_{LL'}$

Corollary 3: If Y and Z are clopen sets, then, glossing over the difference between indicator functions and sets, $(H|^g_Y)|^g_Z=H|^g_{Y\cap Z}$

Now, what happens when you update a prior/mixture of infradistributions? We get something that recovers Bayesian updating.
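Before getting to mixtures, the double-update corollary (same g both times) can be checked numerically. This is only a sketch on the concave functional side, assuming a finite space and an h given as a minimum over finitely many (m, b) pairs; the specific L, L′ and g are made up.

```python
def glue(f, g, L):
    """Pointwise f ★_L g = L*f + (1 - L)*g."""
    return [l * fx + (1.0 - l) * gx for fx, gx, l in zip(f, g, L)]

def make_h(H):
    """Turn a finite list of (m, b) pairs into the functional f -> min m(f) + b."""
    return lambda f: min(sum(mi * fi for mi, fi in zip(m, f)) + b for m, b in H)

def update(h, L, g):
    """Return the functional h|^g_L, using the quotient formula for updates."""
    n = len(L)
    low = h(glue([0.0] * n, g, L))
    scale = h(glue([1.0] * n, g, L)) - low
    if scale <= 0.0:
        raise ValueError("update fails")
    return lambda f: (h(glue(f, g, L)) - low) / scale

# Hypothetical example data.
h = make_h([([0.5, 0.5, 0.0], 0.0), ([0.1, 0.1, 0.8], 0.0)])
L1 = [1.0, 0.5, 0.2]
L2 = [0.5, 1.0, 0.4]
g = [0.3, 0.6, 0.9]

twice = update(update(h, L1, g), L2, g)                    # (h|^g_L)|^g_L'
once = update(h, [a * b for a, b in zip(L1, L2)], g)       # h|^g_(LL')

f = [0.2, 0.9, 0.5]
print(twice(f), once(f))  # should agree up to floating-point error
```

The reason it works is that gluing f to g via L′ and then gluing the result to g via L is the same function as gluing f to g via the product LL′.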

Theorem 6, InfraBayes:

$$(E_\zeta H_i)|^g_L=\frac{E_\zeta\left(P^g_{H_i}(L)\cdot(H_i|^g_L)\right)}{E_\zeta\left(P^g_{H_i}(L)\right)}$$

if the update doesn’t fail.

This means that when we update a prior, it’s the same as updating everything individually, and then mixing those with probabilities weighted by the probability the infradistribution assigned to L, just like standard Bayes!
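Here is a numerical sketch of the InfraBayes identity on the functional side. It assumes a finite space, a two-component prior, and that a mixture acts on functions as the ζ-weighted average of the component expectations; all names and numbers are illustrative.

```python
def glue(f, g, L):
    """Pointwise f ★_L g = L*f + (1 - L)*g."""
    return [l * fx + (1.0 - l) * gx for fx, gx, l in zip(f, g, L)]

def make_h(H):
    """Functional f -> min over (m, b) in H of m(f) + b."""
    return lambda f: min(sum(mi * fi for mi, fi in zip(m, f)) + b for m, b in H)

def fuzzy_prob(h, L, g):
    """P^g(L) = h(1 ★_L g) - h(0 ★_L g)."""
    n = len(L)
    return h(glue([1.0] * n, g, L)) - h(glue([0.0] * n, g, L))

def update(h, L, g):
    """Return the functional h|^g_L."""
    n = len(L)
    low = h(glue([0.0] * n, g, L))
    scale = fuzzy_prob(h, L, g)
    if scale <= 0.0:
        raise ValueError("update fails")
    return lambda f: (h(glue(f, g, L)) - low) / scale

def mix(weights, hs):
    """Assumed mixture: the weighted average of the component functionals."""
    return lambda f: sum(w * h(f) for w, h in zip(weights, hs))

# Hypothetical prior over two hypotheses.
zeta = [0.3, 0.7]
hs = [make_h([([0.5, 0.5, 0.0], 0.0), ([0.1, 0.1, 0.8], 0.0)]),
      make_h([([0.2, 0.3, 0.5], 0.0)])]
L = [1.0, 0.5, 0.0]
g = [0.0, 0.0, 1.0]

lhs = update(mix(zeta, hs), L, g)            # update the whole prior at once

probs = [fuzzy_prob(h, L, g) for h in hs]    # P^g_{H_i}(L)
posts = [update(h, L, g) for h in hs]        # H_i|^g_L
norm = sum(z * p for z, p in zip(zeta, probs))
rhs = lambda f: sum(z * p * post(f) for z, p, post in zip(zeta, probs, posts)) / norm

f = [0.2, 0.9, 0.5]
print(lhs(f), rhs(f))  # should agree up to floating-point error
print([z * p / norm for z, p in zip(zeta, probs)])  # posterior weights
```

The posterior weights are the prior weights multiplied by each hypothesis’s g-relative probability of L and then renormalized, which is the analogue of multiplying by likelihoods in ordinary Bayes.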

In particular, if some hypothesis goes “nothing matters anymore” and gives up and cries after seeing L, then its probability term is 0, so it will drop out of the updated prior entirely, and now you’re only listening to hypotheses that think what you do matters. Thus, with a fairly broad prior, we don’t have to worry about the agent giving up on life because nothing matters post-update, just as long as some component in its prior gives it the will to continue living/says different policies have different values. Well, actually, we need to show an analogue of this for belief functions, but it pretty much works there too.

Additional Constructions

There are more probability theory analogues, but they are primarily material for a future post. We’ll just give their forms in the concave functional view. If they look unmotivated, just note that they match up with the standard probability-theory notions if we use infradistributions corresponding to actual probability distributions. We’ll be using lambda-notation for functions. If you haven’t seen it before, λx.f(x,2) is the function that takes in an x, and returns f(x,2). λz.(λa.a+z) is the function that maps z to the function that maps a to a+z.
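If you know Python, its `lambda` expressions are a direct match for this notation. A tiny illustration, where the two-argument function f is a made-up placeholder:

```python
# Hypothetical two-argument function, only here to have something to plug into.
f = lambda x, y: 10 * x + y

# λx.f(x,2): takes in an x, returns f(x, 2).
fix_second = lambda x: f(x, 2)

# λz.(λa.a+z): maps z to the function that maps a to a + z.
adder = lambda z: (lambda a: a + z)

print(fix_second(3))   # 32
print(adder(5)(1))     # 6
```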

Definition 12: Product

If $h_1\in\Box X$ and $h_2\in\Box Y$, the product $h_1\times h_2\in\Box(X\times Y)$ is given by:

$$(h_1\times h_2)(f):=h_1(\lambda x.h_2(\lambda y.f(x,y)))$$

These products are noncommutative! The product of bounded infradistributions is a bounded infradistribution.
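A small sketch of the noncommutativity, assuming two-point spaces X and Y and two hand-picked infradistributions: total uncertainty (the minimum over point masses) and the uniform distribution. Functions on X×Y are stored as nested lists f[x][y].

```python
def product(h1, h2):
    """(h1 × h2)(f) = h1(λx. h2(λy. f(x, y))), with f stored as f[x][y]."""
    return lambda f: h1([h2(row) for row in f])

def transpose(f):
    """Swap the roles of the two coordinates: result[y][x] = f[x][y]."""
    return [list(col) for col in zip(*f)]

# Hypothetical infradistributions on a two-point space.
knightian = lambda f: min(f)            # worst case over all point masses
uniform = lambda f: sum(f) / len(f)     # an ordinary probability distribution

# f(x, y) = 1 if x == y, else 0.
match = [[1.0, 0.0],
         [0.0, 1.0]]

a_then_b = product(knightian, uniform)(match)             # Murphy picks x first
b_then_a = product(uniform, knightian)(transpose(match))  # Murphy picks x after seeing y

print(a_then_b)  # 0.5
print(b_then_a)  # 0.0
```

In the first ordering the adversarial choice is made before the coin flip, so it cannot exploit it. In the second it is made after, and can always force a mismatch.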

There are also infrakernels, the infra-analogue of Markov kernels. A Markov kernel is a function X→ΔY that maps each x to some probability distribution over Y. Concrete example: The function mapping income to a probability distribution over house size.

Definition 13: Infrakernel

An infrakernel is a function K:X→□Y that is:

1: Pointwise convergent. For all sequences $x_n$ limiting to $x$ and all f∈C(Y,[0,1]), $\lim_{n\to\infty}K(x_n)(f)=K(x)(f)$.

2: Uniformly equicontinuous. For all ϵ, there is a δ where if |f−f′|<δ, then ∀x:|K(x)(f)−K(x)(f′)|<ϵ.

If there is some $\lambda^{\odot}>0$ where property 2 works with $\delta=\frac{\epsilon}{\lambda^{\odot}}$, then it is a Lipschitz infrakernel.

The first two conditions let you preserve uniform continuity, and the strengthening of the second condition lets you preserve Lipschitzness/being a bounded inframeasure.

Now, we can define the semidirect product, h⋉K. The semidirect product in the probability-theory case is… Consider the aforementioned Markov kernel mapping income to a probability distribution over house size. Given a starting probability distribution over income, the semidirect product of (income distribution) with (house size kernel) would be the joint distribution over income and house size. It’s a critically important concept to know that isn’t discussed much.

Definition 14: Semidirect Product

If $h\in\Box X$, and K is an infrakernel X→□Y, the semidirect product $h\ltimes K\in\Box(X\times Y)$ is given by:

$$(h\ltimes K)(f):=h(\lambda x.K(x)(\lambda y.f(x,y)))$$

Products are just a special case of this, where $K(x)=h_2$, regardless of x. If h is a bounded infradistribution and K is a Lipschitz infrakernel, then the semidirect product is a bounded infradistribution.
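As a sketch of the semidirect product on finite spaces (where the continuity conditions on K hold trivially), with a made-up h and a made-up kernel K whose behavior depends on x:

```python
def semidirect(h, K):
    """(h ⋉ K)(f) = h(λx. K(x)(λy. f(x, y))), with f stored as f[x][y]."""
    return lambda f: h([K(x)(row) for x, row in enumerate(f)])

def product(h1, h2):
    """(h1 × h2)(f) = h1(λx. h2(λy. f(x, y)))."""
    return lambda f: h1([h2(row) for row in f])

# Hypothetical infradistribution on X = {0, 1}: min over two distributions.
h = lambda f: min(0.5 * f[0] + 0.5 * f[1], 0.9 * f[0] + 0.1 * f[1])

def K(x):
    """Hypothetical infrakernel X -> □Y with Y = {0, 1}."""
    if x == 0:
        return lambda f: 0.8 * f[0] + 0.2 * f[1]   # a known distribution
    return lambda f: min(f)                        # total uncertainty

f = [[0.2, 0.6],
     [0.9, 0.4]]
joint = semidirect(h, K)(f)
print(joint)

# A constant kernel recovers the ordinary product.
uniform = lambda f: sum(f) / len(f)
constant_kernel = lambda x: uniform
print(semidirect(h, constant_kernel)(f), product(h, uniform)(f))
```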

The pushforward of a probability distribution is… given the house size Markov kernel, and a distribution over income, the pushforward is the induced distribution over house size. We earlier gave pushforwards for a continuous function g:X→Y. What’s the analogue of the pushforward for an infrakernel? Or, can you do a pushforward w.r.t. a Markov kernel?

Definition 15: Pushforward

If $h\in\Box X$, and K is an infrakernel X→□Y, the pushforward $K_*(h)\in\Box Y$ is given by:

$$K_*(h)(f):=h(\lambda x.K(x)(f))$$

And if k is a continuous (in the KR-metric) Markov kernel X→ΔY, the pushforward $k_*(h)\in\Box Y$ is given by:

$$k_*(h)(f):=h(\lambda x.E_{k(x)}(f))$$

An interesting note about this is, if h is a bounded infradistribution, then we need Lipschitz infrakernels to preserve that property for pushforwards, but we do not need any additional condition on a Markov kernel to preserve boundedness besides continuity. Exercise: Try to figure out why.

Really, everything originates from the semidirect product. The product is the special case of a semidirect product for a constant infrakernel, the pushforward is a semidirect product that’s projected down to the Y coordinate, the pushforward w.r.t. a Markov kernel is a special case of pushforward w.r.t. an infrakernel, and the pushforward w.r.t. a continuous function is a special case of pushforward w.r.t. a Markov kernel.
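That chain of special cases can be sketched on finite spaces. The helper names below are made up for this sketch, and it assumes the pushforward w.r.t. a function φ sends f to h(f∘φ).

```python
def semidirect(h, K):
    """(h ⋉ K)(f) = h(λx. K(x)(λy. f(x, y))), with f stored as f[x][y]."""
    return lambda f: h([K(x)(row) for x, row in enumerate(f)])

def pushforward(h, K, nx):
    """K_*(h)(f) = h(λx. K(x)(f)) for X = {0, ..., nx - 1}."""
    return lambda f: h([K(x)(f) for x in range(nx)])

def markov_to_infra(k):
    """View a Markov kernel x -> list of probabilities as an infrakernel."""
    return lambda x: (lambda f: sum(p * fy for p, fy in zip(k(x), f)))

def function_to_markov(phi, ny):
    """View a function phi: X -> Y as the kernel x -> point mass on phi(x)."""
    return lambda x: [1.0 if y == phi(x) else 0.0 for y in range(ny)]

# Hypothetical data: X has 2 points, Y has 3 points.
h = lambda f: min(0.5 * f[0] + 0.5 * f[1], 0.9 * f[0] + 0.1 * f[1])
K = lambda x: (lambda f: min(f[0], f[1])) if x == 0 else (lambda f: sum(f) / 3)
f = [0.3, 0.6, 0.9]

# Pushforward = semidirect product applied to a function that ignores x.
pushed = pushforward(h, K, 2)(f)
projected = semidirect(h, K)([f, f])
print(pushed, projected)

# Pushforward along a function, routed through Markov kernel and infrakernel.
phi = lambda x: 2 * x
via_kernels = pushforward(h, markov_to_infra(function_to_markov(phi, 3)), 2)(f)
direct = h([f[phi(x)] for x in range(2)])
print(via_kernels, direct)
```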

And that’s about it for now, see you in the next post!