Infra-Miscellanea

This post is a writeup of some miscellaneous results that don’t cohere into a single big post, but might possibly be of interest.

Also, in mostly unrelated news, this stuff can be a bit tricky to learn for the first time. So, I was thinking of assembling an infra version of the fixpoint exercise sheet, to help cross the gap from “There’s all these complicated posts, and it looks nifty-but-intimidating” to “I understand the basic concepts well enough to casually use them when applicable, and can work through most of the requisite math on my own”. This particular realm of math is one where it seems pretty feasible to guide someone to reinvent large chunks of it on their own with sufficiently leading discussion, and recreating something yourself does wonders for being able to actually wield it elsewhere. So, if anyone wants to chip in with assembling the problem sheet, or testing it out by spending a day trying out various math problems, please send me a message.

Result Summaries:

1: Because the vector space of signed measures on $X$ and the vector space of continuous functions $X \to R$ are dual to each other, inframeasures are dual to infrafunctions! An infrafunction can be viewed as either a concave functional mapping a measure to an expectation value to expectation values, or as a set of a-functions. Compare this to how an inframeasure can be viewed as either a concave functional mapping a function to an expectation value, or as a set of a-measures. The theory of these sorts of things is highly incomplete at the moment.

2: Convex spaces are spaces with a distinguished function $Δ X \to X$ , which takes the average of a mixture of points. $R$ being a convex space (kinda) is why the notion of “average of a distribution” exists. So, we can ask “what sorts of space $X$ have a distinguished function $□ X \to X$ ?”, to get the infra-analogue of the notion of “average”. As it turns out, it’s a perfect halfway point between probability theory, and lattice stuff/set theory. They’re convex spaces that are also complete lattices, and the probabilistic structure must interact with the order structure in the appropriate way. Specifically, given a probability distribution over subsets, mixing the sets together and taking the infinimum should produce the same result as mixing the infinima of the sets together.

3: What is the space of inframeasures over the empty set? It’s just fun working through this, in the same way as it’s fun working out why minimizing over the empty set produces infinity.

4: Besides the semidirect product and the free product, another notion of product has shown up. In the specific case of crisp infradistributions, it’s the closest match to the naive notion of the product of probability distributions, as it lets you factorize expectations in the same way as products of probability distributions do.

5: The notion of cross-entropy of crisp infradistributions is defined, rounding out the trio of “entropy, cross-entropy, KL-divergence”, where we previously had two of the three. An equality in the probability-theory case (entropy plus KL-divergence is cross-entropy) is shown to be an inequality here. Cross-entropy has a nice intuitive interpretation.

6: There’s a function $f i x : [X \to □ X] \to □ X$ (where $X$ is a compact space) that gets you the Kleene fixpoint theorem (from computer science), the Banach fixpoint theorem, and that one theorem about Markov chains having a stationary distribution.

7: A classic result of probability theory is the disintegration theorem, which roughly states that given a probability distribution $ν : Δ (X \times Y)$ , it can always be written as being produced by a probability distribution $μ : Δ X$ , and a Markov kernel $k : X \to Δ Y$ giving you the conditional probabilities. This is the theorem that secretly underlies the ability to condition a distribution nearly any event or variable you want. This fails for infradistributions, but fortunately, given any crisp infradistribution $ϕ$ , there’s a canonical choice of nearby infradistribution that is disintegrable into an infradistribution $ψ : □ X$ , and an infrakernel $K : X \to □ Y$ . So you can kinda make conditional infradistributions.

8: Consider the following computational problem. You’ve got some value function $V : E \times [0, 1] \to [0, 1]$ ( $E$ is the space of environments), which describes how much you care about getting a particular utility in a particular environment, and you’re trying to do

a r g m a x_{π} min e V (e, E_{e \cdot π} [U])

Ie, pick the policy $π$ which maximizes the worst-case value of (your environment,your utility). As it turns out, subject to some mild conditions on the derivatives of the value function, any problem of this form can be rephrased as

a r g m a x_{π} (Θ \cdot π) (U)

Namely, maximizing expected utility w.r.t some causal belief function $Θ$ . This allows recasting several other possible value functions as just the problem of “do well in an infraenvironment”.

9: The Objectively Correct topolog(ies) on spaces of inframeasures has been found. It was generated by a minor variation on the attempt made in “Less Basic Inframeasure Theory”, and so obsoletes the relevant section. The topology section wound up getting too big and got broken out into its own post.

10: It doesn’t matter whether you’re viewing inframeasures as a concave function from functions to their expectation values, or viewing them as a set of a-measures. This is LF-duality, aka the Fundamental Theorem of inframeasures. Sadly, the proof of it didn’t generalize over to more abstract cases like what shows up in domain theory. Until now!

11: For belief functions, it’s proved which sorts of them can be translated into causal belief functions without affecting the utilities particularly much. In other words, this result tells you a decision-theoretic fairness condition on which sorts of problems can be viewed as causal if you try (in an expanded sense which manages to include XOR blackmail, Newcomb, and Counterfactual Mugging). Said fairness condition comes out to be something along the lines of “if what happens now depends on what you’d do in alternate situations, the magnitude of that effect must be proportional to the time-discounted probability of actually getting into those alternate situations”

Section 1: Infrafunctions

$C_{B} (X)$ is the space of bounded continuous functions $X \to R$ , and $M^{\pm} (X)$ is the space of finite signed measures on $X$ . $X$ is just some space. Compact metric spaces are the best, though it’s possible to go beyond that.

The fundamental theorem of inframeasures aka LF-duality, says that there’s a bijection between closed, convex, upper-complete subsets of $M^{\pm} (X) \times R$ , and concave functions $C_{B} (X) \to R$ . This is nice because it lets you transition back and forth between an “analytic” way of viewing inframeasures (the concave functionals), and a “geometric” way (as a set of measure,number pairs).

Just embellish your theorem with a bunch of fiddly details about what sort of space $X$ is and what additional conditions to impose on the subsets and concave functions, and you’re done.

And the core piece that makes the fundamental theorem tick is “for every concave function, the set of hyperplanes above it is sufficient data to recover the concave function”, which isn’t especially sensitive to measures vs functions, it’s a general fact of linear algebra/functional analysis. The only extra step that’s specific to inframeasures is the interpretation of those hyperplanes as measure, number pairs.

But note that $C_{B} (X)$ (the space of bounded continuous functions $X \to R$ ), and $M^{\pm} (X)$ (the space of finite signed measures on $X$ ) are dual to each other. Continuous linear functionals $C_{B} (X) \to R$ correspond to signed measures, and continuous linear functionals $M^{\pm} (X) \to R$ correspond to bounded continuous functions, it’s a completely symmetric situation. So what if we tried doing things the other way around?

Inframeasures are what you get when you start with concave functions $C_{B} (X) \to R$ , represent them as a set of affine functions/hyperplanes, and interpret those affine functions as measure,number pairs. But what about when you start with concave functions $M^{\pm} (X) \to R$ , represent them as a set of affine functions/hyperplanes, and interpret those affine functions as function,number pairs? Well, that’s an infrafunction.

An infrafunction $F$ is either a concave function of type $M^{\pm} (X) \to R$ , or a set of function/number pairs $(f, b)$ , and we have

F (m) = min (f, b) \in F m (f) + b

Just as inframeasures respond to functions with the worst-case measure, infrafunctions respond to measures with the worst-case function.

When would these be useful? Well, they seem to capture every case where randomizing your action is better than picking what to do deterministically. If randomizing your action is better than deterministic choice, that means that the function $Δ A \to [0, 1]$ yielding the value of various distributions over actions is a concave function. And so, LF-duality will turn that concave function over distributions into a set of functions $A \to [0, 1]$ that you’re (implicitly) assuming the worst-case of. This formalizes the thing Eliezer was talking about where randomization is only helpful when you’re trying to do worst-case optimization against an adversary instead of average-case optimization.

Most of the proofs for this are utterly routine. There’s probably lots of interesting stuff to show about infrafunctions but I want to get this post out in a timely manner, so instead I’ll just observe that infrafunction theory is a pretty wide-open place to look for nifty results.

Section 2: Infra-algebras

Note: In this section, and for the proofs, the domain-theory ordering on inframeasures is used, where infinimum corresponds to union.

As motivated in the results preamble, which spaces $X$ have a distinguished function $□ X \to X$ ? This would be the infra-version of “average”. Well, the space of infradistributions contains the space of probability distributions within it. So, any space like that induces a function $Δ X \to X$ . Thus, whatever sort of space we’re looking for, it should at least be a convex space, with a notion of mixture of points.

Well, if $X$ is a convex space, is that enough to get a function $□ X \to X$ ? Not yet. We do know where the probability distributions should get mapped to, we’ve got a function $a v g_{X} : Δ X \to X$ . But, for a crisp infradistribution (ie, set of probability distributions) $ψ$ , where should that get mapped to? Well, $ψ$ is associated with a set of probability distributions $Ψ$ , so $ψ$ should probably get mapped to a point that has something to do with the set $a v g_{X} (Ψ)$ , the set you get by pushing $Ψ$ through the “take the average” function. The set $Ψ$ is compact, so the set $a v g_{X} (Ψ)$ should be compact as well.

Net question: Given a nonempty compact subset of $X$ , is there some distinguished point of $X$ associated with it? Well, that’s basically like asking for a space $X$ with a distinguished function $K X \to X$ , mapping nonempty compact sets ( $K X$ is the space of nonempty compact subsets of $X$ ) to points. As it turns out, (if the entire space $X$ is compact), a space like that is a poset with all nonempty infinima existing. It’s basically a complete lattice except it may be missing the top point. Just map your compact set to the infinimum.

So, the function we’re after (let’s call it $f l a t_{X}$ for “flatten”) should be defined something like

f l a t_{X} (ψ) := \land {a v g_{X} (μ) | μ \in Ψ}

Ie, convert $ψ$ into a set of probability distributions, take the average of all of those distributions to get a set of points in $X$ , and take the inf of that set.

This, at minimum, requires that $X$ be a convex space (so you know where to map the probability distributions to), and a poset with all nonempty infinima (so you know where to map sets of probability distributions to). But actually, this isn’t enough by itself, the mixture structure and the order structure must interact in a certain way, as we’ll get to later.

But wait, there’s lots of sorts of infradistributions, and we’ve just dealt with the crisp infradistributions (sets of probability distributions) so far! Yes. It’s important to note that there’s actually lots of inframeasure monads (space of crisp infradimeasures, space of cohomogenous inframeasures, etc...), producing different spaces of inframeasures. $□ X$ is ambiguous notation that’s bad here, because all the different possible choices of “space of inframeasures” produce a slightly different batch of properties a space $X$ needs for there to be a special function $□ X \to X$ .

It’s not too hard to work out the details for other sorts of inframeasures, but my preferred choice is the inframeasures of type $(X \to [0, 1]) \to [0, 1]$ , which are made of sets of a-measures $(λ μ, b)$ where $λ + b \leq 1$ , which heretofore haven’t been named.

The reason why is because any a-measure of that form can be expressed as an element of $Δ (X + 2)$ . Basically, the measure component $λ μ$ tells you how probability is distributed on the space $X$ , the $b$ term is the probability of a distinguished “utility” point, and $1 - λ - b$ is the probability of a distinguished “no utility” point.

So, if we can just figure out where “pure utility” and “pure lack of utility” should go, then we can go “by convexity, we know where all the a-measures should get mapped to”, and then map an inframeasure to the inf of the set it induces, like usual.

The answer is that there should be a top point of the space $X$ (we already know there’s a bottom point because of all nonempty infinima existing), and “pure utility” gets mapped to the top point, and “pure lack of utility” goes to the bottom point.

And so, we’re about ready to define the necessary-and-sufficient conditions for a compact metric space $X$ to have a distinguished function $□ X \to X$ . Where $□ X$ is the space of inframeasures on $X$ generated by sets of a-measures $(λ μ, b)$ with $λ + b \leq 1$ .

1: The space $X$ is a $Δ$ -algebra, ie there’s a distinguished continuous function $a v g_{X} : Δ X \to X$ .

2: The space $X$ is a $K$ -algebra, ie there’s a distinguished continuous function $\land_{X} : K X \to X$ . In particular, this induces an ordering on $X$ where $x \leq y$ if $\land_{X} {x, y} = x$ .

3: Under the induced ordering, there is a top point, $⊤_{X}$ .

4: The $\land_{X}$ and $a v g_{X}$ functions interact as follows. Given any $μ : Δ K X$ , a probability distribution over compact sets, we have

\land_{X} ¯ ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ ¯ {a v g_{X} (k_{*} (μ)) | k : K X \to X is measurable selection} = a v g_{X} (\land_{X *} (μ))

Condition 4 is the key one, and quite mystifying at first glance. The key insight is that it’s just implementing mixture of sets. If you’ve got two sets, $C$ and $D$ , what would the set $p C + (1 - p) D$ be? Well, you take all possible ways of picking a point from $C$ and a point from $D$ , and for each way of doing that, you average the two points together, and this gets you the set ${p c + (1 - p) d | c \in C, d \in D}$ , and that’s the obvious way to mix two sets together. If we generalize beyond this to any probability distribution over sets, you’d take all possible (measurable) ways of picking a point from each set (that’d be the function $k$ ), $k_{*} (μ)$ would be the resulting distribution over points, you average them together, and that gets you the mixture set. This whole mess is just implementing averaging sets together, we could write that whole process as $a v g_{K X} (μ)$ if we wanted.

This would turn the mysterious condition 4 into

\land_{X} (a v g_{K X} (μ)) = a v g_{X} (\land_{X *} (μ))

Which is just a commutative diagram. It’s saying that going $Δ K X \to K X \to X$ (the left-hand path) and $Δ K X \to Δ X \to X$ (the right-hand path) are the same function. Intuitively, if you have a batch of sets, it doesn’t matter whether you average them together first, and then take the infinimum of the mixture set, or whether you take the infinimum of all of the sets first, and then mix the infinima together.

And so, in informal slogan form, the conditions for a space $X$ to have a distinguished function $□ X \to X$ are:

1: You can mix points together.

2,3: The space is a complete lattice.

4: Mixing sets together and taking the inf of the mixture set is the same as taking the infs of the various sets and mixing those together.

Some minor differences show up when you adjust the sorts of spaces you’re dealing with, or the sorts of inframeasures you’re considering, but this is the basics.

Theorem 1: If $X$ is a compact metrizable space, then the four conditions above are necessary and sufficient for $X$ to be an algebra of the $□$ monad.

“algebra of the $□$ monad” is the formal category-theory way of saying “there’s a distinguished function $□ X \to X$ ”. It means that there’s a continuous function $f l a t_{X} : □ X \to X$ (which we explicitly gave earlier! That’s the unique function that works) s.t. $f l a t_{X} \circ f l a t_{X *} = f l a t_{X} \circ f l a t_{□ X}$ (both paths go $□ □ X \to □ X \to X$ ), and $f l a t_{X} \circ e_{X} = i d_{X}$ (where $e_{X} : X \to □ X$ is the unit of the monad, ie, the function mapping a point to the dirac-delta distribution on that point). Also, $□$ in this case is the monad mapping the space $X$ to the space of inframeasures $C (X, [0, 1]) \to [0, 1]$ generated by a set of a-measures $(λ μ, b)$ s.t. $λ + b \leq 1$ .

The theorem was way harder to prove than I thought. It requires expanding things into a 4x4 grid of commutative squares, requiring seven different lemmas to eventually turn one path into the other, and that was for only one direction of the proof.

As it turns out, the category of infra-algebras and the structure-preserving maps between them (affine functions that preserve infinima, top, and bottom) is somewhat interesting. If you’re an ncatlab fan, it’s the Eilenberg-Moore category of the inframeasure monad.

The product in it is the ordinary Cartesian product. Coproduct is a hideous complicated mess but it exists, and as it turns out, the coproduct of $□ X$ and $□ Y$ is $□ (X + Y)$ . The category *might* be monoidal closed, I don’t know. For the specific case of infradistributions, the monoidal product of $□ X$ and $□ Y$ should be $□ (X \times Y)$ , but generalizing that construction to all convex complete lattices is pretty hard and I haven’t done it yet. And the internal hom of $□ X$ and $□ Y$ should be the space $[X \to □ Y]$ of all lower-semicontinuous infrakernels. But again, I don’t know how to generalize that construction to all convex lattices.

Section 3: Empty Set Shenanigans

What is $□ \emptyset$ ? After all, $\emptyset$ is vacuously a compact metric space. Well, it depends on what $□$ is. If $□$ is the space of crisp infradistributions, then $□ \emptyset = \emptyset$ . Well, let’s say $□$ is the space of all $[0, 1]$ -inframeasures that can be generated by sets of a-measures $(λ μ, b)$ , with $λ + b \leq 1$ , and work it out (since I like that one).

A $[0, 1]$ -inframeasure is a concave function of type $C (X, [0, 1]) \to [0, 1]$ , fulfilling some extra properties. If $X$ is the empty set, then our elements would be the inhabitants of $C (\emptyset, [0, 1]) \to [0, 1]$ (fulfilling some extra properties). Now, there’s only one function from the empty set to anything, the empty function, which is vacuously continuous. So, the inframeasures would be the inhabitants of $1 \to [0, 1]$ , the various constant functions. Or, expressing it another way, just the space $[0, 1]$ . These functionals are all vacuously concave, continuous, and all the rest.

So, we have $□ \emptyset = [0, 1]$ . Remember, an a-measure is $(λ μ, b)$ where $b, λ$ are nonzero numbers, and $μ$ is a probability distribution. There are no probability distributions over the empty set, so all that’s left is that $+ b$ constant term.

Now, if $f_{\emptyset, X} : \emptyset \to X$ is the empty function, then what is its pushforward? What’s $f_{\emptyset, X *} : [0, 1] \to □ X$ ? Well, if $a$ is a constant, and $g$ is some function $X \to [0, 1]$ , then we have

f_{\emptyset, X *} (a) (g) = a (g \circ f_{\emptyset, X}) = a (f_{\emptyset, [0, 1]}) = a

So, the pushforward along the empty function is just the function mapping the constant $a$ to the constant infradistribution $c_{a}$ (which returns the value $a$ no matter which function is fed in).

Section 4: Independent Product

To start off with recapping the two existing notions of product...

The first notion, the semidirect product, was defined as follows. If $ψ : □ X$ and $ϕ : □ Y$ , then $ψ ⋉ ϕ : □ (X \times Y)$ is defined as

(ψ ⋉ ϕ) (f) := ψ (λ x . ϕ (λ y . f (x, y)))

This is an asymmetric notion, and the notation reflects this. $ψ ⋉ ϕ \neq ϕ ⋉ ψ$ . For the special case of crisp infradistributions, you could think of $ψ ⋉ ϕ$ as containing all the distributions whose marginal distribution on $X$ lies in $Ψ$ , and whose conditional distributions on $Y$ are all selected from $Φ$ , regardless of the $x$ . If we think of $Ψ$ and $Φ$ as being sets of available strategies for a player 1 and player 2 respectively, then $ψ ⋉ ϕ$ would be the set of probability distributions over results that are possible when player 2 is allowed to look at player 1′s result, but player 2′s set of available choices doesn’t vary as player 1′s result varies.

The second notion, free product, is defined as follows. If $ψ : □ X$ and $ϕ : □ Y$ , then $ψ \otimes ϕ : □ (X \times Y)$ is defined as

ψ \otimes ϕ := p r_{X}^{*} (ψ) \cap p r_{Y}^{*} (ϕ)

Note that $p r_{X}^{*}$ is the pullback of projection. It’s like the preimage of projection, going $□ X \to □ (X \times Y)$ . So, you just take the intersection of the two sets. For the special case of crisp infradistributions, $ψ \otimes ϕ$ consists of all the distributions whose marginal distribution on $X$ lies in $Ψ$ , and whose marginal distribution on $Y$ lies in $Φ$ . Note that taking the free product of two probability distributions produces a set of many probability distributions.

But wait… Restricting to crisp infradistributions (sets of probability distributions)… If free product is “the set of probability distributions where their marginals must be thus-and-such”, and semidirect product is “the set of probability distributions where their marginal along this coordinate is thus-and-such, and the conditional distributions along the other coordinate are thus-and-such”… That points to a missing notion of product, where the conditionals along both coordinates.

It’d be something like… $ψ \times ϕ$ is the set of all probability distributions $μ$ where, regardless of $x$ and $y$ , $μ | x \in Φ$ , and $μ | y \in Ψ$ .

Making it more formal, the independent product $ψ \times ϕ : □ (X \times Y)$ is given by

ψ \times ϕ := (⊤_{X} ⋉ ϕ) \cap (⊤_{Y} ⋉ ψ)

Where $⊤_{X}$ is the infradistribution $λ f . {min}_{x \in X} f (x)$ , the infradistribution corresponding to total uncertainty over $X$ . Similar for $⊤_{Y}$ . Technically, that equation should have a function that flips the coordinates of that second thingy from $Y \times X$ to $X \times Y$ , but whatever, it’s pretty understandable anyways.

Does this have any nice properties? Well, it’s a bit tricky to tell in general, but for the restricted special case of crisp infradistributions (sets of probability distributions), we have the following key property which also characterizes products of distributions in ordinary probability theory. And it plays a big role in evaluating expectations of factorizable functions, because you can find expectations of those functions by just decomposing it into smaller parts and multiplying them together.

Theorem 2: If $f : X \to R$ , and $g : Y \to R$ (or $[0, 1]$ as the case may be), and $ψ, ϕ$ are crisp infradistributions, then

(ψ \times ϕ) (f \cdot g) = ψ (f) \cdot ϕ (g)

Also, in the probabilistic case, we have that the entropy of a product of distributions is the sum of their individual entropies. The same thing holds here, for the generalization of entropy to sets of probability distributions.

Theorem 3: If $ψ, ϕ$ are crisp infradistributions, then $H (ψ \times ϕ) = H (ψ) + H (ϕ)$ .

Section 5: Cross-Entropy

The definition of cross-entropy, $H (ψ, ϕ)$ , when $ψ, ϕ$ are sets of probability distributions, is

H (ψ, ϕ) := min ν \in Φ max μ \in Ψ H (μ, ν)

The interpretation of this is that the cross-entropy of distributions, $H (μ, ν)$ , is “how many bits does it take on average to communicate what happened when reality is drawing from the distribution $μ$ , and you’re using a communication scheme optimized for the distribution $ν$ ”. And so, if there’s a set of possible things that reality could be ( $ψ$ ), and a set of possible codes you could use ( $ϕ$ ), and you knew you were in a worst-case scenario, you’d use an encoding from $ϕ$ that achieves fairly good efficiency over all the possible distributions in $ψ$ that reality could be selecting from. And, we’re assuming reality is being mean and trying to maximize the communication cost. This whole paragraph abbreviates as ${min}_{ν \in Φ} {max}_{μ \in Ψ} H (μ, ν)$ .

An interesting thing, however, is that this makes the usual “cross entropy equals entropy plus KL-divergence” equality into an inequality. This is because entropy, in this setting, is

H (ψ) = max μ \in Ψ H (μ)

This is actually a theorem. The original definition of entropy was way more complex. And also, KL-divergence was previously defined as

D_{K L} (ψ | ϕ) := min ν \in Φ max μ \in Ψ D_{K L} (μ | ν)

So, we get

H (ψ, ϕ) = min ν \in Φ max μ \in Ψ H (μ, ν) = min ν \in Φ max μ \in Ψ (H (μ) + D_{K L} (μ | ν))

$\leq {min}_{ν \in Φ} ({max}_{μ \in Ψ} H (μ) + {max}_{μ \in Ψ} D_{K L} (μ | ν))$

$= {max}_{μ \in Ψ} H (μ) + {min}_{ν \in Φ} {max}_{μ \in Ψ} D_{K L} (μ | ν) = H (ψ) + D_{K L} (ψ | ϕ)$

And so, we have

H (ψ, ϕ) \leq H (ψ) + D_{K L} (ψ | ϕ)

The cross-entropy/entropy/KL-divergence inequality, instead of equality.

Section 6: Iterative fixpoints

There’s a function $f i x : [X \to □ X] \to □ X$ , where, if $K : [X \to □ X]$ , then $f i x (K)$ is the limit of $⊤_{X}, K_{*} (⊤_{X}), K_{*} (K_{*} (⊤_{X})) . .$ .

This may look very familiar, and it is! It’s just a special case of Kleene’s fixpoint theorem! Kleene’s fixpoint theorem says that if you’ve got a poset where all chains have suprema, and there’s a bottom point, and your function $f : X \to X$ is monotone and preserves suprema, you can make the chain $⊥, f (⊥), f (f (⊥)), f (f (f (⊥))) . .$ and the supremum of that chain will be the least fixpoint of the function $f$ .

In our particular case, given some infrakernel $K : X \to □ X$ , the pushforward through that infrakernel, $K_{*}$ , is a function of type $□ X \to □ X$ . Since $□ X$ has a bottom (under the domain theory ordering, which is reversed from the usual ordering. And this is why we’re using $⊤_{X}$ , because that’s actually bottom under the domain theory ordering), and $□ X$ has suprema of chains (intersection), we can just throw Kleene’s fixpoint theorem at this to show that it’s well-defined.

The cool thing about this is that it manages to unify Kleene’s fixpoint theorem, the Banach fixpoint theorem, stationary distributions on Markov chains, and more.

We can’t expect fix to be continuous in the usual sense. As an example, let the space $X$ be the space $[0, 1]^{2}$ , the unit square. The sequence of functions $K_{n} : X \to X$ (extend it to a function $X \to □ X$ in the obvious way) we feed it are defined as

K_{n} (x, y) = x + \frac{1}{n} (0.5 - x), y + \frac{1}{n} (0.5 - y)

Basically, $K_{n}$ maps points closer to the center, though the rate at which it gets closer slows down as $n$ increases. Now, for all finite $n$ , $f i x (K_{n}) = δ_{0.5, 0.5}$ (the one fixpoint is the center of the square, since everything else moves towards that.) However, the limit of the $K_{n}$ is just the identity function, and then pushforward through identity is identity, so we have $f i x (K) = ⊤_{[0, 1]^{2}}$ . This is a clear example where there’s a sudden change in the least fixpoint as the function changes. The winds blowing points towards the center get weaker and weaker until they cease at the limit, and then the fixpoint suddenly goes from a single point to the entire space.

However, we can at least hope for Scott-continuity, a weakening of ordinary continuity that’s closely tied to lower-semicontinuity, and domain theory. This requires that $f i x$ is monotone and preserves suprema of chains, and this actually does hold.

Theorem 4: fix is Scott-continuous.

So, what sort of results does this produce? What if $K$ is just a function $X \to X$ ? Well then, $f i x (K)$ is $⊤_{⋂_{n} K^{n} (X)}$ . Ie, take the entire space, shove it through $K$ over and over and over and over, and eventually your set stabilizes to something where $K$ is a bijection on it.

For places where you’d normally apply the Banach fixpoint theorem, that limiting set will shrink to a point, so applying $f i x$ to a contractive function will produce a single point/the dirac-delta distribution on that point.

If $K$ is a Markov chain, mapping inputs to probability distributions over outputs, and it has a single stationary distribution, then $f i x (K)$ will give you exactly that stationary distribution.

If you’re in a domain-theory setting, and you’ve got a function $K : X \to X$ , fix will just give you the least fixpoint of the function, as expected.

Section 7: The Disintegration Theorem

So, the original formulation of the Disintegration Theorem, in the probabilistic case, says that given any probability distribution on $X \times Y$ , it’s possible to decompose it uniquely into a distribution on $X$ , and a conditional probability function $X \to Δ Y$ which is unique almost everywhere.

So, for crisp infradistributions (sets of probability distributions), we might reasonably ask whether a similar thing holds, whether they decompose into a crisp infradistribution on $X$ and an infrakernel $X \to □ Y$ .

The answer is, kinda. It certainly doesn’t hold in general, but given any crisp infradistribution, there’s always a smallest superset of it that does disintegrate like that. Here’s how it’s constructed.

To make the infradistribution on the $X$ coordinate (given some $ϕ : □ (X \times Y)$ that we’re trying to decompose into $ψ : □ X$ and $K : X \to □ Y$ ), it’s given by $ψ := p r_{X *} (ϕ)$ . Just project it onto the $X$ coordinate, as expected.

As for constructing the $K$ , that will be more difficult to express. First, the definition of enclosing sets, for $μ : Δ (X \times Y)$ .

A closed set $C \subseteq X \times Δ Y$ is $μ$ -enclosing if, for $p r_{X *} (μ)$ -almost all x, $(x, μ | x) \in C$ .

Stated informally, an enclosing set gives you a set of available conditional probability distributions for each $x$ , and the conditional probabilities of $μ$ have to be picked from that set. And there should be a weak sort of continuity where if you zoom in far enough on any particular $x$ value, the set of available options for conditional probabilities for $x$ should be approximately as-big-or-bigger than the set of options available in its immediate neighborhood.

The definition of a $μ$ -enclosing set is robust against precisely what choice of conditional probabilities we have, because by the disintegration theorem, $μ | x$ is unique for $p r_{X *} (μ)$ -almost-all x, so altering the conditional probabilities only disrupts a set of measure zero, and so can only ensure that a measure-zero set of $x$ have $(x, μ | x)$ go outside of $C$ .

Now that that’s defined, given a crisp infradistribution $ϕ$ , a closed set $C \subseteq X \times Δ Y$ is $ϕ$ -enclosing when it’s $μ$ -enclosing for all $μ \in Φ$ .

Or, in other words, the set of available conditional distributions for each $x$ has to be consistent with the conditional distributions of all the probability distributions in your set $Φ$ .

Let $C_{ϕ} \subseteq X \times Δ Y$ be defined as $⋂_{C : C is ϕ -enclosing} C$

Basically, this is the smallest allowable set of conditional distributions for each $x$ where all the probability distributions in your set $Φ$ are like “yeah, this set is big enough to have my conditional distributions in it”, and that also fulfills our closed-graph continuity-like condition.

And now can finally define our conditional infradistribution! $K (x) := C_{ϕ} (x)$ , where $C_{ϕ} \subseteq X \times Δ Y$ . We’re implicitly converting the closed subset of $X \times Δ Y$ to a closed subset of $Δ Y$ , by picking an x, and then using the usual translation from a set of probability distributions to an infradistribution.

Theorem 5: $C_{ϕ}$ is a lower-semicontinuous infrakernel, and $p r_{X *} (ϕ) ⋉ C_{ϕ}$ is the smallest disintegrable superset of $ϕ$ .

Section 8: Environment-Forming Constraints

Credit to Vanessa for this idea.

We might want to have worst-case desiderata of the form “you must do this well in these environments”. This can be expressed as a zero-sum game where you’re trying to find a policy where, regardless of environment, a value function depending on the environment and your expected utility, exceeds a particular value. (You can have the value function automatically score well for environments you want to ignore, since yu’re optimizing the worst-case.)

Ie, you’re doing

{argmax}_{π} min e V (e, E_{e \cdot π} [U])

Where $V : E \times [0, 1] \to [0, 1]$ ( $E$ is the space of environments) is the value function. And $e \cdot π$ is the distribution over histories from policy $π$ interacting with environment $e$ , and $U$ is expected utility. Increased expected utility should always be good, so we demand that $V$ is monotone in its second argument.

As it turns out, any constraint of this form (subject to some conditions on the first and second derivatives) can be reexpressed as just trying to do well vs one particular causal law/infraenvironment/set of a-environments/causal belief function/damn we really need to sort out this terminology, don’t we.

Theorem 6: Given any value function $V : E \times [0, 1] \to [0, 1]$ (where $E$ is the space of environments) which is monotone in the second argument, and has first derivatives bounded above 0 and second derivatives bounded below infinity, there is a causal law (set of infraenvironments) $Θ$ s.t

min e V (e, E_{e \cdot π} [U]) > min e V (e, E_{e \cdot π^{'}} [U])

occurs iff

Θ (π) (U) > Θ (π^{'}) (U)

Regardless of $π, π^{'}$ , and utility function $U$ .

As it turns out, if $λ x . V (e, x)$ is concave for all $e$ , you can translate it very directly to a causal law. If it’s not concave, you might get some nonlinear scaling of how the various policies do relative to each other, but the ordering of how well various policies do is unchanged. Concavity is very important here, and almost all of the difficulty with proving this theorem is concentrated in figuring out how to build a monotone function $g : [0, 1] \to [0, 1]$ (the nonlinear rescaling) s.t. $g \circ (λ x . V (e, x))$ is concave for every choice of environment $e$ .

Section 9: Inframeasure Topologies

This section entirely obsoletes section 6 in Less Basic Inframeasure Theory, and propositions 41 through 50. I mean, the theorems are correct, but they’re not about the Right Thing.

But it was too long, so I had to break it out into its own post. Suffice to say, there’s two “objectively correct” topologies on inframeasures motivated purely by continuity considerations. And the continuity considerations naturally spit out all the complicated compact-set requirements for Polish spaces, for free. There’s four highly distinct and equivalent ways to view inframeasure convergence, and all the complicated conditions on when an infrakernel is “well-behaved” can be obsoleted in one strike with “it’s a continuous function $X \to □ Y$ when you give $□ Y$ the objectively correct topology”.

Also, some of the results like LF-duality for Polish spaces were actually subtly wrong. The flaw was that $C_{B} (X)$ , the space of bounded continuous functions $X \to R$ , actually should have been equipped with a much more unusual and obscure topology than I originally thought it had. But don’t worry, everything got repaired to the correct form.

Section 10: Generalizing LF-duality

So, if there’s a Fundamental Theorem of Inframeasures, it’d definitely be LF-duality, the theorem about how it’s equally valid to look at inframeasures as concave functions fulfilling certain properties, or as sets of a-measures fulfilling certain properties. It’s equally valid to look at a concave function or the set of hyperplanes above it.

Sadly, if you dig into the mathematical guts of how this result is proven, it doesn’t generalize to the case of domain theory. Until now, it’s been impossible to take an inframeasure defined on a domain, and have an assurance that the “set of a-measures” view is faithful to the originally defined function.

But it got solved anyways. Thank “Hahn-Banch Type Theorems for Locally Convex Cones”, an excellent paper by Walter Roth on abstract generalizations of the Hahn-Banach separation theorem. Their Theorem 1 is what I disassembled and reassembled to prove what I needed.

Theorem 7: Given $X$ a locally compact and compact space, and a convex function $ϕ$ and concave Scott-continuous function $ψ$ (both of type $[X \to (- \infty, \infty]] \to (- \infty, \infty]$ or $[X \to [0, \infty]] \to [0, \infty]$ ) where $ψ \leq ϕ$ , then there is some affine, Scott-continuous function $μ$ of the same type signature, where $ψ \leq μ \leq ϕ$ .

Corollary 1, LF-duality for domains Given any $ω$ -BC domain $X$ and inframeasure $ψ : □ X$

ψ = ⊓ {λ f . m (f) + b | \forall f : [X \to (- \infty, \infty]], m (f) + b \geq ψ (f)}

So even in domain theory, it’s legitimate to view an inframeasure as the same as a set of a-measures.

Section 11: Faithful Acausal-to-Causal Translations

Referring back to “The Many Faces of Infra-Beliefs”, there was a way to (non-faithfully) translate acausal belief functions (which can encode any sort of problem where the way reality goes depends on your policy) to pseudocausal belief functions, which is, intuitively, “assume it’s possible for the predictor to make a mistake”. And there was a way to faithfully translate pseudocausal belief functions to causal belief functions, by adding an extra imaginary state, “Nirvana”, that’s 1 utility if you ever hit it.

So, all decision theory problems can be recast in causal form, by adding imaginary extra reward states!

Well… not really. There’s two complications. One of them is the acausal-to-pseudocausal translation, not all decision-theory problems survive that one. It was extensively analyzed back in “The Many Faces of Infra-Beliefs”, and the key to the translation not affecting expected values particularly much is whether a particular condition holds, which can be roughly described as “for a misprediction to have a large effect on what happens, it must be the case that you’ve got a non-negligible chance at actually entering the situation where you’re mispredicted”. It’s like a sort of fairness condition. If Omega screwed up its predictions, could you prove it wrong?

The other complication is the “if Nirvana ever happens, it’s 1 utility” part, as that isn’t a continuous utility function, that’s a lower-semicontinuous one. You could try to make Nirvana into a 1-reward state forevermore (and then things would be continuous, because Nirvana late isn’t as good), but then that introduces another complication that could change things, where you stop dealing well with decision-theory problems because the time where you’re being predicted is just way way too far in the future and it gets time-discounted into negligibility.

Although, that actually does sound somewhat sensible for good learning behavior. It seems quite difficult to learn good behavior in policy-selection problems when the gap between the action and the results is long.

Anyways, the result we want is some condition on acausal belief functions $Θ^{a}$ s.t. translating them into causal belief functions $Θ^{c}$ (with the Nirvana state counting as 1 reward forevermore) doesn’t affect expected utilities of the policies particularly much, so the good things to do post-translation are still considered pretty good things to do pre-translation, and they agree more and more as the time-discount goes to infinity and the agent ges more farsighted.

To explain what’s going on, we’ll need to introduce the $γ$ -metric on histories, a notion of difference between histories that depends on the time-discount. Remember, $γ \in [0, 1)$ , values closer to 1 are more far-sighted, and the utility functions are like

U^{γ} (h) = \frac{1}{1 - γ} \infty \sum n = 0 γ^{n} \cdot r (h_{0 : n})

Ie, the reward on timestep n depends on the first n+1 actions and observations (the string $h_{0 : n}$ ), and we do a time-discounted sum of them. This lets you assess what happens to a utility function as its time-discount goes to 1, ie, the limit of farsightedness.

Anways, the $γ$ -metric, the time-discount dependent notion of difference between histories, is

d_{γ} (h, h^{'}) = γ^{first n where h_{n} \neq h_{n}^{'}}

The intuitive reason we care about this notion of distance is that it being small means that any utility function $U^{γ}$ with the same time-discount will assign similar values to those two histories, as they only start differing rather late. It’s like “how indistinguishable are these two histories with our our given time discount”.

Let $T h e t a$ be your acausal belief function/decision theory problem that you want to translate. Then the condition to be able to translate it faithfully to a causal form is

lim γ \to 1 (sup π, π^{'}, μ \in Θ (π) (x \cdot inf ν \in Θ (π^{'}) d_{γ K R} (μ, ν) - μ (γ^{n : h_{n} ≁ π^{'}}))) = 0

Where $x$ is the upper bound to your reward function. $x \in [0, 1]$ .

This is pretty complicated and introduces some new notation, so we’ll break it down. I don’t expet the reader to spot this, but that quantity that we’re taking the limit of is always 0 or higher. For the limit and the supremum, this is basically saying “in the farsightedness limit, as the time-discount $γ$ approaches 1, we should have that for any two policies $π, π^{'}$ , and any distribution over histories $μ$ that can result from the first policy (ie, $μ \in Θ (π)$ ), the quantity in parentheses should be low or negative”

But, uh… what is that quantity in parentheses? Why do we need it to be low or negative? Reshuffling the statement “the quantity in paretheses should be 0 or lower”, it turns into

x \cdot inf ν \in Θ (π^{'}) d_{γ K R} (μ, ν) \leq μ (γ^{n : h_{n} ≁ π^{'}})

Remember, the acausal to pseudocausal translation is basically “the predictor is allowed o make mistakes”, and then the pseudocausal to causal translation is basically “if the predictor makes a mistake, you get 1 reward forever”.

If you think of $π$ as “what you’re predicted to do”, and $π^{'}$ as “what you actually do”, and $μ$ as “the distribution over histories that results from what you’re predicted to do”, then $μ (γ^{n : h_{n} ≁ π^{'}})$ is “the time-discounted reward, according to the predicted distribution over histories, from going to Nirvana because your actual behavior (the policy $π^{'}$ ) is inconsistent with was was supposed to happen (the history $h$ )”.

Basically, this quantity is “you were mispredicted, your actual policy is something different, how much time-discounted utility do you pick up from Nirvana. Remember, Nirvana happens when your policy goes off the rails of what it was predicted to do”

And what’s that quantity $x \cdot {inf}_{ν \in Θ (π^{'})} d_{γ K R} (μ, ν)$ ? Well, $d_{γ}$ is a distance metric on histories, that’s basically like “level of similarity according to utility functions with time-discount $γ$ ”. $d_{γ K R}$ is just extending this metric to probability distributions. Thinking of a probability distribution as a pile of dirt on events, the KR-distance is “effort to rearrange one dirt pile into the other dirt pile”. You can either move a small amount of dirt/measure a large distance to a very different outcome, or a large amount of dirt/measure a small distance to a very similar outcome. So, $γ K R$ -distance being low means that the two probability distributions on histories are similar at early times, and only start diverging quite late, so any utility function with time-discount $γ$ thinks that the expected value doesn’t change much. So, that quantity is like “the magnitude of the time-discounted change in the probability flow that results from being mispredicted”. The ${inf}_{ν \in Θ (π^{'})}$ part is because we’re trying to find the possible distribution that’s consistent with you being correctly predicted which is the closest to the distribution $μ$ that occurred from the wrong prediction. And the multiplication by $x$ is because, if your utility function is bounded in $[0, x]$ , larger differences in the $γ$ -KR distance make less difference to your expected values. Putting this all together, we can summarize the condition

lim γ \to 1 (sup π, π^{'}, μ \in Θ (π) (x \cdot inf ν \in Θ (π^{'}) d_{γ K R} (μ, ν) - μ (γ^{n : h_{n} ≁ π^{'}}))) = 0

as “in the farsightedness limit, regardless of your true policy ( $π^{'}$ ), predicted policy ( $π$ ), or results from the misprediction ( $μ$ ), the payoff from going off the rails of destiny and attaining Nirvana (the $μ (γ^{n})$ part) should be comparable to or greater than the time-discounted change in probability flow that results from the misprediction (the KR-distance thing). If your utility function is bounded in $[0, x]$ , that means that “1 reward forever” is a bigger boon to you, so bigger changes in probbility flow from being mispredicted are acceptable”

It’s the fairness condition of “if I’m mispredicted, the misprediction should either have little effect on what happens, or if has a large effect, I should have a comparably large chance of actually entering the situation where I’ve been mispredicted and proving the prediction wrong”.

For the theorem, $Θ^{c}$ is the causal translation of the acausal belief function $Θ$ , and $U^{γ, N}$ is the utility function that’s just $U^{γ}$ but with 1 reward forevermore if it hits a Nirvana state.

Theorem 8: For any acausal belief function $Θ$ , if

lim γ \to 1 (sup π, π^{'}, μ \in Θ (π) (x \cdot inf ν \in Θ (π^{'}) d_{γ K R} (μ, ν) - μ (γ^{n : h_{n} ≁ π^{'}}))) = 0

and the utility function $U$ has rewards bounded in $[0, x]$ , then

lim γ \to 1 sup π^{'} | Θ (π^{'}) (U^{γ}) - Θ^{c} (π^{'}) (U^{γ, N}) | = 0

So, pretty much, this theorem says that the condition we discussed means that in the farsightedness limit, regardless of the utility function $U$ , the maximum difference in expected utility between the old acausal belief function and its causal translation shrinks to 0, uniformly across all of the policies.