Master’s student in applied mathematics, funded by Center on Long-Term Risk to investigate the cheating problem in safe pareto-improvements. Former dovetail fellow with @Alex_Altair.
Daniel C
A summary of Condensation and its relation to Natural Latents
Maximum redund = Maximal redund, Minimum mediator = Minimal mediator, & Naturality = Shrinking regimes of tradeoffs
Previously, we’ve shown that given constraints on the entropy H(Λ) of a natural latent variable Λ, the mediation and redundancy errors are minimized exactly when the sum of mutual information with observables ∑iI(Xi;Λ) is maximized. In addition, the entropy H(Λ) of the latent variable is exactly the parameter that represents the tradeoff between the mediation and redundancy conditions. In particular, the mediation error can only decrease as H(Λ) increases, while the redundancy errors can only increase with H(Λ).
However, there may be regimes where changes in H(Λ) can reduce the mediation error without increasing the redundancy errors, or vice versa. For instance:
If we gradually increase H(Λ) while perfectly preserving the redundancy conditions (keeping the redundancy errors at 0), then we can reduce the mediation error (as it can only decrease with increasing H(Λ)) without increasing the redundancy errors (as they stay 0). Increasing H(Λ) therefore becomes a weak pareto-improvement over the naturality conditions.
Similarly, if we gradually reduce H(Λ) while perfectly preserving the mediation condition (keeping the mediation error at 0), then we can reduce the redundancy errors without increasing the mediation error (as it stays 0).
If we define a maximum redund Λred as a latent variable that satisfies the redundancy conditions and has the maximum entropy among redunds, then H(Λ)<H(Λred) represents the regime where we can increase H(Λ) without increasing the redundancy errors, since increasing H(Λ) beyond H(Λred) would necessarily violate the redundancy condition given our assumption of maximum entropy.
Similarly, define a minimum mediator Λmed as a mediator with minimal entropy (among mediators). Then H(Λ)>H(Λmed) represents the regime where we can reduce entropy without increasing the mediation error, since reducing H(Λ) below H(Λmed) necessarily violates the mediation condition.
Combining these ideas, H(Λred)≤H(Λ)≤H(Λmed) represents the regime where changing H(Λ) actually presents a tradeoff between the mediation and redundancy errors; the minimum mediator and maximum redund mark the boundaries for when weak pareto-improvements are possible.
Maximal redunds and minimal mediators
In natural latents we care about the uniqueness of latent variables, which is why we have concepts like minimal mediators and maximal redunds:
A minimal mediator is a mediator Λ such that for any other mediator Λ′ we have H(Λ|Λ′)≈0. So a minimal mediator is an approximately deterministic function of any other mediator.
A maximal redund is a redund Λ such that for any other redund Λ′ we have H(Λ′|Λ)≈0. So any redund is approximately a deterministic function of the maximal redund.
Through a universal-property-flavored proof, we can show approximate isomorphism between any pair of minimal mediators Λ1, Λ2: since Λ1 is a minimal mediator and Λ2 is a mediator, Λ2 approximately determines Λ1, and by a symmetric argument Λ1 approximately determines Λ2. The same reasoning also lets us derive uniqueness of any pair of maximal redunds. Naturality occurs when the maximal redund converges with the minimal mediator.
However, note that the concepts of minimal mediators and maximal redunds are at least conceptually distinct from minimum mediators and maximum redunds. We shall therefore prove that these concepts are mathematically equivalent. This is useful because it’s much easier to find minimum mediators and maximum redunds computationally, but we ultimately care about the uniqueness property offered by minimal mediators and maximal redunds; proving an equivalence enables the former to inherit the uniqueness guarantees of the latter.
Minimum mediator = Minimal mediator (when minimal mediator exists)
Proof:
Let Λ1 be a minimal mediator and Λ2 be a minimum mediator.
Since Λ1 is a minimal mediator and Λ2 is a mediator, we have H(Λ1|Λ2)≤ϵ, which means H(Λ1)≤H(Λ1,Λ2)=H(Λ2)+H(Λ1|Λ2)≤H(Λ2)+ϵ.
Since Λ2 has minimal entropy, we have H(Λ2)≤H(Λ1), which means H(Λ2|Λ1)=H(Λ2)+H(Λ1|Λ2)−H(Λ1)≤ϵ.
For any other mediator Λ′, we have H(Λ2|Λ′)≤H(Λ2|Λ1)+H(Λ1|Λ′)≤2ϵ, which means Λ2 is also a minimal mediator (up to error 2ϵ).
In addition, we have H(Λ1)≤H(Λ2)+ϵ, so Λ1 is also an approximate minimum mediator.
Maximum redund = Maximal redund
Proof:
Suppose that Λ is a maximum redund and Λ′ is any other redund.
(Λ,Λ′) is a redund, since both Λ and Λ′ are deterministic functions of any Xi; since Λ is a maximum redund, we have H(Λ,Λ′)≤H(Λ).
Hence we have H(Λ)+H(Λ′|Λ)=H(Λ,Λ′)≤H(Λ). Since H(Λ′|Λ) is nonnegative, we must have H(Λ′|Λ)=0.
As a result, Λ is also a maximal redund.
Similarly, suppose that Λ is a maximal redund; then H(Λ′|Λ)≤ϵ for any redund Λ′, which means H(Λ′)≤H(Λ,Λ′)=H(Λ)+H(Λ′|Λ)≤H(Λ)+ϵ, and Λ is also an approximate maximum redund.
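To make the entropy bookkeeping concrete, here is a tiny numeric sketch (my own illustration, not from the post): take two perfectly redundant observables X1 = X2 = S with S uniform on four values. The identity latent Λ = S is a maximum redund, the coarse-graining Λ′ = S mod 2 is another redund, and the chain rule H(Λ,Λ′) = H(Λ) + H(Λ′|Λ) forces H(Λ′|Λ) = 0, exactly as in the proof.

```python
import math

def H(dist):
    """Shannon entropy in bits of a dict mapping outcome -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# S uniform on {0,1,2,3}; X1 = X2 = S, so any function of S is a redund.
p_s = {s: 0.25 for s in range(4)}

# Maximum redund L = S, and another redund L2 = S mod 2.
joint = {}
for s, p in p_s.items():
    key = (s, s % 2)
    joint[key] = joint.get(key, 0.0) + p

H_L = H(p_s)                      # H(L) = 2 bits
H_LL2 = H(joint)                  # H(L, L2)
H_L2_given_L = H_LL2 - H_L        # chain rule: H(L2 | L)

print(H_L, H_LL2, H_L2_given_L)   # 2.0 2.0 0.0
```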
Naturality as shrinking regime of tradeoffs
Recall that H(Λred)≤H(Λ)≤H(Λmed) (where Λmed is the minimum mediator and Λred is the maximum redund) represents the regime where changing H(Λ) actually presents a tradeoff between the mediation and redundancy errors. Due to the equivalence we proved, we can also think of Λmed as the minimal mediator and Λred as the maximal redund.
We also know that naturality occurs when the minimal mediator converges with the maximal redund (as a natural latent satisfies both mediation and redundancy, and the mediator determines the redund); we can picture this convergence as shrinking the gap between H(Λred) and H(Λmed). In other words, naturality occurs exactly when the regime of tradeoff (H(Λmed)−H(Λred)) between the redundancy and mediation errors is small. If we have exact naturality (H(Λred)=H(Λmed)), then pareto-improvements on the naturality conditions can always be made by nudging H(Λ) closer to H(Λred)=H(Λmed).
Combining this with our previous result, we conclude that maximizing ∑iI(Xi;Λ) at fixed H(Λ) represents strong pareto-improvements over the naturality conditions; H(Λ)<H(Λred) or H(Λ)>H(Λmed) represents the regime where we can make weak pareto-improvements by nudging H(Λ) closer to the boundary H(Λred) or H(Λmed); whereas H(Λred)<H(Λ)<H(Λmed) represents the regime of real tradeoffs between naturality conditions. An approximate natural latent exists exactly when the regime of real tradeoffs is small and we can pareto-improve towards naturality.
Natural latent maximizes sum of mutual information with each observable
Claim: Given constraints on the entropy H(Λ) of a latent variable Λ, the redundancy and mediation errors are optimized exactly when the sum of mutual information ∑iI(Xi;Λ) is maximized.
Why is this useful?
Early works on natural abstractions often talk about throwing away information while maintaining predictive power, but how should predictive power be measured? If you have multiple observables, you want to maximize predictive power for each of the observables you care about. In particular, you want your latent variable to include “shared information” that allows you to predict multiple variables at once. This is similar to maximizing ∑iI(Xi;Λ) while minimizing H(Λ) (as we’re trying to throw away information).
Maximizing the sum of mutual information is likely computationally easier than directly optimizing for the mediation and redundancy conditions, so this framing could be helpful for anyone trying to find natural latents computationally
Mediation
Mediation is one of the main conditions for naturality of latent variables. Intuitively, an (approximate) mediator Λ captures (approximately) all the correlation between a collection of observables X1,…,Xn. We will prove that a latent variable is an (optimal) approximate mediator if and only if it maximizes the sum of mutual information ∑iI(Xi;Λ) given a constraint on its entropy.
Intuition: To maximize the sum of mutual information ∑iI(Xi;Λ), it is beneficial to include in Λ information that is shared among multiple Xi (as that would increase multiple mutual information terms). A mediator is a latent variable that captures all the shared information among the Xi’s, so it will be selected for when maximizing ∑iI(Xi;Λ).
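This intuition is easy to check numerically. A toy example of my own (not from the post): let X1 = (S, N1) and X2 = (S, N2) with S, N1, N2 independent fair coins. The shared bit S and the unique bit N1 are both 1-bit latents, but S earns mutual information with both observables:

```python
import itertools
import math

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_info(pairs):
    """I(A;B) in bits from a dict mapping (a, b) -> probability."""
    pa, pb = {}, {}
    for (a, b), p in pairs.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return H(pa) + H(pb) - H(pairs)

outcomes = list(itertools.product([0, 1], repeat=3))  # (s, n1, n2), uniform

def sum_MI(latent):
    """sum_i I(Xi; latent(s,n1,n2)) for X1=(s,n1), X2=(s,n2)."""
    total = 0.0
    for obs in (lambda s, n1, n2: (s, n1), lambda s, n1, n2: (s, n2)):
        pairs = {}
        for s, n1, n2 in outcomes:
            key = (obs(s, n1, n2), latent(s, n1, n2))
            pairs[key] = pairs.get(key, 0.0) + 1 / 8
        total += mutual_info(pairs)
    return total

shared = sum_MI(lambda s, n1, n2: s)    # the shared bit scores 1 + 1
unique = sum_MI(lambda s, n1, n2: n1)   # a unique bit of X1 scores 1 + 0
print(shared, unique)                   # 2.0 1.0
```

At equal entropy, the latent made of shared information dominates the sum, matching the intuition above.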
Proof:
Note that the mediation error of Λ is equivalent to the conditional total correlation TC(X|Λ).
We have I(Xi;Λ)=H(Xi)−H(Xi|Λ); since each H(Xi) term is fixed relative to Λ, maximizing the sum of mutual information is equivalent to minimizing the sum of conditional entropies ∑iH(Xi|Λ).
We have TC(X|Λ)=∑iH(Xi|Λ)−H(X|Λ), which means ∑iH(Xi|Λ)=TC(X|Λ)+H(X|Λ).
H(X|Λ)=H(X)−I(X;Λ)≥H(X)−H(Λ). For the last inequality we used I(X;Λ)≤H(Λ); we can obtain equality if Λ is a deterministic function of X. Hence, if we have a constraint on entropy H(Λ)≤c, then we have the lower bound H(X|Λ)≥H(X)−c.
We can achieve equality on the lower bound by choosing a deterministic (w.r.t X) latent with entropy H(Λ)=c. For such Λ we have ∑iH(Xi|Λ)=TC(X|Λ)+H(X)−c, which means ∑iH(Xi|Λ) is minimized exactly when TC(X|Λ) is minimized (as that’s the only term which depends on Λ).
Since TC(X|Λ) is nonnegative and equivalent to the mediation error, we conclude that ∑iH(Xi|Λ) is minimized exactly when the mediation error is minimized, and an exact mediator (TC(X|Λ)=0) with entropy c always achieves minimal ∑iH(Xi|Λ) (hence maximal ∑iI(Xi;Λ)) among latent variables with entropy H(Λ)≤c.
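As a sanity check of these identities, here is a self-contained numeric sketch (mine, not the post's code): with X1=(S,N1), X2=(S,N2) built from independent fair bits and the deterministic latent Λ=S, we should find ∑iH(Xi|Λ) = TC(X|Λ) + H(X|Λ), a tight bound H(X|Λ) = H(X) − H(Λ), and TC(X|Λ)=0 since S screens off X1 from X2.

```python
import itertools
import math

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marg(joint, keep):
    """Marginalize a dict over tuples onto the given tuple indices."""
    out = {}
    for key, p in joint.items():
        k = tuple(key[i] for i in keep)
        out[k] = out.get(k, 0.0) + p
    return out

# Joint over (x1, x2, lam): x1=(s,n1), x2=(s,n2), lam = s (deterministic in X)
joint = {}
for s, n1, n2 in itertools.product([0, 1], repeat=3):
    joint[((s, n1), (s, n2), s)] = 1 / 8

H_lam = H(marg(joint, [2]))
H_X = H(marg(joint, [0, 1]))
sum_cond = (H(marg(joint, [0, 2])) - H_lam) + (H(marg(joint, [1, 2])) - H_lam)
H_X_given_lam = H(joint) - H_lam
TC = sum_cond - H_X_given_lam             # mediation error TC(X|L)

print(sum_cond, H_X_given_lam, TC)        # 2.0 2.0 0.0
print(abs(H_X_given_lam - (H_X - H_lam)) < 1e-9)  # bound is tight: True
```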
Redundancy
The correspondence with the redundancy condition is quite simple: the sum of the redundancy errors is ∑iH(Λ|Xi)=∑i(H(Λ)−I(Xi;Λ))=nH(Λ)−∑iI(Xi;Λ), so if we have a constraint of the form H(Λ)≥c, then ∑iH(Λ|Xi)≥nc−∑iI(Xi;Λ), and the sum of redundancy errors is minimized exactly when ∑iI(Xi;Λ) is maximized and H(Λ)=c.
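The identity is easy to verify numerically. A self-contained sketch of my own, again with toy observables X1=(S,N1), X2=(S,N2) from independent fair bits: the redund Λ=S has zero total redundancy error, the same-entropy latent Λ=N1 pays 1 bit, and in both cases ∑iH(Λ|Xi) equals nH(Λ) − ∑iI(Xi;Λ).

```python
import itertools
import math

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

outcomes = list(itertools.product([0, 1], repeat=3))  # (s, n1, n2), uniform

def red_error_and_identity(latent):
    """Return (sum_i H(L|Xi), n*H(L) - sum_i I(Xi;L)) for X1=(s,n1), X2=(s,n2)."""
    obs_fns = [lambda s, n1, n2: (s, n1), lambda s, n1, n2: (s, n2)]
    p_lam = {}
    for s, n1, n2 in outcomes:
        l = latent(s, n1, n2)
        p_lam[l] = p_lam.get(l, 0.0) + 1 / 8
    H_lam = H(p_lam)
    sum_err, sum_I = 0.0, 0.0
    for obs in obs_fns:
        p_x, p_xl = {}, {}
        for s, n1, n2 in outcomes:
            x, l = obs(s, n1, n2), latent(s, n1, n2)
            p_x[x] = p_x.get(x, 0.0) + 1 / 8
            p_xl[(x, l)] = p_xl.get((x, l), 0.0) + 1 / 8
        H_l_given_x = H(p_xl) - H(p_x)
        sum_err += H_l_given_x
        sum_I += H_lam - H_l_given_x      # I(X;L) = H(L) - H(L|X)
    return sum_err, 2 * H_lam - sum_I

print(red_error_and_identity(lambda s, n1, n2: s))   # (0.0, 0.0): L=S is a redund
print(red_error_and_identity(lambda s, n1, n2: n1))  # (1.0, 1.0): same entropy, worse
```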
Putting it together
We’ve shown that given constraints on H(Λ), both the mediation and the redundancy errors are minimized exactly when the sum of mutual information ∑iI(Xi;Λ) is maximized. We can use this to simplify the search for natural latents, and while optimizing for this quantity there is no tradeoff between the redundancy and mediation errors.
However, note that the mediation error increases as H(Λ) decreases (the mediation error for the empty latent is simply the total correlation TC(X)), while the redundancy errors increase with H(Λ) (which is why we imposed H(Λ)≤c for mediation but H(Λ)≥c for redundancy). So the entropy of the latent is exactly the parameter that represents the tradeoff between the mediation and redundancy errors.
In summary, we can picture a pareto-frontier of latent variables with maximal ∑iI(Xi;Λ) and different entropies: by ramping up the entropy of Λ along the pareto-frontier, we gradually increase the redundancy errors while reducing the mediation error, and these are the only parameters relevant for latent variables that are pareto-optimal w.r.t the naturality conditions.
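This frontier can be explored by brute force on a toy distribution (my own sketch, not the post's code): with X1=(S,N1), X2=(S,N2) from independent fair bits, enumerate all 256 binary deterministic latents Λ=f(S,N1,N2) and pick the one maximizing ∑iI(Xi;Λ). In this toy case the maximizer recovers the shared bit S, which is an exact natural latent (both errors zero).

```python
import itertools
import math

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Toy distribution: s, n1, n2 independent fair bits; X1=(s,n1), X2=(s,n2).
outcomes = list(itertools.product([0, 1], repeat=3))

def scores(labels):
    """For a deterministic binary latent (one label per outcome), return
    (sum_i I(Xi;L), mediation error TC(X|L), redundancy error sum_i H(L|Xi))."""
    lam = dict(zip(outcomes, labels))
    dists = {name: {} for name in ("l", "x1", "x2", "x1l", "x2l", "xl")}
    for s, n1, n2 in outcomes:
        x1, x2, l = (s, n1), (s, n2), lam[(s, n1, n2)]
        for name, key in [("l", l), ("x1", x1), ("x2", x2),
                          ("x1l", (x1, l)), ("x2l", (x2, l)), ("xl", (x1, x2, l))]:
            dists[name][key] = dists[name].get(key, 0.0) + 1 / 8
    h = {name: H(d) for name, d in dists.items()}
    sum_I = (h["x1"] + h["l"] - h["x1l"]) + (h["x2"] + h["l"] - h["x2l"])
    med = (h["x1l"] - h["l"]) + (h["x2l"] - h["l"]) - (h["xl"] - h["l"])
    red = (h["x1l"] - h["x1"]) + (h["x2l"] - h["x2"])
    return sum_I, med, red

best = max((scores(lbl) for lbl in itertools.product([0, 1], repeat=8)),
           key=lambda t: t[0])
print(best)  # (2.0, 0.0, 0.0): the maximizer is an exact natural latent (L ~ s)
```

Of course this toy distribution has an exact natural latent, so there is no residual tradeoff; on distributions without one, sweeping the entropy budget traces out the frontier described above.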
Great to see the concreteness of this example, some thoughts on the candidate properties:
The relationships between latent variables can change under ontology shifts, but we still want the semantics of our latent variables to remain invariant in some sense. This means we need some other variables tracking the relationships between our latent variables, & we’d update on those variables during ontology shifts. Latent variables are partly selected on the criterion that they can adapt to these changes in relationships while still maintaining the correspondence principle.
There’s a problem of whether we want to think of factorization/latent variables as a stance/frame we’re taking towards the system or an objective structural property of the system. I lean towards the former because when humans model the minds of other humans, they think of those minds as making use of similar abstractions to think & reason despite almost never observing the actual internals of human brains. The degree of freedom in the “latent variable embedding” of nuclear exchange initiation also suggests some notion of subjectivity.
The fact that we have to identify nuclear exchange initiation across a wide variety of scenarios points to ‘re-use’ being an important basis of convergent factorization. Loopiness comes in when our existing reusable abstractions affect what new abstractions/models of the world we’re able to construct.
For this we need a mechanism such that the maintenance of the mechanism is a Schelling point. Specifically, the mechanism at T+1 should reward agents for actions at time T that reinforce the mechanism itself (in particular, the actions are distributed). The incentive raises the probability of the mechanism being actualized at T+1, which in turn raises the “weight” of the reward offered by the mechanism at T+1, creating a self-fulfilling prophecy.
“Merging” forces parallelism back into sequential structures, which is why most blockchains are slow. You could make it faster by bundling a lot of actions together, but you need to make sure all actions are actually observable & checked by most of the agents (aka the data availability problem)
For translatability guarantees, we also want an answer for why agents have distinct concepts for different things, and the criteria for carving up the world model into different concepts. My sketch of an answer is that different hypotheses/agents will make use of different pieces of information under different scenarios, and having distinct reference handles to different types of information allows the hypotheses/agents to access the minimal amount of information they need.
For environment structure, we’d like an answer for what it means for there to be an object that persists through time, or for there to be two instances of the same object. One way this could work is to look at probabilistic predictions of an object over its Markov blanket, and require some sort of similarity in probabilistic predictions when we “transport” the object over spacetime
I’m less optimistic about the mind structure foundation because the interfaces that are the most natural to look at might not correspond to what we call “human concepts”, especially when the latter requires a level of flexibility not supported by the former. For instance, human concepts have different modularity structures with each other depending on context (also known as shifting structures), which basically rules out any simple correspondence with interfaces that have fixed computational structure over time. How we want to decompose a world model is an additional degree of freedom to the world model itself, and that has to come from other ontological foundations.
Seems like the main additional source of complexity is that each interface has its own local constraint, and the local constraints are coupled with each other (but lower-dimensional than the parameters themselves); whereas regular statmech usually has subsystems sharing the same global constraints (different parts of a room of ideal gas are independent given the same pressure/temperature, etc.)
To recover the regular statmech picture, suppose that the local constraints have some shared/redundant information with each other: ideally we’d like to isolate that redundant/shared information into a global constraint that all interfaces have access to, and we’d want the interfaces to be independent given the global constraint. For that we need something like relational completeness, where indexical information is encoded within the interfaces themselves, while the global constraint is shared across interfaces.
IIUC there are two scenarios to be distinguished:
One is that the die has bias p unknown to you (you have some prior over p), and you use i.i.d. rolls to estimate the bias as usual & get a maxent distribution for a new draw. The draws are independent given p but not independent given your prior, so everything works out.
The other is that the die is literally i.i.d. over your prior. In this case everything from your argument goes through: whatever bias/constraint you happen to estimate from your outcome sequence doesn’t say anything about a new i.i.d. draw because they’re uncorrelated; the new draw is just another sample from your prior.
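The contrast can be made concrete with a two-point prior (a toy sketch of mine, using a binary "die"): bias p is 0.2 or 0.8 with equal prior weight. In scenario 1 an observed outcome updates the posterior over p and shifts the predictive for the next draw; in scenario 2 the next draw is independent of history, so the predictive stays at the prior marginal.

```python
# Two-point prior over the bias p of a binary "die"
biases = {0.2: 0.5, 0.8: 0.5}

# Scenario 1: outcomes i.i.d. GIVEN p. Observing a single 1 updates the
# posterior over p, which shifts the predictive for the next draw.
post = {p: w * p for p, w in biases.items()}
z = sum(post.values())
post = {p: v / z for p, v in post.items()}
pred1 = sum(p * w for p, w in post.items())   # P(next=1 | saw 1)

# Scenario 2: outcomes literally i.i.d. over the prior marginal, so the
# next draw is independent of history and the predictive never moves.
pred2 = sum(p * w for p, w in biases.items())  # P(next=1) regardless of history

print(round(pred1, 4), round(pred2, 4))  # 0.68 0.5
```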
I think steering is basically learning, backwards, and maybe flipped sideways. In learning, you build up mutual information between yourself and the world; in steering, you spend that mutual information. You can have learning without steering—but not the other way around—because of the way time works.
Alternatively: for learning, your brain can start out in any given configuration, and it will end up in the same (small set of) final configurations (ones that reflect the world); for steering, the world can start out in any given configuration, and it will end up in the same set of target configurations.
It seems like some amount of steering without learning is possible (open-loop control): you can reduce entropy in a subsystem while increasing entropy elsewhere to maintain information conservation.
Nice, some connections with why are maximum entropy distributions so ubiquitous:
If your system is ergodic, time average = ensemble average. Hence expected constraints can be estimated by following your dynamical system over time.
If your system follows the second law, then entropy increases subject to the constraints
So the system converges to the maxent invariant distribution subject to the constraints, which is why Langevin dynamics converges to the Boltzmann distribution, and you can estimate equilibrium energy by following the particle around.
In particular, we often use maxent to derive the prior itself (=invariant measure), and when our system is out of equilibrium, we can then maximize relative entropy w.r.t our maxent prior to update our distribution
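The convergence claim can be checked in a minimal discrete analogue (my own sketch, not from the thread): a 3-state Metropolis chain satisfies detailed balance w.r.t. the Boltzmann distribution, so following the dynamics from any start converges to the maxent distribution given the energy constraint.

```python
import math

E = [0.0, 1.0, 2.5]        # state energies, temperature T = 1
n = len(E)

# Metropolis kernel: propose one of the other states uniformly,
# accept with probability min(1, exp(-(E_j - E_i))).
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            P[i][j] = (1 / (n - 1)) * min(1.0, math.exp(-(E[j] - E[i])))
    P[i][i] = 1.0 - sum(P[i])

# Follow the dynamics from a non-equilibrium start (power iteration).
pi = [1.0, 0.0, 0.0]
for _ in range(2000):
    pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

# Compare against the Boltzmann (maxent-given-energy) distribution.
Z = sum(math.exp(-e) for e in E)
boltz = [math.exp(-e) / Z for e in E]
print(max(abs(a - b) for a, b in zip(pi, boltz)) < 1e-9)  # True
```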
Congratulations!
I would guess the issue with KL relates to the fact that a bound on DKL(P‖Q) permits situations where P is small but Q is large (as we take the expectation under P), whereas JS penalizes both ways.
In particular, in the original theorem on resampling using KL divergence, the assumption bounds the KL w.r.t. the joint distribution, so there may be situations where the resampled probability is large but the original probability is small. But the intended conclusion bounds the KL under the resampled distribution, so the error on those values would be weighted much more under the resampled distribution than under the original. Since we’re taking the expectation under the resampled distribution for the conclusion, the bound on the resampling error under the original distribution becomes insufficient.
Would this still give us guarantees on the conditional distribution?
E.g. Mediation:
The mediation error is really about the expected error conditional on individual values of the latent, & it seems like there are distributions with high mediation error but low error when the latent is marginalized inside the conditional, which could be load-bearing when the agents cast out predictions on observables after updating on the latent.
The current theory is based on classical Hamiltonian mechanics, but I think the theorems apply whenever you have a Markovian coarse-graining. Fermion doubling is a problem for spacetime discretization in the quantum case, so the coarse-graining might need to be different (e.g. coarse-grain the entire Hilbert space, which might have locality issues, but that’s probably not load-bearing for algorithmic thermodynamics).
On the outside view, quantum reduces to classical (which admits Markovian coarse-graining) in the correspondence limit, so there must be some coarse-graining that works.
I also talked to Aram recently & he’s optimistic that there’s an algorithmic version of the generalized heat engine where the hot vs cold pool correspond to high vs low k-complexity strings. I’m quite interested in doing follow-up work on that
The continuous state-space is coarse-grained into discrete cells where the dynamics are approximately Markovian (the theory is currently classical), & the “laws of physics” probably refers to the stochastic matrix that specifies the transition probabilities of the discrete cells (otherwise we could probably deal with infinite precision through limit computability).
As in, take a set of variables X, then search for some set of its (non-overlapping?) subsets such that there’s a nontrivial natural latent over it? Right, it’s what we’re doing here as well.
I think the subsets can actually be partially overlapping: for instance you may have a latent that’s approximately deterministic w.r.t. (X1,X2) and (X2,X3) but not w.r.t. X2 alone; weak redundancy (approximately deterministic w.r.t. each X−i) is also an example of redunds across overlapping subsets.
Preserving mutual information terms ⟹ (Stochastic ⟹ Deterministic Natural latent)
(See this post for background about the stochastic → deterministic natural latent conjecture)
We’ve shown that given fixed H(Λ), both the redundancy and mediation errors of a latent Λ are minimized when ∑iI(Xi,Λ) is maximized, while H(Λ) is exactly the parameter that determines the tradeoff between redundancy and mediation errors (among pareto-optimal latents). We’ll discuss how this could open up new angles of attack for the stochastic → deterministic natural latent conjecture.
Suppose that we have a stochastic natural latent Λ that satisfies:
I(X2;Λ|X1)≤ϵ
I(X1;Λ|X2)≤ϵ
TC(X|Λ)=I(X1;X2|Λ)≤ϵ
From our result, we know that to construct a deterministic natural latent Λ′, all we have to do is to determine the entropy H(Λ′) and then select the latent that maximizes ∑iI(Xi,Λ′). The latter ensures that the latent is pareto-optimal w.r.t the mediation and determinism conditions, while the former selects a particular point on the pareto-frontier.
Now suppose that our stochastic natural latent has a particular amount of mutual information with the joint observables I(X1,X2;Λ). If the stochastic natural latent was a deterministic function of the observables, then we would have:
H(Λ)=I(X1,X2;Λ) (as that would imply H(Λ|X1,X2)=H(Λ)−I(X1,X2;Λ)=0)
So one heuristic for constructing a deterministic natural latent is to just set H(Λ′)=I(X1,X2;Λ) and maximize ∑iI(Xi,Λ′) given the entropy constraint (so that Λ′ hopefully captures all the mutual info between Λ and X). We will show that if Λ′ preserves the mutual information with each observable (i.e. I(Xi;Λ′)=I(Xi;Λ), i=1,2), then the mediation condition is conserved and the stochastic redundancy conditions imply the deterministic redundancy conditions.
Preserving mutual information terms ⟹ Mediation is conserved
Note that the mediation error is TC(X|Λ)=∑iH(Xi|Λ)−H(X|Λ)=∑i(H(Xi)−I(Xi;Λ))−H(X)+I(X;Λ)
Since all H(Xi) and H(X) terms are fixed relative to Λ, the mediation error is completely unchanged if we replace Λ with a deterministic latent Λ′ that satisfies I(X;Λ′)=H(Λ′)=I(X;Λ) and I(Xi;Λ′)=I(Xi;Λ) for each i.
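A numeric instance of this (my own sketch): with X1=(S,N1), X2=(S,N2) built from independent fair bits, take the stochastic latent Λ=(S,M) where M is an extra independent coin (so H(Λ|X1,X2)=1 and Λ is not deterministic), and the deterministic replacement Λ′=S. Both have I(Xi;·)=1 for each i and I(X;Λ′)=H(Λ′)=I(X;Λ)=1, and the mediation errors come out identical.

```python
import itertools
import math

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marg(joint, keep):
    out = {}
    for key, p in joint.items():
        k = tuple(key[i] for i in keep)
        out[k] = out.get(k, 0.0) + p
    return out

def mediation_error(joint):
    """TC(X|L) = H(X1|L) + H(X2|L) - H(X|L) for a dict over (x1, x2, lam)."""
    H_l = H(marg(joint, [2]))
    return (H(marg(joint, [0, 2])) - H_l) + (H(marg(joint, [1, 2])) - H_l) \
        - (H(joint) - H_l)

# X1=(s,n1), X2=(s,n2) with s, n1, n2, m independent fair bits.
stoch, det = {}, {}
for s, n1, n2, m in itertools.product([0, 1], repeat=4):
    x1, x2 = (s, n1), (s, n2)
    stoch[(x1, x2, (s, m))] = stoch.get((x1, x2, (s, m)), 0.0) + 1 / 16
    det[(x1, x2, s)] = det.get((x1, x2, s), 0.0) + 1 / 16

print(mediation_error(stoch), mediation_error(det))  # 0.0 0.0 -- identical
```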
Preserving mutual information terms ⟹ Redundancy is conserved
Note that using partial information decomposition[1], we can decompose the stochastic redundancy errors as the following:
I(X1;Λ|X2)=Syn(X1,X2;Λ)+Uniq(X1;Λ)<ϵ⟹Uniq(X1;Λ)<ϵ
I(X2;Λ|X1)=Syn(X1,X2;Λ)+Uniq(X2;Λ)<ϵ⟹Uniq(X2;Λ)<ϵ
where Syn(X1,X2;Λ) represents the synergistic information of X1 and X2 about Λ, while Uniq(X1;Λ) represents the unique information of X1 about Λ. Intuitively, I(X1;Λ|X2) represents the information that X1 has about Λ when we already have access to X2, which should include unique information that we can only derive from X1 but not X2, but also synergistic information that we can only derive when we have both X1 and X2.
We also have:
I(X1;Λ)=Red(X1,X2;Λ)+Uniq(X1;Λ)
I(X2;Λ)=Red(X1,X2;Λ)+Uniq(X2;Λ)
Intuitively, this is because I(X1;Λ) contains both the unique information about Λ that you can only derive from X1 but not X2, and also the redundant information that you can derive from either X1 or X2. Note that since 0≤Uniq(X1;Λ)≤ϵ and 0≤Uniq(X2;Λ)≤ϵ, we have
Red(X1,X2;Λ)≤I(X1;Λ)≤Red(X1,X2;Λ)+ϵ
Red(X1,X2;Λ)≤I(X2;Λ)≤Red(X1,X2;Λ)+ϵ
Similarly, we have:
I(X1,X2;Λ)=Red(X1,X2;Λ)+Uniq(X1;Λ)+Uniq(X2;Λ)+Syn(X1,X2;Λ)
where
Uniq(X1;Λ)+Uniq(X2;Λ)+Syn(X1,X2;Λ)≤2ϵ⟹Red(X1,X2;Λ)≤I(X1,X2;Λ)≤Red(X1,X2;Λ)+2ϵ
As a result, both I(X1,X2;Λ)−I(X1;Λ) and I(X1,X2;Λ)−I(X2;Λ) are bounded by 2ϵ. This means that if we can find a deterministic latent Λ′ that conserves all the relevant mutual information terms I(X1,X2;Λ), I(X1;Λ) and I(X2;Λ), then we can bound the deterministic redundancy errors:
H(Λ′|X1)=H(Λ′)−I(X1;Λ′)=I(X1,X2;Λ)−I(X1;Λ)<2ϵ
H(Λ′|X2)=H(Λ′)−I(X2;Λ′)=I(X1,X2;Λ)−I(X2;Λ)<2ϵ
Conclusion
We’ve shown that a sufficient condition for mediation and redundancy to transfer from the stochastic to deterministic case is if the deterministic latent preserves the mutual information of the stochastic latent with both the joint observable X as well as the individual observables X1 and X2. Given this, the remaining task would be to prove that such a deterministic latent always exists, or that it can preserve the mutual information terms up to some small error. In particular, if existence is guaranteed, then a tractable way to find the deterministic latent Λ′ given a stochastic latent Λ is to just set H(Λ′)=I(X;Λ) and maximize ∑iI(Xi;Λ′)
[1] Note that PID depends on a choice of redundancy measure, but our proof holds for any choice that guarantees non-negativity of the PID atoms.