Natural latent maximizes sum of mutual information with each observable
Claim: Given constraints on the entropy of a latent variable $\Lambda$, the redundancy and mediation errors are optimized exactly when the sum of mutual information $\sum_i I(X_i;\Lambda)$ is maximized.
Why is this useful?
Early works on natural abstractions often talk about throwing away information while maintaining predictive power, but how should predictive power be measured? If you have multiple observables, you want to maximize predictive power for each of the observables you care about. In particular, you want your latent variable to include “shared information” that allows you to predict multiple variables at once. This is similar to maximizing $\sum_i I(X_i;\Lambda)$ while minimizing $H(\Lambda)$ (as we’re trying to throw away information).
Maximizing the sum of mutual information is likely computationally easier than directly optimizing for the mediation and redundancy conditions, so this framing could be helpful for anyone trying to find natural latents computationally.
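To illustrate the computational angle, here is a minimal brute-force sketch. It assumes two binary observables with a hand-picked toy joint distribution, and restricts candidate latents to deterministic functions of the observables (both simplifying assumptions for tractability, not part of the claim above); it then picks the latent maximizing $\sum_i I(X_i;\Lambda)$ subject to $H(\Lambda) \le k$:

```python
# Brute-force sketch: maximize sum_i I(Xi; L) subject to H(L) <= k,
# over deterministic latents L = f(X1, X2) for two binary observables.
import itertools
import numpy as np

def entropy(q):
    """Shannon entropy (bits) of a probability array, ignoring zeros."""
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

# Toy joint p[x1, x2]: X1 and X2 share one noisy bit.
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])

def mi_sum(f, k):
    """Return sum_i I(Xi; L) for L = f(X1, X2), or -inf if H(L) > k."""
    pj = np.zeros((2, 2, 2))  # joint p(x1, x2, l)
    for idx, (x1, x2) in enumerate(itertools.product(range(2), range(2))):
        pj[x1, x2, f[idx]] = p[x1, x2]
    h_l = entropy(pj.sum(axis=(0, 1)))
    if h_l > k + 1e-9:
        return -np.inf
    # I(Xi; L) = H(Xi) + H(L) - H(Xi, L)
    i1 = entropy(p.sum(axis=1)) + h_l - entropy(pj.sum(axis=1))
    i2 = entropy(p.sum(axis=0)) + h_l - entropy(pj.sum(axis=0))
    return i1 + i2

k = 1.0  # entropy budget: one bit of latent
best = max(itertools.product(range(2), repeat=4), key=lambda f: mi_sum(f, k))
print("best f(x1, x2) table:", best, " sum of MI:", mi_sum(best, k))
```

On this joint the search recovers the shared bit (the latent that copies one of the observables), which is exactly the intuition behind the claim.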
Mediation
Mediation is one of the main conditions for naturality of latent variables. Intuitively, an (approximate) mediator $\Lambda$ captures (approximately) all the correlation between a collection of observables $X_1 \dots X_n$. We will prove that a latent variable $\Lambda$ is an (optimal) approximate mediator $\iff$ it maximizes the sum of mutual information $\sum_i I(X_i;\Lambda)$ given a constraint on entropy $H(\Lambda) \le k$.
Intuition: To maximize the sum of mutual information $\sum_i I(X_i;\Lambda)$, it is beneficial to include in $\Lambda$ information that is shared among multiple $X_i$ (as that increases multiple mutual information terms at once). A mediator is a latent variable that captures all the shared information among the $X_i$, so it will be selected for when maximizing $\sum_i I(X_i;\Lambda)$.
Proof:
Note that the mediation error of $\Lambda$ is equal to the conditional total correlation $TC(X|\Lambda) = D_{KL}\left(P(X_1 \dots X_n|\Lambda) \,\|\, P(X_1|\Lambda) \dots P(X_n|\Lambda)\right)$.
We have $\sum_i I(X_i;\Lambda) = \sum_i \left[H(X_i) - H(X_i|\Lambda)\right]$. Since each $H(X_i)$ term is fixed (it does not depend on $\Lambda$), maximizing the sum of mutual information is equivalent to minimizing the sum of conditional entropies $\sum_i H(X_i|\Lambda)$.
We have $D_{KL}\left(P(X_1 \dots X_n|\Lambda) \,\|\, P(X_1|\Lambda) \dots P(X_n|\Lambda)\right) = TC(X|\Lambda) = \sum_i H(X_i|\Lambda) - H(X|\Lambda)$, which means $\sum_i H(X_i|\Lambda) = H(X|\Lambda) + TC(X|\Lambda) = H(X) - I(X;\Lambda) + TC(X|\Lambda) \ge H(X) - H(\Lambda) + TC(X|\Lambda)$. For the last inequality we used $I(X;\Lambda) \le H(\Lambda)$; we obtain equality if $\Lambda$ is a deterministic function of $X$ (then $H(\Lambda|X) = 0$, so $I(X;\Lambda) = H(\Lambda)$).
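(The identity $TC(X|\Lambda) = \sum_i H(X_i|\Lambda) - H(X|\Lambda)$ used here follows directly from expanding the KL divergence, taking the expectation over the joint distribution of $X$ and $\Lambda$:

$$\begin{aligned} TC(X|\Lambda) &= \mathbb{E}\left[\log \frac{P(X_1 \dots X_n|\Lambda)}{P(X_1|\Lambda) \dots P(X_n|\Lambda)}\right] \\ &= \mathbb{E}\left[\log P(X|\Lambda)\right] - \sum_i \mathbb{E}\left[\log P(X_i|\Lambda)\right] \\ &= -H(X|\Lambda) + \sum_i H(X_i|\Lambda). \end{aligned}$$

)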
Hence, if we have a constraint on entropy $H(\Lambda) \le k$, then we have the lower bound $\sum_i H(X_i|\Lambda) \ge H(X) - H(\Lambda) + TC(X|\Lambda) \ge H(X) - k + TC(X|\Lambda)$.
We can achieve equality in the lower bound by choosing a latent $\Lambda$ that is deterministic w.r.t. $X$ with entropy $H(\Lambda) = k$. For such $\Lambda$ we have $\sum_i H(X_i|\Lambda) = H(X) - k + TC(X|\Lambda)$, which means $\sum_i H(X_i|\Lambda)$ is minimized exactly when $TC(X|\Lambda)$ is minimized (as that is the only term which depends on $\Lambda$).
Since $TC(X|\Lambda)$ is nonnegative and equal to the mediation error, we conclude that $\sum_i H(X_i|\Lambda)$ is minimized exactly when the mediation error is minimized, and an exact mediator ($TC(X|\Lambda) = 0$) with entropy $k$ always achieves minimal $\sum_i H(X_i|\Lambda)$ (hence maximal $\sum_i I(X_i;\Lambda)$) among latent variables with entropy $\le k$.
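As a quick numeric sanity check (a minimal sketch assuming discrete observables and an arbitrary hand-picked deterministic latent), the snippet below verifies that $I(X;\Lambda) = H(\Lambda)$ for a deterministic latent, which is exactly what makes the lower bound above achievable:

```python
# Sanity check (sketch): for a deterministic latent L = f(X1, X2),
# I(X; L) = H(L), so the lower bound in the proof is tight.
import numpy as np

rng = np.random.default_rng(0)

def entropy(q):
    """Shannon entropy (bits) of a probability array, ignoring zeros."""
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

# Random joint over (X1, X2), each taking 3 values.
p = rng.random((3, 3))
p /= p.sum()

f = lambda x1, x2: (x1 + x2) % 2  # arbitrary deterministic latent

# Build the joint p(x1, x2, l) with l = f(x1, x2).
pj = np.zeros((3, 3, 2))
for x1 in range(3):
    for x2 in range(3):
        pj[x1, x2, f(x1, x2)] = p[x1, x2]

h_l = entropy(pj.sum(axis=(0, 1)))   # H(L)
h_x = entropy(p)                     # H(X)
h_x_given_l = entropy(pj) - h_l      # H(X|L) = H(X, L) - H(L)
i_x_l = h_x - h_x_given_l            # I(X; L)
print(i_x_l, h_l)                    # equal up to floating-point error
```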
Redundancy
The correspondence with the redundancy condition is quite simple: the sum of the redundancy errors is $\sum_i H(\Lambda|X_i) = nH(\Lambda) - \sum_i I(X_i;\Lambda)$, so if we have a constraint of the form $H(\Lambda) \ge k$, then $\sum_i H(\Lambda|X_i) = nH(\Lambda) - \sum_i I(X_i;\Lambda) \ge nk - \sum_i I(X_i;\Lambda)$, and the sum of redundancy errors is minimized exactly when $\sum_i I(X_i;\Lambda)$ is maximized and $H(\Lambda) = k$.
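The first identity here is just $H(\Lambda|X_i) = H(\Lambda) - I(X_i;\Lambda)$ applied term by term:

$$\sum_i H(\Lambda|X_i) = \sum_i \left[H(\Lambda) - I(X_i;\Lambda)\right] = nH(\Lambda) - \sum_i I(X_i;\Lambda).$$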
Putting it together
We’ve shown that given constraints on $H(\Lambda)$, both the mediation and the redundancy errors are minimized exactly when the sum of mutual information $\sum_i I(X_i;\Lambda)$ is maximized. We can use this to simplify the search for natural latents, and while optimizing for this quantity there is no tradeoff between the redundancy and mediation errors.
However, note that the mediation error increases as $H(\Lambda)$ decreases (the mediation error for the empty latent is simply the total correlation $TC(X)$), while the redundancy error increases with $H(\Lambda)$ (which is why we imposed $H(\Lambda) \le k$ for mediation but $H(\Lambda) \ge k$ for redundancy). So the entropy of the latent is exactly the parameter that represents the tradeoff between the mediation and redundancy errors.
In summary, we can picture a Pareto frontier of latent variables with maximal $\sum_i I(X_i;\Lambda)$ at different entropies. By ramping up the entropy of $\Lambda$ along the Pareto frontier we gradually increase the redundancy error while reducing the mediation error, and these are the only parameters relevant for latent variables that are Pareto-optimal w.r.t. the naturality conditions.
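To make the frontier concrete, here is a toy sketch that enumerates all deterministic latents $\Lambda = f(X_1, X_2)$ over the same toy joint as in the earlier snippet (again, restricting to small deterministic latents is a simplifying assumption) and prints the points that are Pareto-optimal in (mediation error, redundancy error):

```python
# Sketch: enumerate deterministic latents L = f(X1, X2) over a toy joint and
# report (H(L), mediation error, redundancy error) for the Pareto-optimal ones.
import itertools
import numpy as np

def entropy(q):
    """Shannon entropy (bits) of a probability array, ignoring zeros."""
    q = np.asarray(q, dtype=float).ravel()
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

p = np.array([[0.4, 0.1],
              [0.1, 0.4]])  # toy joint over (X1, X2)
h_x1, h_x2 = entropy(p.sum(axis=1)), entropy(p.sum(axis=0))

points = []
for f in itertools.product(range(4), repeat=4):  # latent takes up to 4 values
    pj = np.zeros((2, 2, 4))  # joint p(x1, x2, l)
    for idx, (x1, x2) in enumerate(itertools.product(range(2), range(2))):
        pj[x1, x2, f[idx]] = p[x1, x2]
    h_l = entropy(pj.sum(axis=(0, 1)))
    h_x1l, h_x2l = entropy(pj.sum(axis=1)), entropy(pj.sum(axis=0))  # H(Xi, L)
    # Mediation error: TC(X|L) = H(X1|L) + H(X2|L) - H(X|L).
    mediation = (h_x1l - h_l) + (h_x2l - h_l) - (entropy(pj) - h_l)
    # Redundancy error: sum_i H(L|Xi) = sum_i [H(Xi, L) - H(Xi)].
    redundancy = (h_x1l - h_x1) + (h_x2l - h_x2)
    points.append((round(h_l, 3), round(mediation, 3), round(redundancy, 3)))

# Keep only points not strictly dominated on (mediation, redundancy).
pareto = sorted({pt for pt in points
                 if not any(q[1] <= pt[1] and q[2] <= pt[2] and q[1:] != pt[1:]
                            for q in points)})
for h_l, med, red in pareto:
    print(f"H(L)={h_l:5.3f}  mediation={med:5.3f}  redundancy={red:5.3f}")
```

Ramping up $H(\Lambda)$ along the printed frontier shows the mediation error falling while the redundancy error rises, matching the tradeoff described above.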