Latent abstractions, bootlegged
Let X_1, ..., X_n be random variables distributed according to a probability distribution p on a sample space Ω.
Defn. A (weak) natural latent of X_1, ..., X_n is a random variable Λ such that
(i) the X_i are independent conditional on Λ;
(ii) [reconstructability] p(Λ=λ | X_1, ..., ^X_i, ..., X_n) = p(Λ=λ | X_1, ..., X_n) for all i = 1, ..., n.
[This is not really reconstructability, more like a stability property. The information is contained in many parts of the system… I might also have written this down wrong]
Defn. A strong natural latent Λ additionally satisfies p(Λ | X_i) = p(Λ | X_1, ..., X_n) for all i.
Defn. A natural latent is noiseless if ?
H(Λ) = H(X_1, ..., X_n) ??
[Intuitively, Λ should contain no independent noise not accounted for by the X_i]
Causal states
Consider the equivalence relation on tuples (x_1, ..., x_n) given by (x_1, ..., x_n) ∼ (x'_1, ..., x'_n) if for all i = 1, ..., n: p(X_i = x_i | x_1, ..., ^x_i, ..., x_n) = p(X_i = x_i | x'_1, ..., ^x'_i, ..., x'_n).
We call the set of equivalence classes Ω/∼ the set of causal states.
Pushing forward the distribution p on Ω along the quotient map Ω ↠ Ω/∼ gives a noiseless (strong?) natural latent Λ.
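To make this concrete, here is a toy sketch in Python. One caveat: I read the equivalence as comparing the full conditional distributions p(X_i = · | rest) rather than only their values at x_i, and the two-coin distribution is my own example.

```python
# A toy sketch of the causal-state construction above. The equivalence
# is read as equality of the full leave-one-out conditional
# distributions; the example distribution is illustrative only.
from collections import defaultdict
from fractions import Fraction

def causal_states(p):
    """Partition the support of p by the leave-one-out conditionals."""
    def signature(x):
        sig = []
        for i in range(len(x)):
            rest = x[:i] + x[i + 1:]
            # conditional distribution p(X_i = . | x_1, ..., ^x_i, ..., x_n)
            joint = {y[i]: q for y, q in p.items()
                     if y[:i] + y[i + 1:] == rest}
            den = sum(joint.values())
            sig.append(tuple(sorted((v, q / den) for v, q in joint.items())))
        return tuple(sig)

    classes = defaultdict(list)
    for x in p:
        classes[signature(x)].append(x)
    return list(classes.values())

# Two perfectly redundant fair coins: the quotient recovers the coin.
p = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}
print(causal_states(p))  # -> [[(0, 0)], [(1, 1)]]
```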
Remark. Note that Wentworth’s natural latents are generalizations of Crutchfield causal states (and epsilon machines).
Minimality and maximality
Let X_1, ..., X_n be random variables as before and let Λ be a weak latent.
Minimality Theorem for Natural Latents. Given any other variable N such that the X_i are independent conditional on N, we have the following DAG:
Λ → N → {X_i}_i
i.e. p(X_1, ..., X_n | N) = p(X_1, ..., X_n | N, Λ)
[OR IS IT for all i?]
Maximality Theorem for Natural Latents. Given any other variable M such that the reconstructability property holds with regard to the X_i, we have
M → Λ → {X_i}_i
Some other things:
Weak latents are defined up to isomorphism?
Noiseless weak (strong?) latents are unique.
The causal states as defined above will give the noiseless weak latents.
Not all systems are easily abstractable. Consider a multivariate Gaussian distribution whose covariance matrix doesn't have a low-rank part. The covariance matrix is symmetric positive semi-definite; "no low-rank part" means that after diagonalization the eigenvalues are all roughly equal.
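As a quick illustration (a numpy sketch; the dimensions and the one-factor structure are arbitrary choices of mine, not from the text), compare the spectrum of a low-rank-plus-noise covariance with a generic one:

```python
# Contrast an "abstractable" Gaussian (low-rank common factor plus
# independent noise) with a generic one (Wishart-like covariance with
# no dominant low-rank part). Scales are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Abstractable: one strong common factor -> one dominant eigenvalue.
v = rng.normal(size=(n, 1))
cov_lowrank = 10.0 * v @ v.T + np.eye(n)

# Generic: sample covariance of random data -> eigenvalues of the same
# order, so no single direction summarizes the correlations.
A = rng.normal(size=(n, 5 * n))
cov_generic = A @ A.T / (5 * n)

for name, cov in [("low-rank+noise", cov_lowrank), ("generic", cov_generic)]:
    eigs = np.linalg.eigvalsh(cov)[::-1]  # descending order
    print(name, "top 3 eigenvalues:", np.round(eigs[:3], 1))
```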
Consider a sequence of buckets B_i, i = 1, ..., n, where each message m_j is put into two buckets: m_j → B_{2j}, B_{2j+1}. In this case the minimal latent has to remember all the messages, so the latent is large. On the other hand, we can quotient B_{2i}, B_{2i+1} ↦ B'_i: all variables become independent.
EDIT: Sam Eisenstat pointed out to me that this doesn’t work. The construction actually won’t satisfy the ‘stability criterion’.
The noiseless natural latent might not always exist. Indeed, consider a generic distribution p on 2^N. In this case, the causal state construction will just yield a copy of 2^N, and the reconstructability/stability criterion is not satisfied.
Inspired by this Shalizi paper defining local causal states. The idea is so simple and elegant I’m surprised I had never seen it before.
Basically, starting with a factored probability distribution X_t = (X_1(t), ..., X_{k_t}(t)) over a dynamical DAG D_t, we can use Crutchfield's causal state construction locally to construct a derived causal model X'_t factored over the dynamical DAG. Here X'_t is defined by considering the past and future lightcones L^-(X_t), L^+(X_t) of X_t: all those points/variables Y_{t_2} which influence X_t, respectively are influenced by X_t (in a causal, interventional sense). Now define the equivalence relation a_t ∼ b_t on realizations of L^-(X_t) (which includes X_t by definition)[1] whenever the conditional probability distributions on the future lightcones are equal: p(L^+(X_t) | a_t) = p(L^+(X_t) | b_t).
These factored probability distributions over dynamical DAGs are called ‘fields’ by physicists. Given any field F(x,t) we define a derived local causal state field ϵ(F(x,t)) in the above way. Woah!
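To make the construction concrete, here is a minimal Python sketch for a noisy elementary cellular automaton: estimate the conditional distribution of each site's future lightcone given its past lightcone, then group past lightcones with matching distributions. The lightcone depths, the noise model, and the crude exact-match clustering are illustrative assumptions of mine, not Shalizi's actual statistical procedure.

```python
# Minimal sketch of local causal states for a noisy 1-D cellular
# automaton (rule 110 with iid bit flips).
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def step(row, flip_p=0.05):
    """One update of elementary CA rule 110 with iid bit-flip noise."""
    l, c, r = np.roll(row, 1), row, np.roll(row, -1)
    idx = 4 * l + 2 * c + r
    rule = np.array([0, 1, 1, 1, 0, 1, 1, 0])  # rule 110 lookup table
    noise = rng.random(row.shape) < flip_p
    return rule[idx] ^ noise

# Simulate a space-time field (periodic boundary).
W, T = 100, 300
field = np.zeros((T, W), dtype=int)
field[0] = rng.integers(0, 2, W)
for t in range(1, T):
    field[t] = step(field[t - 1])

def past_cone(t, x, d=2):
    """Depth-d past lightcone of (t, x), including the present cell."""
    return tuple(field[t - k, (x + dx) % W]
                 for k in range(d + 1) for dx in range(-k, k + 1))

def future_cone(t, x, d=1):
    """Depth-d future lightcone of (t, x), excluding the present cell."""
    return tuple(field[t + k, (x + dx) % W]
                 for k in range(1, d + 1) for dx in range(-k, k + 1))

# Empirical conditional distribution p(L+ | L-) over all sites.
counts = defaultdict(lambda: defaultdict(int))
for t in range(2, T - 1):
    for x in range(W):
        counts[past_cone(t, x)][future_cone(t, x)] += 1

# Past cones are equivalent when their (coarsely rounded) conditional
# distributions over future cones agree; each class is a local causal state.
states = defaultdict(list)
for past, futs in counts.items():
    n = sum(futs.values())
    dist = tuple(sorted((f, round(c / n, 1)) for f, c in futs.items()))
    states[dist].append(past)

print(f"{len(counts)} distinct past cones -> {len(states)} local causal states")
```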
Some thoughts and questions
This depends on the choice of causal factorization. Sometimes these causal factorizations are given, but in full generality one probably has to consider all factorizations simultaneously, each giving a different local state presentation!
What is the Factored sets angle here?
In particular, given a stochastic process ... → X_{-1} → X_0 → X_1 → ..., the reversed process X^{BackToTheFuture}_t := X_{-t} can give a wildly different local causal field, as minimal predictors and retrodictors can be different. This can be exhibited by the random insertion process; see this paper.
Let a stochastic process X_t be given and define the (forward) causal states S_t as usual. The key 'stochastic complexity' quantity is defined as the mutual information I(S_t; X_{≤0}) of the causal states and the past. We may generalize this definition, replacing the past with the local past lightcone, to give a local stochastic complexity.
Under the assumption that the stochastic process is ergodic, the causal states form an irreducible Hidden Markov Model, and the stochastic complexity can be calculated as the entropy of the stationary distribution.
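As a small worked example (the golden mean process, a standard two-state epsilon machine; the transition matrix is a textbook example rather than anything from the post):

```python
# Stochastic complexity of the golden mean process (binary sequences
# with no two consecutive 1s): entropy of the stationary distribution
# of the two causal states.
import numpy as np

# State-to-state transition probabilities, marginalizing over symbols.
# State A: emit 0 (p=1/2, stay in A) or 1 (p=1/2, go to B).
# State B: emit 0 (p=1, go to A).
T = np.array([[0.5, 0.5],
              [1.0, 0.0]])

# Stationary distribution = left eigenvector of T with eigenvalue 1.
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Stochastic complexity C_mu = H(pi), in bits.
C_mu = -np.sum(pi * np.log2(pi))
print(pi, C_mu)  # ~[0.667, 0.333], ~0.918 bits
```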
Importantly, the stochastic complexity is different from the 'excess entropy': the mutual information of the past (lightcone) and the future (lightcone).
This potentially gives a lot of very meaningful quantities to compute. These are, I think, related to correlation functions but contain more information in general.
Note that the local causal state construction is always possible—it works in full generality. Really quite incredible!
How are local causal fields related to Wentworth’s latent natural abstractions?
Shalizi conjectures that the local causal states form a Markov field, which by Hammersley-Clifford would mean we could describe the system as a Gibbs distribution! This would prove an equivalence between the Gibbs/MaxEnt/Pitman-Koopman-Darmois theory and the conditional independence story of Natural Abstraction, roughly similar to John's early approaches.
I am not sure what the status of the conjecture is at this moment. It seems rather remarkable that such a basic fact, if true, cannot be proven. I haven’t thought about it much but perhaps it is false in a subtle way.
A Markov field factorizes over an undirected graph, which seems strictly less general than a directed graph. I'm confused about this.
Given a symmetry group G acting on the original causal model/field F(x,t) = (p, D), the action will descend to an action G ↷ ϵ(F)(x,t) on the derived local causal state field.
A stationary process X(t) is exactly one with a translation action by Z. This underlies the original epsilon machine construction of Crutchfield, namely the fact that the causal states don’t just form a set (+probability distribution) but are endowed with a monoid structure → Hidden Markov Model.
[1] In other words, by convention the Past includes the Present X_0 while the Future excludes the Present.
[Intuitively, Λ should contain no independent noise not accounted for by the X_i]
That condition doesn't work, but here are a few alternatives which do (you can pick any one of them):
Λ=(x↦P[X=x|Λ]) - most conceptually confusing at first, but most powerful/useful once you’re used to it; it’s using the trick from Minimal Map.
Require that Λ be a deterministic function of X, not just any latent variable.
H(Λ) = I(X; Λ)
(The latter two are always equivalent for any two variables X,Λ and are somewhat stronger than we need here, but they’re both equivalent to the first once we’ve already asserted the other natural latent conditions.)
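A quick numeric sanity check of that last condition on a toy joint distribution (the distributions are my own): when Λ is a deterministic function of X we get H(Λ) = I(X;Λ), and adding noise breaks the equality.

```python
# Check: H(L) = I(X;L) holds exactly when L is a deterministic
# function of X, i.e. H(L|X) = 0. Toy joint distributions below.
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def check(joint):  # joint[x, l] = P(X=x, L=l)
    H_L = H(joint.sum(axis=0))
    H_L_given_X = sum(H(row / row.sum()) * row.sum() for row in joint)
    print(f"H(L)={H_L:.3f}  I(X;L)={H_L - H_L_given_X:.3f}")

# L = X mod 2: deterministic, so H(L) = I(X;L).
det = np.zeros((4, 2))
det[np.arange(4), np.arange(4) % 2] = 0.25
check(det)  # H(L)=1.000  I(X;L)=1.000

# L = (X mod 2) flipped with prob 0.1: now H(L) > I(X;L).
noisy = det * 0.9 + det[:, ::-1] * 0.1
check(noisy)  # H(L)=1.000  I(X;L)=0.531
```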