Latent abstractions, bootlegged
Let X_1, ..., X_n be random variables distributed according to a probability distribution p on a sample space Ω.
Defn. A (weak) natural latent of X_1, ..., X_n is a random variable Λ such that
(i) the X_i are independent conditional on Λ;
(ii) [reconstructability] p(Λ=λ | X_1, ..., ^X_i, ..., X_n) = p(Λ=λ | X_1, ..., X_n) for all i = 1, ..., n.
[This is not really reconstructability, more like a stability property. The information is contained in many parts of the system… I might also have written this down wrong]
Defn. A strong natural latent Λ additionally satisfies p(Λ | X_i) = p(Λ | X_1, ..., X_n) for all i.
Defn. A natural latent is noiseless if ?
H(Λ) = H(X_1, ..., X_n) ??
[Intuitively, Λ should contain no independent noise not accounted for by the X_i]
Causal states
Consider the equivalence relation on tuples (x_1, ..., x_n) given by (x_1, ..., x_n) ∼ (x'_1, ..., x'_n) if for all i = 1, ..., n: p(X_i = x_i | x_1, ..., ^x_i, ..., x_n) = p(X_i = x_i | x'_1, ..., ^x'_i, ..., x'_n).
We call the set of equivalence classes Ω/∼ the set of causal states.
Pushing forward the distribution p on Ω along the quotient map Ω ↠ Ω/∼ gives a noiseless (strong?) natural latent Λ.
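To make this concrete, here is a toy sketch in Python. One caveat: I read the equivalence as comparing the full conditional distributions p(X_i = · | rest) rather than only their values at x_i, and the two-coin distribution is my own example.

```python
# A toy sketch of the causal-state construction above. The equivalence
# is read as equality of the full leave-one-out conditional
# distributions; the example distribution is illustrative only.
from collections import defaultdict
from fractions import Fraction

def causal_states(p):
    """Partition the support of p by the leave-one-out conditionals."""
    def signature(x):
        sig = []
        for i in range(len(x)):
            rest = x[:i] + x[i + 1:]
            # conditional distribution p(X_i = . | x_1, ..., ^x_i, ..., x_n)
            joint = {y[i]: q for y, q in p.items()
                     if y[:i] + y[i + 1:] == rest}
            den = sum(joint.values())
            sig.append(tuple(sorted((v, q / den) for v, q in joint.items())))
        return tuple(sig)

    classes = defaultdict(list)
    for x in p:
        classes[signature(x)].append(x)
    return list(classes.values())

# Two perfectly redundant fair coins: the quotient recovers the coin.
p = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}
print(causal_states(p))  # -> [[(0, 0)], [(1, 1)]]
```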
Remark. Note that Wentworth’s natural latents are generalizations of Crutchfield causal states (and epsilon machines).
Minimality and maximality
Let X_1, ..., X_n be random variables as before and let Λ be a weak latent.
Minimality Theorem for Natural Latents. Given any other variable N such that the X_i are independent conditional on N, we have the following DAG:
Λ → N → {X_i}_i
i.e. p(X_1, ..., X_n | N) = p(X_1, ..., X_n | N, Λ)
[OR IS IT for all i?]
Maximality Theorem for Natural Latents. Given any other variable M such that the reconstructability property holds with regard to the X_i, we have
M → Λ → {X_i}_i
Some other things:
Weak latents are defined up to isomorphism?
Noiseless weak (strong?) latents are unique.
The causal states as defined above will give the noiseless weak latents.
Not all systems are easily abstractable. Consider a multivariate Gaussian distribution whose covariance matrix doesn't have a low-rank part. The covariance matrix is symmetric positive semi-definite; "no low-rank part" means that after diagonalization the eigenvalues are all roughly equal.
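As a quick illustration (a numpy sketch; the dimensions and the one-factor structure are arbitrary choices of mine, not from the text), compare the spectrum of a low-rank-plus-noise covariance with a generic one:

```python
# Contrast an "abstractable" Gaussian (low-rank common factor plus
# independent noise) with a generic one (Wishart-like covariance with
# no dominant low-rank part). Scales are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Abstractable: one strong common factor -> one dominant eigenvalue.
v = rng.normal(size=(n, 1))
cov_lowrank = 10.0 * v @ v.T + np.eye(n)

# Generic: sample covariance of random data -> eigenvalues of the same
# order, so no single direction summarizes the correlations.
A = rng.normal(size=(n, 5 * n))
cov_generic = A @ A.T / (5 * n)

for name, cov in [("low-rank+noise", cov_lowrank), ("generic", cov_generic)]:
    eigs = np.linalg.eigvalsh(cov)[::-1]  # descending order
    print(name, "top 3 eigenvalues:", np.round(eigs[:3], 1))
```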
Consider a sequence of buckets B_i, i = 1, ..., n, where each message m_j is put into two buckets: m_j → B_{2j}, B_{2j+1}. In this case the minimal latent has to remember all the messages, so the latent is large. On the other hand, we can quotient B_{2i}, B_{2i+1} ↦ B'_i: all variables become independent.
EDIT: Sam Eisenstat pointed out to me that this doesn’t work. The construction actually won’t satisfy the ‘stability criterion’.
The noiseless natural latent might not always exist. Indeed, consider a generic distribution p on 2^N. In this case, the causal state construction will just yield a copy of 2^N, and the reconstructability/stability criterion is not satisfied.
Inspired by this Shalizi paper defining local causal states. The idea is so simple and elegant I’m surprised I had never seen it before.
Basically, starting with a factored probability distribution X_t = (X_1(t), ..., X_{k_t}(t)) over a dynamical DAG D_t, we can use Crutchfield's causal state construction locally to construct a derived causal model X'_t factored over the dynamical DAG. Here X'_t is defined by considering the past and future lightcones L^-(X_t), L^+(X_t) of X_t: all those points/variables Y_{t_2} which influence X_t, respectively are influenced by X_t (in a causal, interventional sense). Now define the equivalence relation a_t ∼ b_t on realizations of L^-(X_t) (which includes X_t by definition)[1] whenever the conditional probability distributions on the future lightcones are equal: p(L^+(X_t) | a_t) = p(L^+(X_t) | b_t).
These factored probability distributions over dynamical DAGs are called ‘fields’ by physicists. Given any field F(x,t) we define a derived local causal state field ϵ(F(x,t)) in the above way. Woah!
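To make the construction concrete, here is a minimal Python sketch for a noisy elementary cellular automaton: estimate the conditional distribution of each site's future lightcone given its past lightcone, then group past lightcones with matching distributions. The lightcone depths, the noise model, and the crude exact-match clustering are illustrative assumptions of mine, not Shalizi's actual statistical procedure.

```python
# Minimal sketch of local causal states for a noisy 1-D cellular
# automaton (rule 110 with iid bit flips).
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def step(row, flip_p=0.05):
    """One update of elementary CA rule 110 with iid bit-flip noise."""
    l, c, r = np.roll(row, 1), row, np.roll(row, -1)
    idx = 4 * l + 2 * c + r
    rule = np.array([0, 1, 1, 1, 0, 1, 1, 0])  # rule 110 lookup table
    noise = rng.random(row.shape) < flip_p
    return rule[idx] ^ noise

# Simulate a space-time field (periodic boundary).
W, T = 100, 300
field = np.zeros((T, W), dtype=int)
field[0] = rng.integers(0, 2, W)
for t in range(1, T):
    field[t] = step(field[t - 1])

def past_cone(t, x, d=2):
    """Depth-d past lightcone of (t, x), including the present cell."""
    return tuple(field[t - k, (x + dx) % W]
                 for k in range(d + 1) for dx in range(-k, k + 1))

def future_cone(t, x, d=1):
    """Depth-d future lightcone of (t, x), excluding the present cell."""
    return tuple(field[t + k, (x + dx) % W]
                 for k in range(1, d + 1) for dx in range(-k, k + 1))

# Empirical conditional distribution p(L+ | L-) over all sites.
counts = defaultdict(lambda: defaultdict(int))
for t in range(2, T - 1):
    for x in range(W):
        counts[past_cone(t, x)][future_cone(t, x)] += 1

# Past cones are equivalent when their (coarsely rounded) conditional
# distributions over future cones agree; each class is a local causal state.
states = defaultdict(list)
for past, futs in counts.items():
    n = sum(futs.values())
    dist = tuple(sorted((f, round(c / n, 1)) for f, c in futs.items()))
    states[dist].append(past)

print(f"{len(counts)} distinct past cones -> {len(states)} local causal states")
```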
Some thoughts and questions
This depends on the choice of causal factorization. Sometimes these causal factorizations are given, but in full generality one probably has to consider all factorizations simultaneously, each giving a different local state presentation!
What is the Factored sets angle here?
In particular, given a stochastic process ... → X_{-1} → X_0 → X_1 → ..., the reversed process X^{BackToTheFuture}_t := X_{-t} can give a wildly different local causal field, as minimal predictors and retrodictors can be different. This can be exhibited by the random insertion process; see this paper.
Let a stochastic process X_t be given and define the (forward) causal states S_t as usual. The key 'stochastic complexity' quantity is defined as the mutual information I(S_t; X_{≤0}) of the causal states and the past. We may generalize this definition, replacing the past with the local past lightcone, to give a local stochastic complexity.
Under the assumption that the stochastic process is ergodic, the causal states form an irreducible Hidden Markov Model, and the stochastic complexity can be calculated as the entropy of the stationary distribution.
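As a small worked example (the golden mean process, a standard two-state epsilon machine; the transition matrix is a textbook example rather than anything from the post):

```python
# Stochastic complexity of the golden mean process (binary sequences
# with no two consecutive 1s): entropy of the stationary distribution
# of the two causal states.
import numpy as np

# State-to-state transition probabilities, marginalizing over symbols.
# State A: emit 0 (p=1/2, stay in A) or 1 (p=1/2, go to B).
# State B: emit 0 (p=1, go to A).
T = np.array([[0.5, 0.5],
              [1.0, 0.0]])

# Stationary distribution = left eigenvector of T with eigenvalue 1.
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Stochastic complexity C_mu = H(pi), in bits.
C_mu = -np.sum(pi * np.log2(pi))
print(pi, C_mu)  # ~[0.667, 0.333], ~0.918 bits
```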
Importantly, the stochastic complexity is different from the 'excess entropy': the mutual information of the past (lightcone) and the future (lightcone).
This potentially gives a lot of very meaningful quantities to compute. These are, I think, related to correlation functions but contain more information in general.
Note that the local causal state construction is always possible—it works in full generality. Really quite incredible!
How are local causal fields related to Wentworth’s latent natural abstractions?
Shalizi conjectures that the local causal states form a Markov field, which by Hammersley-Clifford would mean we could describe the system as a Gibbs distribution! This would prove an equivalence between the Gibbs/MaxEnt/Pitman-Koopman-Darmois theory and the conditional independence story of Natural Abstraction, roughly similar to John's early approaches.
I am not sure what the status of the conjecture is at this moment. It seems rather remarkable that such a basic fact, if true, cannot be proven. I haven’t thought about it much but perhaps it is false in a subtle way.
A Markov field factorizes over an undirected graph, which seems strictly less general than a directed graph. I'm confused about this.
Given a symmetry group G acting on the original causal model/field F(x,t) = (p, D), the action will descend to an action G ↷ ϵ(F)(x,t) on the derived local causal state field.
A stationary process X(t) is exactly one with a translation action by Z. This underlies the original epsilon machine construction of Crutchfield, namely the fact that the causal states don’t just form a set (+probability distribution) but are endowed with a monoid structure → Hidden Markov Model.
[1] In other words, by convention the Past includes the Present X_0 while the Future excludes the Present.
[Intuitively, Λ should contain no independent noise not accounted for by the X_i]
That condition doesn't work, but here are a few alternatives which do (you can pick any one of them):
Λ=(x↦P[X=x|Λ]) - most conceptually confusing at first, but most powerful/useful once you’re used to it; it’s using the trick from Minimal Map.
Require that Λ be a deterministic function of X, not just any latent variable.
H(Λ) = I(X; Λ)
(The latter two are always equivalent for any two variables X,Λ and are somewhat stronger than we need here, but they’re both equivalent to the first once we’ve already asserted the other natural latent conditions.)
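A quick numeric sanity check of that last condition on a toy joint distribution (the distributions are my own): when Λ is a deterministic function of X we get H(Λ) = I(X;Λ), and adding noise breaks the equality.

```python
# Check: H(L) = I(X;L) holds exactly when L is a deterministic
# function of X, i.e. H(L|X) = 0. Toy joint distributions below.
import numpy as np

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def check(joint):  # joint[x, l] = P(X=x, L=l)
    H_L = H(joint.sum(axis=0))
    H_L_given_X = sum(H(row / row.sum()) * row.sum() for row in joint)
    print(f"H(L)={H_L:.3f}  I(X;L)={H_L - H_L_given_X:.3f}")

# L = X mod 2: deterministic, so H(L) = I(X;L).
det = np.zeros((4, 2))
det[np.arange(4), np.arange(4) % 2] = 0.25
check(det)  # H(L)=1.000  I(X;L)=1.000

# L = (X mod 2) flipped with prob 0.1: now H(L) > I(X;L).
noisy = det * 0.9 + det[:, ::-1] * 0.1
check(noisy)  # H(L)=1.000  I(X;L)=0.531
```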