I’ve thought about it a bit and have a line of attack for a proof, but there’s too much work involved in following it through to an actual proof, so I’m going to leave it here in case it helps anyone.
I’m assuming everything is discrete so I can work with regular Shannon entropy.
Consider the range R1 of the function g1:λ↦P(X1|Λ=λ) and R2 defined similarly. Discretize R1 and R2 (chop them up into little balls). Not sure which metric to use, maybe TV.
Define Λ′1(λ) to be the index of the ball into which P(X1|Λ=λ) falls, and Λ′2 similarly. So if d(P(X1|Λ=a),P(X1|Λ=b)) is sufficiently small, then Λ′1(a)=Λ′1(b).
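To make the discretization concrete, here’s a minimal sketch of one way to do it (my own construction, using coordinate-wise cells of side δ rather than literal TV balls; as with any fixed partition, two distributions straddling a cell boundary can still land in different cells):

```python
import numpy as np

def ball_index(p, delta):
    """Index of the cell that the distribution p = P(X1 | Lambda = lam) falls
    into, when the probability simplex is chopped into axis-aligned cells of
    side delta.  Distributions whose coordinates all agree to well within
    delta usually (not always: cell boundaries) get the same index."""
    return tuple(np.floor(np.asarray(p) / delta).astype(int))

def discretize(cond_dists, delta):
    """Lambda'_1: map each lambda (one row of cond_dists, giving
    P(X1 | Lambda = lam)) to a small integer labelling its cell."""
    cells, labels = {}, []
    for p in cond_dists:
        idx = ball_index(p, delta)
        labels.append(cells.setdefault(idx, len(cells)))
    return np.array(labels)
```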
By the data processing inequality, conditions 2 and 3 still hold for Λ′=(Λ′1,Λ′2). Condition 1 should hold with some extra slack depending on the coarseness of the discretization.
It takes a few steps, but I think you might be able to argue that, with high probability, for each X2=x2, the random variable Q1:=P(X1|Λ′1) will be highly concentrated (n.b. I’ve only worked it through fully in the exact case, and I think it can be translated to the approximate case but I haven’t checked). We then invoke the discretization to argue that H(Λ′1|X1) is bounded. The intuition is that the discretization forces nearby probabilities to coincide, so if Q1 is concentrated then it actually has to “collapse” most of its mass onto a few discrete values.
We can then make a similar argument switching the indices to get H(Λ′2|X2) bounded. Finally, maybe applying conditions 2 and 3 we can get H(Λ′1|X2) bounded as well, which then gives a bound on H(Λ|Xi).
I did try feeding this to Gemini but it wasn’t able to produce a proof.
I’ve been working on the reverse direction: chopping up P[Λ] by clustering the points (treating each distribution as a point in distribution space) given by P[Λ|X=x], optimizing for a deterministic-in-X latent Δ=Δ(X) which minimizes DKL(P[Λ|X]||P[Λ|Δ(X)]).
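Here’s a minimal sketch of the sort of optimization I mean, under a couple of assumptions: the number of clusters |Δ| is fixed in advance, and the objective is the P[X=x]-weighted expected KL. Since the minimizer of the expected DKL(P[Λ|X]||q) over a cluster is just the weight-averaged posterior, i.e. P[Λ|Δ], this amounts to a Bregman-style k-means (all names below are mine, not from any particular library):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats, with 0 log 0 = 0; inf if q = 0 somewhere p > 0."""
    m = p > 0
    if np.any(q[m] == 0):
        return np.inf
    return np.sum(p[m] * np.log(p[m] / q[m]))

def kl_kmeans(posteriors, weights, n_clusters, n_iters=100, seed=0):
    """Hard clustering of the posteriors P[Lambda | X = x] (rows of
    `posteriors`, shape (n_x, n_lambda), weighted by P[X = x]) that locally
    minimizes  E_X[ D_KL( P[Lambda|X] || P[Lambda|Delta(X)] ) ]."""
    posteriors = np.asarray(posteriors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(posteriors)
    delta = rng.integers(n_clusters, size=n)            # initial Delta(x)
    for _ in range(n_iters):
        # Centroid of each cluster is exactly P[Lambda | Delta = d]:
        # the weight-averaged posterior of the cluster's members.
        centroids = []
        for d in range(n_clusters):
            w = weights * (delta == d)
            if w.sum() == 0:                             # empty cluster: reseed
                centroids.append(posteriors[rng.integers(n)])
            else:
                centroids.append(w @ posteriors / w.sum())
        centroids = np.array(centroids)
        # Reassign each x to the centroid with the smallest KL from its posterior.
        new_delta = np.array([np.argmin([kl(p, c) for c in centroids])
                              for p in posteriors])
        if np.all(new_delta == delta):
            break
        delta = new_delta
    per_x = np.array([kl(p, centroids[d]) for p, d in zip(posteriors, delta)])
    return delta, centroids, float(weights @ per_x)
```

The number of clusters still has to be chosen separately, which is the |Δ| issue mentioned further down.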
This definitely separates X1 and X2 to some small error, since we can just use Δ to build a distribution over Λ which should approximately separate X1 and X2.
To show that it’s deterministic in X1 (and by symmetry X2) to some small error, I was hoping to use the fact that, given X1, X2 has very little information about Λ, so it’s unlikely that P[Λ|X1] is in a different cluster to P[Λ|X1,X2]. This means that P[Δ|X1] would just put most of the weight on the cluster containing P[Λ|X1].
A constructive approach for Δ would be marginally more useful in the long run, but it’s also probably easier to prove things about the optimal Δ. It’s also probably easier to prove things about Δ for a given number of clusters |Δ|, but then you also have to prove things about what the optimal value of |Δ| is.
Sounds like you’ve correctly understood the problem and are thinking along roughly the right lines. I expect a deterministic function of X won’t work, though.
Hand-wavily: the problem is that, if we take the latent to be a deterministic function Δ(X), then P[X|Δ(X)] has lots of zeros in it—not approximate-zeros, but true zeros. That will tend to blow up the KL-divergences in the approximation conditions.
I’d recommend looking for a function Δ(Λ). Unfortunately that does mean that low entropy of Δ(Λ) given X has to be proven.
I’m confused by this. The KL term we are looking at in the deterministic case is DKL(P[X,Λ]||P[Λ]P[X1|Λ]P[X2|Λ]), right?
For simplicity, we imagine we have finite discrete spaces. Then this would blow up if P[X=(x1,x2),Λ=λ]≠0 while P[Λ=λ]P[X1=x1|Λ=λ]P[X2=x2|Λ=λ]=0. But this is impossible, because any of the terms in the product being 0 implies that P[X=(x1,x2),Λ=λ] is 0.
Intuitively, we construct an optimal code for the distribution P[Λ]P[X1|Λ]P[X2|Λ], and the KL divergence measures how many more bits on average we need to encode a message than optimal, when the true distribution is P[X,Λ]. Issues occur when the true distribution P[X,Λ] puts weight on values which never occur according to P[Λ]P[X1|Λ]P[X2|Λ], i.e. the optimal code doesn’t account for those values potentially occurring.
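For concreteness, a quick numerical illustration of the finiteness claim (the joint below is arbitrary made-up data): wherever the factorized distribution P[Λ]P[X1|Λ]P[X2|Λ] is zero, the true joint is zero as well, so with the 0 log 0 = 0 convention no term of the KL blows up.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary discrete joint P[x1, x2, lam] (here 3 x 4 x 2), with some
# entries forced to zero so the factorized distribution also has zeros.
P = rng.random((3, 4, 2))
P[rng.random(P.shape) < 0.3] = 0.0
P /= P.sum()

P_lam = P.sum(axis=(0, 1))                                   # P[Lambda]
with np.errstate(divide="ignore", invalid="ignore"):
    P_x1_g_lam = np.nan_to_num(P.sum(axis=1) / P_lam)        # P[X1 | Lambda]
    P_x2_g_lam = np.nan_to_num(P.sum(axis=0) / P_lam)        # P[X2 | Lambda]

# Factorized distribution P[Lambda] P[X1|Lambda] P[X2|Lambda]
Q = P_lam[None, None, :] * P_x1_g_lam[:, None, :] * P_x2_g_lam[None, :, :]

# Any factor being zero forces the joint P to be zero at that point, so the
# support of P is contained in the support of Q and the KL is finite.
mask = P > 0
assert np.all(Q[mask] > 0)
print(np.sum(P[mask] * np.log(P[mask] / Q[mask])))   # finite, = I(X1;X2|Lambda)
```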
Potentially there are subtleties when we have continuous spaces. In any case I’d be grateful if you’re able to elaborate.
Yeah, I’ve since updated that deterministic functions are probably the right thing here after all, and I was indeed wrong in exactly the way you’re pointing out.
Huh, I had vaguely considered that but I expected any P[X|Δ(X)]=0 terms to be counterbalanced by P[X,Δ(X)]=0 terms, which together contribute nothing to the KL-divergence. I’ll check my intuitions though.
I’m honestly pretty stumped at the moment. The simplest test case I’ve been using is for X1 and X2 to be two flips of a biased coin, where the bias is known to be either k or 1−k with equal probability of either. As k varies, we want to swap from Δ≅Λ to the trivial case |Δ|=1 and back. This (optimally) happens at around k=0.08 and k=0.92. If we swap there, then the sum of errors for the three diagrams of Δ does remain less than 2(ϵ+ϵ+ϵ) at all times.
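A rough numerical check of that crossover, measuring each diagram’s error as a conditional mutual information (e.g. DKL(P[X,Δ]||P[Δ]P[X1|Δ]P[X2|Δ]) = I(X1;X2|Δ)); treat this as a sketch rather than the exact computation:

```python
import numpy as np

def joint(k):
    """Joint P[lam, x1, x2] for the biased-coin example: lam=0 means bias k,
    lam=1 means bias 1-k, each with probability 1/2; x1, x2 are two
    independent flips given the bias (1 = heads)."""
    P = np.zeros((2, 2, 2))
    for lam, bias in enumerate([k, 1 - k]):
        for x1 in (0, 1):
            for x2 in (0, 1):
                p1 = bias if x1 else 1 - bias
                p2 = bias if x2 else 1 - bias
                P[lam, x1, x2] = 0.5 * p1 * p2
    return P

def H(P):
    """Shannon entropy in bits of a joint array, with 0 log 0 = 0."""
    p = P[P > 0]
    return -np.sum(p * np.log2(p))

def errors(P):
    """The three diagram errors when the latent is Lambda itself, each as a
    conditional mutual information:
      e_med  = I(X1; X2 | Lambda)   (mediation)
      e_red1 = I(Lambda; X1 | X2)   (redundancy)
      e_red2 = I(Lambda; X2 | X1)   (redundancy)
    using I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    P_l, P_lx1, P_lx2 = P.sum(axis=(1, 2)), P.sum(axis=2), P.sum(axis=1)
    P_x1x2, P_x1, P_x2 = P.sum(axis=0), P.sum(axis=(0, 2)), P.sum(axis=(0, 1))
    e_med = H(P_lx1) + H(P_lx2) - H(P) - H(P_l)
    e_red1 = H(P_lx2) + H(P_x1x2) - H(P) - H(P_x2)
    e_red2 = H(P_lx1) + H(P_x1x2) - H(P) - H(P_x1)
    return e_med, e_red1, e_red2

# For Delta = Lambda the mediation error is ~0; for the trivial |Delta| = 1
# the redundancy errors are 0 and the mediation error is I(X1; X2).
# Sweep k to see where the two total errors cross.
for k in np.linspace(0.01, 0.2, 20):
    P = joint(k)
    e_med, e_red1, e_red2 = errors(P)
    i_x1x2 = H(P.sum(axis=(0, 2))) + H(P.sum(axis=(0, 1))) - H(P.sum(axis=0))
    print(f"k={k:.2f}  total(Delta=Lambda)={e_med + e_red1 + e_red2:.4f}  "
          f"total(|Delta|=1)={i_x1x2:.4f}")
```

With this metric, for Δ≅Λ the mediation error vanishes and the total is I(Λ;X1|X2)+I(Λ;X2|X1), while for |Δ|=1 the redundancy errors vanish and the total is I(X1;X2); the two totals cross near k≈0.08, matching the figure above.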
Likewise, if we do try to define Δ(X), we need to swap from a Δ which is equal to the number of heads, to |Δ|=1, and back.
In neither case can I find a construction of Δ(X) or Δ(Λ) which swaps from one phase to the other at the right time! My final thought is for Δ to be some mapping Λ→P(Λ) consisting of a ball in probability space of variable radius (no idea how to calculate the radius) which would take k→{k} at k≈1 and k→{k,1−k} at k≈0.5. Or maybe you have to map Λ→P(X) or something like that. But for now I don’t even have a construction I can try to prove things for.
Perhaps a constructive approach isn’t feasible, which probably means I don’t have quite the right skillset to do this.
OK so some further thoughts on this: suppose we instead just partition the values of Λ directly by something like a clustering algorithm, based on DKL in P[X|Λ] space, and take Δ(λ) to just be the cluster that λ is in (a rough code sketch is at the end of this comment):
Assuming we can do it with small clusters, we know that P[X|Λ]≈P[X|Δ] up to some pretty small error, so DKL(P[X]||P[X|Δ]) is also small.
And if we consider X2←X1→Λ, this tells us that learning X1 restricts us to a pretty small region of P[X2] space (since P[X2|X1]≈P[X2|X1,Λ]) so Δ should be approximately deterministic in X1. This second part is more difficult to formalize, though.
Edit: The real issue is whether or not we could have lots of Λ values which produce the same distribution over X2 but different distributions over X1, and which are all pretty likely given X1=x1 for some x1. I think this just can’t really happen for probable values of x1, for two reasons: first, if these values of λ produce the same distribution over X2 but different distributions over X1, then that doesn’t satisfy X1←X2→Λ; and second, if they produced wildly different distributions over X1, then they can’t all have high values of P[X1=x1|Λ=λ], and so they’re not going to have high values of P[Λ=λ|X1=x1].
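Here’s a rough sketch of the clustering step described above (greedy and threshold-based; the symmetrized KL and the threshold ε are just one concrete choice, not something I’m claiming is optimal):

```python
import numpy as np

def sym_kl(p, q):
    """Symmetrized KL between two rows of P[X | Lambda] (0 log 0 = 0,
    infinite if the supports differ)."""
    def kl(a, b):
        m = a > 0
        return np.inf if np.any(b[m] == 0) else np.sum(a[m] * np.log(a[m] / b[m]))
    return kl(p, q) + kl(q, p)

def cluster_lambdas(cond_x_given_lam, eps):
    """Greedy partition of the lambda values: each lambda joins the nearest
    existing cluster if that cluster's representative row P[X | Lambda = rep]
    is within eps in symmetrized KL, otherwise it starts a new cluster.
    Delta(lambda) is the returned cluster index."""
    reps, delta = [], []
    for i, p in enumerate(cond_x_given_lam):      # one row per lambda value
        dists = [sym_kl(p, cond_x_given_lam[r]) for r in reps]
        if dists and min(dists) <= eps:
            delta.append(int(np.argmin(dists)))
        else:
            reps.append(i)                        # this lambda founds a new cluster
            delta.append(len(reps) - 1)
    return np.array(delta)
```

P[X|Δ=d] is then the P[Λ|Δ=d]-weighted mixture of the rows P[X|Λ=λ] in cluster d, which is what the P[X|Λ]≈P[X|Δ] approximation above relies on.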