I’ve been working on the reverse direction: chopping up P[Λ] by clustering the points (treating each distribution as a point in distribution space) given by P[Λ|X=x], optimizing for a deterministic-in-X latent Δ=Δ(X) which minimizes DKL(P[Λ|X]||P[Λ|Δ(X)]).
This definitely separates X1 and X2 to some small error, since we can just use Δ to build a distribution over Λ which should approximately separate X1 and X2.
To show that it’s deterministic in X1 (and by symmetry X2) to some small error, I was hoping to use the fact that, given X1, X2 has very little additional information about Λ, so it’s unlikely that P[Λ|X1] is in a different cluster from P[Λ|X1,X2]. This means that P[Δ|X1] would just put most of the weight on the cluster containing P[Λ|X1].
A constructive approach for Δ would be marginally more useful in the long run, but it’s probably easier to prove things about the optimal Δ non-constructively. It’s also probably easier to prove things about Δ for a given number of clusters |Δ|, but then you also have to prove things about what the optimal value of |Δ| is.
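Concretely, one way to run this clustering for a fixed |Δ| is a KL-flavoured k-means over the posteriors. This is only a rough sketch, assuming finite discrete spaces, and all the names are placeholders; the update step uses the fact that, for a fixed assignment, the centroid minimizing the expected KL is the P[X]-weighted mixture of the members’ posteriors, which is exactly P[Λ|Δ=δ]:

```python
import numpy as np

def kl_rows(P, q):
    """D_KL(P[i] || q) for each row P[i], in bits, with 0*log(0) treated as 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(P > 0, P * np.log2(P / q), 0.0)
    return terms.sum(axis=1)

def cluster_posteriors(posteriors, p_x, n_clusters, n_iters=100, seed=0):
    """KL k-means sketch for Delta(x) = cluster label of the point P[Lambda|X=x].

    posteriors: array of shape (n_x, n_lambda); row x is P[Lambda|X=x]
    p_x:        array of shape (n_x,); the marginal P[X=x]
    Returns the assignment Delta(x), the centroids P[Lambda|Delta], and the
    achieved value of E_X[ D_KL(P[Lambda|X] || P[Lambda|Delta(X)]) ].
    """
    rng = np.random.default_rng(seed)
    centroids = posteriors[rng.choice(len(posteriors), size=n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment step: send each x to the centroid closest in KL.
        divs = np.stack([kl_rows(posteriors, c) for c in centroids], axis=1)
        delta = divs.argmin(axis=1)
        # Update step: the best centroid for cluster d is the P[X]-weighted
        # mixture of its members' posteriors, i.e. P[Lambda | Delta = d].
        for d in range(n_clusters):
            w = p_x * (delta == d)
            if w.sum() > 0:
                centroids[d] = (w @ posteriors) / w.sum()
    divs = np.stack([kl_rows(posteriors, c) for c in centroids], axis=1)
    delta = divs.argmin(axis=1)
    return delta, centroids, float(p_x @ divs.min(axis=1))

# Toy usage: 20 random posteriors over a 3-valued Lambda, clustered into |Delta| = 4.
rng = np.random.default_rng(1)
post = rng.dirichlet(np.ones(3), size=20)
p_x = np.full(20, 1 / 20)
delta, centers, objective = cluster_posteriors(post, p_x, n_clusters=4)
print(delta, objective)
```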
Sounds like you’ve correctly understood the problem and are thinking along roughly the right lines. I expect a deterministic function of X won’t work, though.
Hand-wavily: the problem is that, if we take the latent to be a deterministic function Δ(X), then P[X|Δ(X)] has lots of zeros in it—not approximate-zeros, but true zeros. That will tend to blow up the KL-divergences in the approximation conditions.
I’d recommend looking for a function Δ(Λ). Unfortunately, that does mean you then have to prove that Δ(Λ) has low entropy given X.
I’m confused by this. The KL term we are looking at in the deterministic case is DKL(P[X,Λ]||P[Λ]P[X1|Λ]P[X2|Λ]), right?
For simplicity, imagine we have finite discrete spaces. Then this would only blow up if P[X=(x1,x2),Λ=λ]≠0 while P[Λ=λ]P[X1=x1|Λ=λ]P[X2=x2|Λ=λ]=0. But this is impossible, because any of the terms in the product being 0 implies that P[X=(x1,x2),Λ=λ] is 0.
Intuitively, we construct an optimal code for the distribution P[Λ]P[X1|Λ]P[X2|Λ], and the KL divergence measures how many more bits than optimal we need on average to encode a message when the true distribution is P[X,Λ]. Issues occur when the true distribution P[X,Λ] takes on values which have probability zero under P[Λ]P[X1|Λ]P[X2|Λ], i.e., the optimal code doesn’t account for those values occurring at all.
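As a quick numeric sanity check of this argument in the finite discrete case (the small random joint below is my own toy example, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random joint P[X1, X2, Lambda] on small finite spaces, with genuine zeros,
# but keeping P[Lambda = lam] > 0 for every lam so the conditionals exist.
joint = rng.random((3, 3, 4))
joint[rng.random(joint.shape) < 0.5] = 0.0
joint[0, 0, :] = rng.random(4) + 0.1
joint /= joint.sum()

p_lam = joint.sum(axis=(0, 1))           # P[Lambda]
p_x1_lam = joint.sum(axis=1) / p_lam     # P[X1 | Lambda], shape (3, 4)
p_x2_lam = joint.sum(axis=0) / p_lam     # P[X2 | Lambda], shape (3, 4)
product = p_lam * p_x1_lam[:, None, :] * p_x2_lam[None, :, :]

# Wherever the product is zero, the joint is forced to be zero as well,
# so the KL never hits a log(positive / 0) term.
assert np.all(joint[product == 0.0] == 0.0)
with np.errstate(divide="ignore", invalid="ignore"):
    kl = np.sum(np.where(joint > 0, joint * np.log2(joint / product), 0.0))
print(f"D_KL(P[X,Lambda] || P[Lambda] P[X1|Lambda] P[X2|Lambda]) = {kl:.4f} bits")
```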
Potentially there are subtleties when we have continuous spaces. In any case I’d be grateful if you’re able to elaborate.
Yeah, I’ve since updated that deterministic functions are probably the right thing here after all, and I was indeed wrong in exactly the way you’re pointing out.
Huh, I had vaguely considered that, but I expected any P[X|Δ(X)]=0 terms to be counterbalanced by P[X,Δ(X)]=0 terms, which together contribute nothing to the KL-divergence. I’ll check my intuitions though.
I’m honestly pretty stumped at the moment. The simplest test case I’ve been using is for X1 and X2 to be two flips of a biased coin, where the bias is known to be either k or 1−k with equal probability of either. As k varies, we want to swap from Δ≅Λ to the trivial case |Δ|=1 and back. This (optimally) happens at around k=0.08 and k=0.92. If we swap there, then the sum of errors for the three diagrams of Δ does remain less than 2(ϵ+ϵ+ϵ) at all times.
Likewise, if we do try to define Δ(X), we need to swap from a Δ which is equal to the number of heads, to |Δ|=1, and back.
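For concreteness, here is a sketch of this test case. I’m taking “the three diagrams” to be the mediation diagram plus the two redundancy diagrams, with each error measured as the KL divergence between P[X,Δ] and the factorization that diagram implies, and summing the three errors per candidate; the exact conditions and all helper names here are my own assumptions, not something pinned down above:

```python
import itertools
import numpy as np

def coin_joint(k, delta_fn):
    """Joint P[X1, X2, Delta] for the test case: the bias Lambda is k or 1-k
    with probability 1/2, X1 and X2 are two flips of that coin, and the
    candidate latent is Delta = delta_fn(lam, x1, x2)."""
    joint = {}
    for lam in (k, 1.0 - k):
        for x1, x2 in itertools.product((0, 1), repeat=2):
            p = 0.5 * (lam if x1 else 1 - lam) * (lam if x2 else 1 - lam)
            d = delta_fn(lam, x1, x2)
            joint[(x1, x2, d)] = joint.get((x1, x2, d), 0.0) + p
    return joint

def kl(p, q):
    """D_KL(p || q) in bits for dicts on the same support, with 0*log(0) := 0."""
    return sum(pv * np.log2(pv / q[key]) for key, pv in p.items() if pv > 0)

def diagram_errors(joint):
    """Errors of the three diagrams for a latent Delta over X = (X1, X2):
    mediation  X1 <- Delta -> X2:  D_KL(P[X,D] || P[D] P[X1|D] P[X2|D])
    redundancy X2 -> X1 -> Delta:  D_KL(P[X,D] || P[X1,X2] P[D|X1])
    redundancy X1 -> X2 -> Delta:  D_KL(P[X,D] || P[X1,X2] P[D|X2])"""
    def marg(idx):
        m = {}
        for key, p in joint.items():
            sub = tuple(key[i] for i in idx)
            m[sub] = m.get(sub, 0.0) + p
        return m

    p_d, p_x = marg([2]), marg([0, 1])
    p_x1d, p_x2d = marg([0, 2]), marg([1, 2])
    p_x1, p_x2 = marg([0]), marg([1])
    med, red1, red2 = {}, {}, {}
    for (x1, x2, d) in joint:
        med[(x1, x2, d)] = p_x1d[(x1, d)] * p_x2d[(x2, d)] / p_d[(d,)]
        red1[(x1, x2, d)] = p_x[(x1, x2)] * p_x1d[(x1, d)] / p_x1[(x1,)]
        red2[(x1, x2, d)] = p_x[(x1, x2)] * p_x2d[(x2, d)] / p_x2[(x2,)]
    return kl(joint, med), kl(joint, red1), kl(joint, red2)

# Compare candidate latents as k varies, to see where the best one switches.
candidates = {
    "Delta = Lambda":      lambda lam, x1, x2: lam,
    "Delta = #heads":      lambda lam, x1, x2: x1 + x2,
    "trivial |Delta| = 1": lambda lam, x1, x2: 0,
}
for k in np.arange(0.02, 0.52, 0.02):
    sums = {name: sum(diagram_errors(coin_joint(k, fn))) for name, fn in candidates.items()}
    best = min(sums, key=sums.get)
    print(f"k = {k:.2f}  " + "  ".join(f"{n}: {s:.4f}" for n, s in sums.items()) + f"  -> {best}")
```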
In neither case can I find a construction of Δ(X) or Δ(Λ) which swaps from one phase to the other at the right time! My final thought is for Δ to be some mapping Λ→P(Λ), sending each value of Λ to a ball in probability space of variable radius (no idea how to calculate the radius), which would take k→{k} at k≈1 and k→{k,1−k} at k≈0.5. Or maybe you have to map Λ→P(X) or something like that. But for now I don’t even have a construction I can try to prove things for.
Perhaps a constructive approach isn’t feasible, which probably means I don’t have quite the right skillset to do this.