Ok, I have a neat construction for z=1, https://www.lesswrong.com/posts/g9uMJkcWj8jQDjybb/ping-pong-computation-in-superposition that works pretty well (with width and layers), with zero error. Note that is exact here, not asymptotic.
I’m trying to find a similar construction that scales like in the number of parameters required, without just scaling the number of layers up. I’d also be curious if it’s possible to avoid scaling parameters linearly with , but it seems quite difficult.
I’m confused by this. The KL term we are looking at in the deterministic case is
$D_{\mathrm{KL}}\big(P[X,\Lambda]\,\|\,P[\Lambda]\,P[X_1|\Lambda]\,P[X_2|\Lambda]\big)$, right?
For simplicity, suppose the spaces are finite and discrete. Then this KL divergence would blow up only if $P[X=(x_1,x_2),\Lambda=\lambda]\neq 0$ while $P[\Lambda=\lambda]\,P[X_1=x_1|\Lambda=\lambda]\,P[X_2=x_2|\Lambda=\lambda]=0$. But this is impossible, because any of the three factors being 0 implies $P[X=(x_1,x_2),\Lambda=\lambda]=0$: if $P[\Lambda=\lambda]=0$ the joint is 0 immediately, and if $P[X_i=x_i|\Lambda=\lambda]=0$ then $P[X_i=x_i,\Lambda=\lambda]=0$, which again forces the full joint to 0.
Intuitively, we construct an optimal code for encoding the distribution $P[\Lambda]\,P[X_1|\Lambda]\,P[X_2|\Lambda]$, and the KL divergence measures how many extra bits we need on average to encode a message, relative to optimal, when the true distribution is $P[X,\Lambda]$. Issues occur when the true distribution $P[X,\Lambda]$ assigns positive probability to values that never occur under $P[\Lambda]\,P[X_1|\Lambda]\,P[X_2|\Lambda]$, i.e. the optimal code doesn't account for those values occurring at all.
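To make the support argument concrete, here is a toy numeric check (my own construction, not from the post): a finite discrete example with a deterministic latent $\Lambda = f(X)$, where I arbitrarily pick $\lambda = x_1 \oplus x_2$. The sketch verifies that the factorized distribution is nonzero wherever the joint is, so the KL sum is finite.

```python
import numpy as np

# Toy example: X1, X2 ∈ {0, 1}, deterministic latent λ = x1 XOR x2
# (an arbitrary illustrative choice). Indexing: p_joint[x1, x2, λ].
p_joint = np.zeros((2, 2, 2))
p_x = np.array([[0.1, 0.2], [0.3, 0.4]])  # some joint over (x1, x2)
for x1 in range(2):
    for x2 in range(2):
        p_joint[x1, x2, x1 ^ x2] = p_x[x1, x2]

p_lam = p_joint.sum(axis=(0, 1))   # P[Λ]
p_x1_lam = p_joint.sum(axis=1)     # P[X1, Λ]
p_x2_lam = p_joint.sum(axis=0)     # P[X2, Λ]

kl = 0.0
for x1 in range(2):
    for x2 in range(2):
        for lam in range(2):
            p = p_joint[x1, x2, lam]
            if p == 0:
                continue  # 0·log(0/q) = 0 by convention
            # q = P[λ] P[x1|λ] P[x2|λ]
            q = p_lam[lam] * (p_x1_lam[x1, lam] / p_lam[lam]) \
                           * (p_x2_lam[x2, lam] / p_lam[lam])
            # Support containment: q = 0 would force p = 0, so this holds.
            assert q > 0
            kl += p * np.log(p / q)

print(kl)  # finite and nonnegative
```

The inner assertion is exactly the claim above: whenever the true joint puts mass on a point, every factor of the product distribution is positive there, so no term of the KL sum diverges.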
Potentially there are subtleties when we have continuous spaces. In any case, I'd be grateful if you could elaborate.