Thank you for writing this up! I’m still not sure I understand condensation. I would summarize it as: instead of encoding the givens, we encode some latents which can be used to compute the set of possible answers to the givens (so we need a distribution over questions).
Also, the total cost of condensation has to be at least the entropy of the answer distribution (generated by the probability distribution over questions, applied to the givens), because of Shannon’s bound.
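A toy sketch of what I mean by the bound (the distribution and code here are made-up illustrative numbers, not anything from the post): no lossless encoding of the answers can have expected length below the entropy of the answer distribution, and a well-matched prefix-free code can meet it.

```python
import math

# Hypothetical answer distribution, as induced by some question
# distribution applied to the givens (toy numbers, purely illustrative).
answer_probs = {"A": 0.5, "B": 0.25, "C": 0.25}

# Entropy of the answer distribution, in bits. By Shannon's
# source-coding bound, this lower-bounds the expected code length
# of any lossless encoding of the answers.
entropy = -sum(p * math.log2(p) for p in answer_probs.values())

# A prefix-free code whose lengths match -log2(p) exactly,
# so the bound is achieved with equality here.
code = {"A": "0", "B": "10", "C": "11"}
expected_len = sum(answer_probs[a] * len(code[a]) for a in answer_probs)

print(entropy, expected_len)  # both 1.5 bits: the bound is tight here
```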
I feel like if the optimal condensation setup is indeed 1 book per question, then it’s not a very good model of latent variables, no? But perhaps it’s going in the right direction.
I do think this is a good insight. Or rather, it’s not new — SAEs do this — but it’s a fresh way of looking at it, which suggests: perhaps SAEs are trying too hard to impose a particular structure on the input, and instead we should just try to compress the latent stream, perhaps using diffusion or similar techniques.