Thank you for writing this up! I’m still not sure I understand condensation. I would summarize it as: instead of encoding the givens directly, we encode some latents from which the set of possible answers to the givens can be computed (so we need a distribution over questions).
Also, by Shannon’s source-coding bound, the total cost of condensation has to be at least the entropy of the answer distribution (the distribution generated by applying the probability distribution over questions to the givens).
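To make that concrete, here is a toy sketch (all the givens, questions, and probabilities below are hypothetical, just for illustration): push a question distribution forward through fixed givens to get an answer distribution, then compute its Shannon entropy, which lower-bounds the expected code length of any lossless encoding of the answers.

```python
import math
from collections import defaultdict

# Hypothetical fixed givens.
givens = (3, 5)

# Hypothetical questions: each maps the givens to an answer.
questions = {
    "sum": lambda g: g[0] + g[1],
    "max": lambda g: max(g),
    "parity": lambda g: (g[0] + g[1]) % 2,
}
# Hypothetical distribution over questions.
question_dist = {"sum": 0.5, "max": 0.25, "parity": 0.25}

# Answer distribution: apply the question distribution to the givens.
answer_dist = defaultdict(float)
for q, p in question_dist.items():
    answer_dist[questions[q](givens)] += p

# Shannon entropy in bits: lower bound on expected encoding cost.
entropy_bits = -sum(p * math.log2(p) for p in answer_dist.values())
print(entropy_bits)  # → 1.5
```

With these numbers the answers are 8 (p=0.5), 5 (p=0.25), and 0 (p=0.25), giving 0.5·1 + 0.25·2 + 0.25·2 = 1.5 bits, so no encoding of the answers can cost less than 1.5 bits on average.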
I feel like if the optimal condensation setup really is one book per question, then it’s not a very good model of latent variables, no? But perhaps it’s going in the right direction.