I have a pet peeve around this, which is hopefully a useful comment for someone to read: KL-divergence should not be symmetric, because the asymmetry is part of what it means. If you’re using KL-divergence and thinking to yourself “I wish this was symmetric”, then that should be a red flag that you’re using the wrong tool!
I think it’s easy for people to think, “hm, I’d like a way to quantify how different two probability distributions are from each other” and then they grab the nearest hammer, which happens to be KL-divergence. But mathematical definitions are not just tools for things; they mean things.
You should use KL-divergence when you want to measure the cost of modelling a true distribution using a false distribution. The asymmetry comes from the fact that one of them is the true one (and therefore the one that you take the expected value with respect to).
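To make the asymmetry concrete, here is a minimal sketch in Python. The two distributions and the names `true_dist` and `model` are made up for illustration, and the function assumes the second distribution is nonzero wherever the first one is.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits: the expectation, under p, of log2(p(x) / q(x)).

    Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) == 0 contribute nothing.
    """
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical example: a fair coin, modelled as a heavily biased one.
true_dist = [0.5, 0.5]
model = [0.9, 0.1]

# Expectation taken with respect to the true distribution.
forward = kl_divergence(true_dist, model)
# Roles swapped: now the biased coin is treated as the "true" one.
reverse = kl_divergence(model, true_dist)

print(f"D_KL(true || model) = {forward:.3f} bits")
print(f"D_KL(model || true) = {reverse:.3f} bits")
```

The two numbers differ because each one averages the log-ratio over a different distribution, which is exactly the "which one is the true one" choice described above.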
Is there an interpretation of KL divergence which works for subjective probability (credence functions) where there is no concept of “true” or “false” distribution? And even for an objective interpretation, the term “cost” seems to be external to probability theory.
(I have no idea what y’all are using KL-divergence for, so I have no opinion about whether you should have been using it in this theorem.)
The KL-divergence seems pretty principled here: the idea behind natural abstractions is that there are convergent abstractions that minds will hit upon, as backed up by some of their theorems. They deal with diagrams that are only approximately satisfied, where you’ve learned the diagram (e.g. defined some latent variable that you put into your models). You would expect minds to prefer modelling the environment in ways that have low KL-divergence, because that’s the quantity that tells you how much extra cost you are paying for using the wrong distribution to compress and predict.
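As a rough illustration of the "extra cost" reading, the sketch below checks numerically that the KL-divergence equals the gap between the cross-entropy (average bits per symbol when coding with the wrong distribution) and the entropy (average bits with the right one). The `environment` and `model` distributions are invented for the example and are not taken from the natural abstractions work.

```python
import math

def entropy(p):
    """Average bits per symbol with a code optimised for p itself."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """Average bits per symbol when data follows p but the code is optimised for q."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical environment over four symbols, and a model that assumes they are uniform.
environment = [0.5, 0.25, 0.125, 0.125]
model = [0.25, 0.25, 0.25, 0.25]

optimal_bits = entropy(environment)
actual_bits = cross_entropy(environment, model)
extra_bits = actual_bits - optimal_bits

print(f"optimal: {optimal_bits} bits, with wrong model: {actual_bits} bits")
print(f"extra cost: {extra_bits} bits, KL: {kl_divergence(environment, model)} bits")
```

Under these assumptions the extra cost and the KL-divergence come out equal, which is the sense in which a low-KL model is a cheap one to compress and predict with.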