I don’t think there is any kind of general result for this. It depends on what you’re trying to infer. For example, are you trying to find a scale parameter? A shape parameter? I think the most popular approach is to find the Maximum Entropy distribution. Admittedly, I don’t know a lot about the math behind this.
Math-wise, you basically pick whichever distribution, among those consistent with what you actually know, maximizes the integral from negative infinity to infinity of p(x) log(p(x)) with respect to x, multiplied by −1. So $-\int_{-\infty}^{\infty} p(x)\log p(x)\,dx$.
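To make that quantity a bit more concrete, here is a rough numerical sketch. The three unit-variance densities (Gaussian, Laplace, uniform) and the helper name `differential_entropy` are my own choices for illustration, and the integral is approximated with a simple midpoint rule on a truncated range, so treat the numbers as approximate. With only the variance fixed, the Gaussian is the maximum entropy choice, and the comparison should reflect that.

```python
import math

def differential_entropy(pdf, lo=-20.0, hi=20.0, n=200_000):
    """Approximate -integral of p(x) ln p(x) dx on [lo, hi] (midpoint rule, in nats)."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        p = pdf(lo + (i + 0.5) * dx)
        if p > 0:  # treat 0 * log(0) as 0
            total -= p * math.log(p) * dx
    return total

# Three densities, each with mean 0 and variance 1 (illustrative choices).
def gaussian(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def laplace(x):
    b = 1 / math.sqrt(2)  # variance of a Laplace is 2 * b**2
    return math.exp(-abs(x) / b) / (2 * b)

def uniform(x):
    half_width = math.sqrt(3)  # variance of a uniform is width**2 / 12
    return 1 / (2 * half_width) if abs(x) <= half_width else 0.0

h_gaussian = differential_entropy(gaussian)
h_laplace = differential_entropy(laplace)
h_uniform = differential_entropy(uniform)

print(f"gaussian: {h_gaussian:.4f} nats")
print(f"laplace:  {h_laplace:.4f} nats")
print(f"uniform:  {h_uniform:.4f} nats")
```

Under a different constraint (say, a known bounded range instead of a known variance), a different distribution comes out on top, which is why the answer depends on what you know going in.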
The reason this makes sense is that the value can be interpreted as the amount of information you expect to learn from hearing that x happened. Or, more straightforwardly, it’s how much you expect not to know about a particular variable or event. If you use log base 2, it’s measured in bits: the average number of yes/no questions needed to learn, as efficiently as possible, that it happened. For an explanation of why that’s true, these articles are excellent.
The reason you want to maximize this value is that not doing so assumes information that you don’t actually have. Say the maximum entropy distribution has 5 bits of entropy, and some other one has 4. If you choose the 4-bit one, then you’re basically making up information by thinking that you need one fewer yes/no question than you actually do.
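A minimal sketch of the 5-bit versus 4-bit comparison, using a discrete example I made up for illustration: 32 possible outcomes with no other information. The helper name `entropy_bits` is hypothetical. The uniform distribution over all 32 is the maximum entropy choice; the alternative quietly rules out half of the outcomes, which is exactly the extra bit of "information" you never had.

```python
import math

def entropy_bits(probs):
    """Shannon entropy of a discrete distribution, in bits (log base 2)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Maximum entropy over 32 outcomes: every outcome equally likely.
max_entropy_dist = [1 / 32] * 32

# Alternative: pretends 16 of the 32 outcomes are impossible.
overconfident_dist = [1 / 16] * 16 + [0.0] * 16

h_max = entropy_bits(max_entropy_dist)
h_other = entropy_bits(overconfident_dist)

print(f"max entropy distribution: {h_max} bits")
print(f"overconfident alternative: {h_other} bits")
print(f"information made up: {h_max - h_other} bits")
```

Five yes/no questions are enough to pin down one of 32 equally likely outcomes (halve the candidates each time). Picking the second distribution amounts to answering the first question yourself without any evidence.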