“it depends on what the distributions are, but there is another simple stat you can compute from the samples, which, combined with their average, gives you all the info you need”
Yes, assuming it’s a maximum entropy distribution (e.g. normal, Dirichlet, beta, exponential, geometric, hypergeometric, … basically all the distributions we typically use as fundamental building blocks). If it’s not a maximum entropy distribution, then the relevant information can’t be summarized by a simple statistic; we need to keep around the whole distribution P[X=x | M] for every possible value of x. In the maxent case, the summary statistics are sufficient to compute that distribution, which is why we don’t need to keep around anything else.
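To make that concrete, here’s a minimal Python sketch (my own illustration, not anything from the original exchange), assuming the model M says X is normal: the sample mean plus one more simple statistic, the mean of X², pins down the entire distribution P[X=x | M].

```python
import numpy as np

# Hypothetical setup: the model M says X is normal, the maxent
# distribution for a fixed mean and variance.
rng = np.random.default_rng(0)
samples = rng.normal(loc=3.0, scale=2.0, size=100_000)

# Two simple summary statistics computed from the samples.
mean = samples.mean()
second_moment = (samples ** 2).mean()
variance = second_moment - mean ** 2  # Var[X] = E[X^2] - E[X]^2

# Under the maxent assumption, these two numbers reconstruct the whole
# distribution P[X=x | M]; nothing else about the samples is needed.
def maxent_normal_pdf(x):
    return np.exp(-(x - mean) ** 2 / (2 * variance)) / np.sqrt(2 * np.pi * variance)

# Sanity check: the reconstructed density tracks the empirical histogram.
hist, edges = np.histogram(samples, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2
print(np.max(np.abs(hist - maxent_normal_pdf(centers))))  # small
```

For a non-maxent distribution no such pair of numbers would suffice, and you’d be stuck keeping around the full empirical distribution itself.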
Yes definitely. I’ve omitted examples from software and math because there’s no “fuzziness” to them; that kind of abstraction is already better understood than the more probabilistically-flavored use-cases I’m aiming for. But the theory should still apply to those cases as the limiting case in which probabilities are all 0 or 1, so they’re useful as a sanity check.