The Variational Characterization of Expectation
Epistemological Status: Fairly confident. This is much closer to the expected minimal map I had in mind ;)
The expectation of a random variable $X$ given $Y$ is the best point estimate of $X$ given $Y$ under squared error loss.
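Say we have a random variable $X$, but are only allowed to summarize its outcomes with a single number $c$. What should we pick so that the squared error is minimized? Since $\mathbb{E}[(X - c)^2] = \operatorname{Var}(X) + (\mathbb{E}[X] - c)^2$, the minimizer is the mean:

$$\mathbb{E}[X] = \arg\min_{c \in \mathbb{R}} \mathbb{E}\left[(X - c)^2\right]$$

Thus, in this sense, the expectation of a random variable is the best point estimate of its outcomes.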
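As a quick sanity check of this claim, here is a minimal sketch that compares the sample mean against a brute-force search over candidate summaries. The Gaussian sample, the grid range, and the helper name `mean_squared_error` are assumptions chosen for illustration, not part of the original argument.

```python
import random

random.seed(0)

# Hypothetical sample standing in for outcomes of the random variable.
xs = [random.gauss(2.0, 1.5) for _ in range(500)]


def mean_squared_error(c, samples):
    """Average squared error when every outcome is summarized by c."""
    return sum((x - c) ** 2 for x in samples) / len(samples)


sample_mean = sum(xs) / len(xs)

# Brute-force search over candidate summaries on a grid with spacing 0.01.
candidates = [i / 100 for i in range(-100, 501)]
best_c = min(candidates, key=lambda c: mean_squared_error(c, xs))

print("sample mean:", sample_mean)
print("best grid point:", best_c)
```

The grid minimizer should land on the grid point nearest the sample mean, which is what the variational characterization predicts for the empirical distribution.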
A major reason a variational characterization is interesting is that it creates a tunnel to optimization theory. The expected minimal map presented here can ‘average out’ irrelevant details, allowing you to filter down to just the relevant things in a streamlined manner.
When the random variables have binary outcomes we can use the conditional expectation to characterize probability without referencing information. It is effectively the same statement without reference to information. So there are at least two ways of giving a variational characterization of probability.
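Lemma 1 (Optimal Prediction): Define a random variable $X$. The best point representation of the outcome of $X$ for a given observation $Y$ is equivalent to $\mathbb{E}[X \mid Y]$,

$$\mathbb{E}[X \mid Y] = \arg\min_{f} \mathbb{E}\left[(X - f(Y))^2\right]$$

Moreover, the optimal point representation of $X$ is invariant under the pull-back of the conditioning,

$$\mathbb{E}\left[X \mid \mathbb{E}[X \mid Y]\right] = \mathbb{E}[X \mid Y]$$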
Corollary: When $X$ indicates a binary answer to a query, $\mathbb{E}[X \mid Y] = P(X = 1 \mid Y)$, so we have a characterization of probability.
The second condition is a fancy way of saying that our optimal estimate is an optimal way to condition our expectation. It’s somewhat like cheating on a test. You can either read the question $Y$ and then predict via $\mathbb{E}[X \mid Y]$, or you could be told the answer $\mathbb{E}[X \mid Y]$ and then predict via $\mathbb{E}[X \mid \mathbb{E}[X \mid Y]]$.
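The corollary can be illustrated with a small simulation: for a binary outcome, the per-observation frequency of a "yes" is the predictor with the lowest empirical squared error. This is only a sketch under assumed values; the three-valued observation, the table `true_p`, and the helper names are hypothetical.

```python
import random
from collections import defaultdict

random.seed(1)

# Hypothetical model: the observation y is uniform on {0, 1, 2} and the
# binary answer x is 1 with probability true_p[y].
true_p = {0: 0.2, 1: 0.5, 2: 0.9}
pairs = []
for _ in range(3000):
    y = random.choice([0, 1, 2])
    x = 1 if random.random() < true_p[y] else 0
    pairs.append((y, x))


def conditional_mean(data):
    """Empirical mean of x within each group of equal y."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for y, x in data:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in counts}


def squared_loss(f, data):
    """Average squared error of the predictor y -> f[y]."""
    return sum((x - f[y]) ** 2 for y, x in data) / len(data)


cond_freq = conditional_mean(pairs)
print("conditional frequencies:", cond_freq)
print("loss at conditional frequencies:", squared_loss(cond_freq, pairs))
```

Nudging any entry of `cond_freq` up or down can only increase `squared_loss`, which is the empirical version of the statement that the conditional probability is the optimal point estimate.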
Suppose that instead of cheating we answer some other question that gives us a hint for the answer. Now we have something like,

$$Y \to Z \to X$$

We have a question, we get a hint, we answer. From the above lemma, the best guess for $X$ is $\mathbb{E}[X \mid Z]$, but we don’t know $Z$. Our best guess for $Z$ is $\mathbb{E}[Z \mid Y]$, so the best we can do is,

$$g\left(\mathbb{E}[Z \mid Y]\right), \qquad g(z) = \mathbb{E}[X \mid Z = z]$$

On a multi-part problem we might have something like,

$$Y \to Z_1 \to Z_2 \to \cdots \to Z_n \to X$$

So we take the approximations as,

$$\hat{Z}_1 = f_1(Y), \quad \hat{Z}_2 = f_2(\hat{Z}_1), \quad \ldots, \quad \hat{X} = f_{n+1}(\hat{Z}_n)$$

In practice, we’ll often want or need to restrict the class of functions we know how to optimize over. In such cases extending the Q/A process is the only way to guarantee that the final answer is always close to the optimum. At this point we can ‘forget’ about the intermediates and optimize,

$$\min_{f_1, \ldots, f_{n+1}} \mathbb{E}\left[\left(X - (f_{n+1} \circ \cdots \circ f_1)(Y)\right)^2\right]$$

If the function class is appropriate, this gives you the optimization problem associated with training a neural network. Literally, but only approximately, each function averages out irrelevant features of its input to create more relevant features for prediction in its output.
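Proof of Lemma 1: The proof works for non-binary random variables. However, the probabilistic interpretation is lost. Expectation for random variables in $L^2$ is characterized by,

$$\mathbb{E}[X \mid Y] = \arg\min_{f(Y) \in L^2} \mathbb{E}\left[(X - f(Y))^2\right]$$

The minimizer exists and is unique by the Hilbert projection theorem. Moreover, when $X$ is binary,

$$\mathbb{E}[X \mid Y] = P(X = 1 \mid Y)$$

So all that’s left is to verify the pull-back doesn’t affect the outcome. Since $\mathbb{E}[X \mid Y]$ is a function of $Y$, the tower property applies. We have,

$$\mathbb{E}\left[X \mid \mathbb{E}[X \mid Y]\right] = \mathbb{E}\left[\mathbb{E}[X \mid Y] \mid \mathbb{E}[X \mid Y]\right] = \mathbb{E}[X \mid Y]$$

Therefore, the probability conditioned on the observation is the optimal point estimate, and pulling back the observation to just the point estimate leaves the estimate unchanged.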
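The pull-back step can also be checked exactly on a small finite example. The sketch below uses a made-up joint distribution in which two different observations share the same conditional expectation, so conditioning on the point estimate is genuinely coarser than conditioning on the observation. The table `joint` and the helper `conditional_expectation` are hypothetical, introduced only for this illustration.

```python
from collections import defaultdict
from fractions import Fraction as F

# Hypothetical joint pmf P(Y = y, X = x), keyed by (y, x).
joint = {
    (0, 0): F(3, 32), (0, 1): F(1, 32),  # P(Y=0) = 1/8, E[X | Y=0] = 1/4
    (1, 0): F(9, 32), (1, 1): F(3, 32),  # P(Y=1) = 3/8, E[X | Y=1] = 1/4
    (2, 0): F(1, 16), (2, 1): F(3, 16),  # P(Y=2) = 1/4, E[X | Y=2] = 3/4
    (3, 0): F(1, 16), (3, 1): F(3, 16),  # P(Y=3) = 1/4, E[X | Y=3] = 3/4
}


def conditional_expectation(pmf, key):
    """Return {k: E[X | key(Y) = k]} computed exactly from the joint pmf."""
    numer = defaultdict(F)
    denom = defaultdict(F)
    for (y, x), p in pmf.items():
        k = key(y)
        numer[k] += p * x
        denom[k] += p
    return {k: numer[k] / denom[k] for k in denom}


# Condition on the observation itself.
e_given_y = conditional_expectation(joint, key=lambda y: y)

# Pull back: condition only on the value of the point estimate.
e_given_estimate = conditional_expectation(joint, key=lambda y: e_given_y[y])

for t, value in e_given_estimate.items():
    print(f"E[X | estimate = {t}] = {value}")
```

Each printed conditional expectation equals the value of the estimate it conditions on, matching the invariance claimed in the lemma for this particular distribution.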