The Variational Characterization of Expectation

Epistemological Status: Fairly confident. This is much closer to the expected minimal map I had in mind ;)
The expectation of a random variable $X$ given $Y$ is the best point estimate of $X$ given $Y$ under squared error loss.
Say we have a random variable $X$, but are only allowed to summarize the outcomes with a single number $e_X$. What should we pick so that the squared error is minimized?
$$\min_{e_X} E[(X - e_X)^2] \iff \partial_{e_X} E[(X - e_X)^2] = 0 \iff E[X - e_X] = 0 \iff E[X] = e_X$$
Thus, in this sense, the expectation of a random variable is the best point estimate of its outcomes.
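As a quick sanity check, here is a minimal simulation sketch (using numpy, with an arbitrarily chosen exponential distribution for $X$; none of these choices come from the text) that scans constant guesses and confirms the sample mean minimizes the empirical squared error:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # arbitrary choice of distribution for X

def mse(e_x):
    """Mean squared error of the constant guess e_x."""
    return np.mean((x - e_x) ** 2)

# Scan a grid of constant guesses around the sample mean and pick the best one.
candidates = np.linspace(x.mean() - 2.0, x.mean() + 2.0, 401)
best = candidates[np.argmin([mse(c) for c in candidates])]

print(f"sample mean      : {x.mean():.4f}")
print(f"argmin of the MSE: {best:.4f}")  # matches the sample mean up to grid resolution
```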
A major reason a variational characterization is interesting is that it creates a tunnel to optimization theory. The expected minimal map presented here can 'average out' irrelevant details, allowing you to filter down to just the relevant things in a streamlined manner.
When the random variable has binary outcomes, we can use the conditional expectation to characterize probability without referencing information; it is effectively the same statement, just with no mention of information. So there are at least two ways of giving a variational characterization of probability.
Lemma 1 (Optimal Prediction): Let $\chi_U \in L^2$ be a random variable. The best point representation of the outcome of $\chi_U$ given an observation $O$ is $E[\chi_U \mid O]$. Moreover, the optimal point representation of $\chi_U$ is invariant under the pull-back $\eta : O \mapsto E[\chi_U \mid O]$ of the conditioning.
Corollary: When $\chi_U$ indicates a binary answer to a query, this gives a characterization of probability.
The second condition is a fancy way of saying that our optimal estimate is an optimal way to condition our expectation. It's somewhat like cheating on a test. You can either read the question $Q$ and then predict $A$ via $E[A \mid Q]$, or you can be told the answer and then predict $A$ via $E[A \mid E[A \mid Q]]$.
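A small numerical illustration of this identity (numpy again; the joint distribution of $Q$ and the binary answer $A$ is made up for the example): conditioning on the point estimate $E[A \mid Q]$ gives exactly the same predictions as conditioning on $Q$ itself.

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up joint distribution: Q takes three values and A is a binary answer
# whose conditional probability of being 1 depends on Q.
q_vals = np.array([0, 1, 2])
p_a_given_q = np.array([0.2, 0.5, 0.9])          # P(A = 1 | Q = q), chosen arbitrarily
q = rng.choice(q_vals, size=200_000)
a = (rng.random(q.shape) < p_a_given_q[q]).astype(float)

# E[A | Q]: the per-group sample mean of A, i.e. the empirical P(A = 1 | Q).
e_a_given_q = np.array([a[q == v].mean() for v in q_vals])
eta = e_a_given_q[q]                              # the pull-back eta(Q) = E[A | Q]

# E[A | eta(Q)]: condition on the point estimate instead of on Q itself.
for v in q_vals:
    estimate = e_a_given_q[v]
    print(f"Q={v}: E[A|Q]={estimate:.4f}  E[A|E[A|Q]]={a[eta == estimate].mean():.4f}")
```

On the empirical distribution the two columns agree exactly, which is the invariance the lemma asserts.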
Suppose that instead of cheating we answer some other question that gives us a hint for the answer. Now we have something like,
$$Q \to A_1 \to A_2$$
We have a question, we get a hint, we answer. From the above lemma, the best guess for $A_2$ is $E[A_2 \mid A_1]$, but we don't know $A_1$. Our best guess for $A_1$ is $E[A_1 \mid Q]$, so the best we can do is,
$$E[A_2 \mid Q] \approx E[A_2 \mid E[A_1 \mid Q]]$$
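To make the chained estimate concrete, here is a small exact computation on a made-up discrete chain $Q \to A_1 \to A_2$ (numpy; all transition tables are invented for illustration). It compares $E[A_2 \mid Q]$ with the hint-based $E[A_2 \mid E[A_1 \mid Q]]$:

```python
import numpy as np

# A made-up three-step chain Q -> A1 -> A2 (Markov), small enough to enumerate exactly.
p_q = np.array([1/3, 1/3, 1/3])                     # P(Q = q)
p_a1_given_q = np.array([[0.5, 0.0, 0.5],           # P(A1 = a | Q = 0)
                         [0.0, 1.0, 0.0],           # P(A1 = a | Q = 1)
                         [0.1, 0.2, 0.7]])          # P(A1 = a | Q = 2)
p_a2_given_a1 = np.array([0.1, 0.9, 0.2])           # P(A2 = 1 | A1 = a)

a1_vals = np.array([0.0, 1.0, 2.0])
e_a2_given_q = p_a1_given_q @ p_a2_given_a1         # E[A2 | Q = q], exact
e_a1_given_q = p_a1_given_q @ a1_vals               # E[A1 | Q = q], the 'hint' estimate

# Chained estimate: condition A2 on the point estimate E[A1|Q] instead of on Q.
# Q-values sharing the same hint value get averaged together (weighted by P(Q)).
chained = np.empty(3)
for i, h in enumerate(e_a1_given_q):
    same_hint = np.isclose(e_a1_given_q, h)
    chained[i] = np.average(e_a2_given_q[same_hint], weights=p_q[same_hint])

print("E[A2 | Q]        :", np.round(e_a2_given_q, 3))
print("E[A2 | E[A1 | Q]]:", np.round(chained, 3))
```

In this example the first two question values share the same hint estimate, so the chained prediction averages over them; where the hint value is unique the two estimates coincide.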
On a multi-part problem we might have something like,
$$(Q = A_0) \to A_1 \to \dots \to A_{n-1} \to A_n$$
So we take the approximations as,
$$\hat{A}_k = E[A_k \mid \hat{A}_{k-1}] = \operatorname*{argmin}_{e_k \in L^2} E\big[(A_k - e_k(\hat{A}_{k-1}))^2\big]$$
In practice, we’ll often want or need to restrict the class of functions we know how to optimize over. In such cases extending the Q/A process is the only way to guarantee that the final answer is always close to the optimum. At this point we can ‘forget’ about the intermediates and optimize,
$$\hat{A}_n = \operatorname*{argmin}_{e_1, \dots, e_n \in \mathcal{E}} E\big[(A_n - e_n \circ \dots \circ e_2 \circ e_1(Q))^2\big]$$
If the function class is appropriate, this gives you the optimization problem associated with training a neural network. Literally, but only approximately, each function averages out irrelevant features of its input to create more relevant features for prediction in its output.
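As a sketch of that correspondence (synthetic data, an arbitrary one-hidden-layer tanh function class, plain gradient descent; none of these choices come from the text), here is the composed squared-error objective above being minimized with $n = 2$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic Q/A data: the 'answer' depends on Q through a nonlinear feature.
q = rng.uniform(-2.0, 2.0, size=(1024, 1))
a_n = np.sin(2.0 * q) + 0.1 * rng.normal(size=q.shape)   # arbitrary target

# Restricted function class: e1 is affine-plus-tanh, e2 is affine, i.e. a tiny
# two-layer network e2 ∘ e1 trained on the squared-error objective above.
w1, b1 = rng.normal(size=(1, 16)) * 0.5, np.zeros(16)
w2, b2 = rng.normal(size=(16, 1)) * 0.5, np.zeros(1)
lr = 0.05

for step in range(2000):
    h = np.tanh(q @ w1 + b1)          # e1(Q)
    pred = h @ w2 + b2                # e2(e1(Q)) = estimate of A_n
    err = pred - a_n
    grad_w2 = h.T @ err / len(q)      # gradients of the mean squared error
    grad_b2 = err.mean(axis=0)
    dh = (err @ w2.T) * (1 - h ** 2)
    grad_w1 = q.T @ dh / len(q)
    grad_b1 = dh.mean(axis=0)
    w2 -= lr * grad_w2; b2 -= lr * grad_b2
    w1 -= lr * grad_w1; b1 -= lr * grad_b1

mse = np.mean((np.tanh(q @ w1 + b1) @ w2 + b2 - a_n) ** 2)
print(f"final squared error: {mse:.4f}")
```

Each $e_k$ here is just an affine map plus a nonlinearity; richer choices of the function class and longer chains give deeper networks, but the objective is the same one written above.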
Proof of Lemma 1: The proof works for non-binary random variables. However, the probabilistic interpretation is lost. Expectation for random variables in $L^2$ is characterized by,
$$P(\chi_U \mid O) = E[\chi_U \mid O] = \operatorname*{argmin}_{e_{\chi_U} \in L^2} E\big[(\chi_U - e_{\chi_U}(O))^2\big]$$
The minimizer exists and is unique by the Hilbert projection theorem. Moreover,
$$E[\chi_U \mid O] = E[\chi_U \mid \eta(O)] = E[\chi_U \mid E[\chi_U \mid O]] \iff \operatorname*{argmin}_{e_{\chi_U} \in L^2} E\big[(\chi_U - e_{\chi_U}(O))^2\big] = \operatorname*{argmin}_{e_1 \in L^2} E\big[(\chi_U - e_1(E[\chi_U \mid O]))^2\big]$$
So all that's left is to verify that the pull-back doesn't change the minimizer. We have,
$$\begin{aligned}
\operatorname*{argmin}_{e_1 \in L^2} E\big[(\chi_U - e_1(E[\chi_U \mid O]))^2\big]
&= \operatorname*{argmin}_{e_1 \in L^2} E\Big[\big(\chi_U - e_1\big(\operatorname*{argmin}_{e_2 \in L^2} E[(\chi_U - e_2(O))^2]\big)\big)^2\Big] \\
&= \operatorname*{argmin}_{e_1 \in L^2} E\big[(\chi_U - e_1 \circ e_2(O))^2\big] \\
&= \operatorname*{argmin}_{e_1, e_2 \in L^2} E\big[(\chi_U - e_1 \circ e_2(O))^2\big] \\
&= \operatorname*{argmin}_{e_{\chi_U} \in L^2} E\big[(\chi_U - e_{\chi_U}(O))^2\big]
\end{aligned}$$

(in the second and third lines, $e_2$ denotes the minimizer of the inner problem).
Therefore, the probability conditioned on the observation is the optimal point estimate, and pulling back the observation to just the point estimate leaves the estimate unchanged. □