Singular Learning Theory Comprehensive − 1
Introduction
There are some very nice resources for building intuition about Singular Learning Theory. However, I am unsatisfied with most of the explanations currently available online: they are concise and brief, and they skip many concepts that strengthen the intuition needed to do research in this field, which I found confusing. They are good for understanding the subject overall, but it is equally important to have a resource that explains the field in detail. This sequence is an attempt to provide that, and I have tried to keep it as comprehensive as possible. The material is adapted directly from Watanabe’s texts and Suzuki’s WAIC and WBIC with Python book, with solutions to some exercises and worked examples included. I am presenting these explanations as I understand the subject, so all feedback is appreciated. We start with the classical Bayesian framework and do a good deal of the work there first.
Guide: Please refer to this notebook for examples with code, some exercises and their solutions as well.
Introduction To Bayesian Statistics
We start with Bayesian statistics. Watanabe’s theory fundamentally generalizes classical results in Bayesian statistics, so it is important to understand the classical theory well before moving on. It also gives us a complete picture of the framework we are working in, and it is the first essential thing to master.
Connection with Machine Learning and Setup
Machine learning models are primarily built on one of two frameworks (or a combination of them): frequentist and Bayesian.
The setup is that we have a true data generating distribution
The likelihood function of our statistical model is defined as
The frequentist approach is to find the optimal
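The KL divergence from probability distribution $q(x)$ to probability distribution $p(x)$ is defined as
$$K(q \,\|\, p) = \int q(x) \log \frac{q(x)}{p(x)} \, dx.$$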
This is the main measure we will use to quantify the similarity between probability distributions, even though it is not a metric: it is not even symmetric.
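As a quick numerical illustration of the asymmetry, here is a minimal sketch for discrete distributions, using the standard definition of the KL divergence as the sum of q(x) log(q(x)/p(x)). The two distributions and the helper name `kl_divergence` are made up for illustration and do not come from the notebook.

```python
import numpy as np

def kl_divergence(q, p):
    """KL divergence sum_x q(x) log(q(x) / p(x)) for discrete distributions."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0  # terms with q(x) = 0 contribute 0 by convention
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Two hypothetical distributions on three outcomes.
q = np.array([0.5, 0.4, 0.1])
p = np.array([0.8, 0.1, 0.1])

kl_qp = kl_divergence(q, p)   # divergence with q as the reference
kl_pq = kl_divergence(p, q)   # divergence with p as the reference
print(kl_qp, kl_pq)           # the two values differ
```

Swapping the arguments changes the value, while the divergence of a distribution from itself is zero.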
It can be easily seen that finding the optimal
We will not delve further into the frequentist approach here (you may refer to Goodfellow et al.) and will instead move on to the Bayesian approach. An important distinction is that when we refer to neural networks here, we do not mean standard networks trained with SGD. Still, we gain many insights from this approach that also carry over to standard networks.
In the Bayesian approach, instead of considering just the optimal parameter, we consider a probability distribution over the space of parameters itself. Initially, this distribution is called the prior. As we observe data from the true distribution, we update the prior to obtain a posterior, which is an estimate, over the entire parameter space, of which parameters generate the true distribution.
Specifically, we consider an appropriate prior function
Construct the universe and the mathematical laws between Bayesian observables that hold for an arbitrary true distribution, statistical model, and prior.
Evaluate how appropriate the statistical model and the prior are using these laws.
Employ the most suitable pair.
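To make the prior-to-posterior update concrete, here is a minimal sketch on a parameter grid. Everything specific in it is an assumption chosen for illustration and not taken from the text: a Bernoulli model with parameter w, a uniform prior, a true distribution Bernoulli(0.7), and the grid resolution.

```python
import numpy as np

rng = np.random.default_rng(0)
true_prob = 0.7                            # hypothetical true distribution: Bernoulli(0.7)
data = rng.binomial(1, true_prob, size=50)

w_grid = np.linspace(0.001, 0.999, 999)    # discretized parameter space
dw = w_grid[1] - w_grid[0]
prior = np.ones_like(w_grid)               # uniform prior density on (0, 1)

# Log-likelihood of the whole sample at every grid point.
heads = int(data.sum())
tails = len(data) - heads
log_lik = heads * np.log(w_grid) + tails * np.log(1.0 - w_grid)

# Bayes' rule: the posterior is proportional to likelihood times prior.
unnormalized = np.exp(log_lik - log_lik.max()) * prior
posterior = unnormalized / (unnormalized.sum() * dw)

# Predictive probability of x = 1: average of p(1 | w) under the posterior.
predictive_one = float(np.sum(w_grid * posterior) * dw)
print(predictive_one)
```

For this conjugate toy case the grid answer can be compared against the closed form (heads + 1) / (n + 2), which is a handy way to check the discretization.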
Introduction to Bayesian Statistics
The posterior function is obtained through Bayes’ rule.
But neither do we know the statistical model, nor do we know the prior. Thus a meaningful approach is to start with some choice, evaluate how good it is, and then update it. The evaluation is done through the mathematical laws described above.
This gives rise to the estimated pdf of x, called the predictive distribution:
Expected:
True Distribution
A realized value of
Let us first review the basics, as they will be important in the calculations that follow.
Let
Do observe that we are able to take the product here because of the independent sampling.
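In practice the product over independent samples is almost always handled on the log scale, where it becomes a sum. The sketch below assumes a standard normal density and a sample size of 2000, both chosen only for illustration. It shows that the direct product underflows while the sum of logs stays finite.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=1.0, size=2000)   # i.i.d. draws (illustrative)

# Log-density of N(0, 1) at each sample point.
log_density = -0.5 * np.log(2.0 * np.pi) - 0.5 * sample**2

# Independence lets the joint density factor into a product of marginals...
direct_product = np.prod(np.exp(log_density))        # underflows to 0.0 for large n
# ...which is a plain sum on the log scale.
log_joint = np.sum(log_density)

# For a small subsample both routes agree.
small_product = np.prod(np.exp(log_density[:10]))
small_from_logs = np.exp(np.sum(log_density[:10]))
print(direct_product, log_joint)
```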
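The average entropy of the true distribution is defined as:
$$S = -\int q(x) \log q(x) \, dx.$$
The empirical entropy is defined as:
$$S_n = -\frac{1}{n} \sum_{i=1}^{n} \log q(X_i).$$
By definition, one can see that $\mathbb{E}[S_n] = S$.
Similarly, one can see that the variance of the empirical entropy is:
$$\mathbb{V}[S_n] = \frac{1}{n} \left( \int q(x) \left( \log q(x) \right)^2 dx - S^2 \right).$$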
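A Monte Carlo sanity check of the mean and variance of the empirical entropy can be sketched as follows. It assumes the empirical entropy is minus the sample average of log q(X_i), and it takes the true distribution to be N(0, 1), which is an illustrative choice. For that choice the entropy is (1/2) log(2πe) and the variance of -log q(X) is 1/2, so the variance of the empirical entropy should be close to 1/(2n).

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 100, 20000                      # sample size and number of repetitions

samples = rng.normal(size=(trials, n))      # each row is one sample X^n from N(0, 1)
neg_log_q = 0.5 * np.log(2.0 * np.pi) + 0.5 * samples**2
empirical_entropy = neg_log_q.mean(axis=1)  # one empirical entropy per trial

entropy = 0.5 * np.log(2.0 * np.pi * np.e)  # exact entropy of N(0, 1)
mean_S_n = empirical_entropy.mean()         # should be close to the entropy
var_S_n = empirical_entropy.var()           # should be close to 0.5 / n
expected_var = 0.5 / n
print(mean_S_n, entropy, var_S_n, expected_var)
```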
The average and empirical entropies are defined similarly when the true distribution is a conditional distribution:
Model, Prior and Posterior
Let
where
which is called the partition function/marginal likelihood/evidence.
Expected value over the posterior distribution is denoted
This expected value is a random variable as it depends on
The posterior gives rise to the predictive density function:
(estimate w from
If
An Important Example—The Exponential Family
In many simple statistical models, the posterior converges to the normal distribution as
At this point, I highly recommend referring to the example (given in the notebook link at the end).
We are now going to prove the formulae given in the example.
If the statistical model is of the form
where u is a real valued function (and the other two are vector valued), then this distribution is said to belong to the exponential family. Furthermore, if the distribution of the parameter θ∈Θ depends on some hyper-parameter ϕ, and can be written as
where z(ϕ) is the normalizing factor, then
In the case when the distribution is of the form
we can take
Now, as we know,
Let us denote
Let us get the
Hence,
which is also from the exponential family!
Finally, the predictive probability is given by
One may notice that we are using a different formula for the predictive density, bypassing the integral definition. It follows directly from applying Bayes’ rule to the given definition (check it yourself), and in some cases it is computationally more convenient.
For the example given at the start of the section, it is just a matter of inputting numbers into the formulae.
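As a hedged illustration of bypassing the integral, the sketch below uses a Bernoulli model with a Beta prior. This is a stand-in conjugate pair chosen for illustration, and the notebook example may use a different one. The data and hyperparameters are made up. Here the predictive probability is a ratio of Beta-function normalizers, and it is compared against a numerical evaluation of the integral definition.

```python
import math
import numpy as np

def log_beta(a, b):
    """Log of the Beta function B(a, b), the normalizer of a Beta(a, b) density."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

a0, b0 = 2.0, 2.0                          # hypothetical prior hyperparameters
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # made-up observations
k, n = int(data.sum()), len(data)
a_n, b_n = a0 + k, b0 + (n - k)            # posterior hyperparameters

# Predictive probability of x = 1 as a ratio of normalizers (no integral needed).
pred_ratio = math.exp(log_beta(a_n + 1, b_n) - log_beta(a_n, b_n))

# The same quantity from the integral definition, via a midpoint Riemann sum.
m = 200000
theta = (np.arange(m) + 0.5) / m
log_post = (a_n - 1) * np.log(theta) + (b_n - 1) * np.log(1 - theta) - log_beta(a_n, b_n)
pred_integral = float(np.mean(theta * np.exp(log_post)))
print(pred_ratio, pred_integral)
```

Both routes give the posterior mean a_n / (a_n + b_n) in this toy case, but the ratio form needs only two evaluations of the normalizer.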
Estimation and Generalization
To evaluate how accurate the predictive density is, we need an objective measure of the difference between the true and the estimated probability densities.
Let
Notice how both of these quantities are random variables.
Thus
Thus
An observation: since the entropy depends on neither the model nor the prior, a smaller generalization loss is equivalent to a smaller KL divergence.
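The relation between the generalization loss, the entropy and the KL divergence can be checked numerically. The sketch assumes the usual definitions: the generalization loss is minus the expected log predictive density under the true distribution, and the training loss is the same log loss averaged over the observed sample. The Bernoulli true distribution, the uniform Beta(1, 1) prior and the sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
q1 = 0.7                                    # hypothetical true P(X = 1)
n = 40
data = rng.binomial(1, q1, size=n)
k = int(data.sum())

a, b = 1.0, 1.0                             # uniform Beta(1, 1) prior (an assumption)
p1 = (a + k) / (a + b + n)                  # predictive probability of x = 1

# Generalization loss: expectation over a fresh X drawn from the true distribution.
G_n = -(q1 * np.log(p1) + (1 - q1) * np.log(1 - p1))
# Training loss: the same log loss averaged over the observed sample.
T_n = -(k * np.log(p1) + (n - k) * np.log(1 - p1)) / n

S = -(q1 * np.log(q1) + (1 - q1) * np.log(1 - q1))   # entropy of the true distribution
kl = q1 * np.log(q1 / p1) + (1 - q1) * np.log((1 - q1) / (1 - p1))
print(G_n - S, kl)                          # the two numbers coincide
```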
Definition: Assume
Cross validation loss is defined by
We now prove an important theorem, which has three statements regarding the definitions that we made.
Theorem: Assume that
is independent. Then the following holds.
Assume that
are finite values. Then
The cross validation loss satisfies the following:
For an arbitrary set of $X^n$, $C_n \geq T_n$, with equality iff $p(X_i|w)$ is a constant function of $w$ on $\{w : \varphi(w) > 0\}$.
Note:
(1) Here is the proof of the first statement.
The statement does not say what the expectation is taken over, but the proof makes it explicit. In any case, it is the canonical and most standard choice.
(2) We now prove the second statement:
Thus,
Call the integrand in the denominator
We introduced cross validation as a measure to evaluate the accuracy of our estimation. However, there are two issues with cross validation:
1) Although the averages of
2) In the second statement, if the average by the posterior is numerically approximated, then
is called the importance sampling cross validation loss. Importance sampling is a method of computing an expectation more easily by rewriting it as an expectation under a more manageable distribution.
is fundamentally different from the former.
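Here is a minimal sketch of the importance sampling cross validation loss computed from posterior draws. It assumes the form (1/n) Σ_i log E_w[1 / p(X_i | w)], with the posterior average replaced by a sample average. The Bernoulli model, Beta(1, 1) prior, data and number of draws are illustrative assumptions. Exact leave-one-out predictions are available in closed form for this toy model, so the two can be compared.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
data = rng.binomial(1, 0.7, size=n)          # made-up sample
k = int(data.sum())
a, b = 1.0, 1.0                              # uniform Beta(1, 1) prior (an assumption)

w = rng.beta(a + k, b + n - k, size=50000)   # draws from the full-data posterior

# p(X_i | w) for every data point (rows) and posterior draw (columns).
lik = np.where(data[:, None] == 1, w[None, :], 1.0 - w[None, :])

# Importance sampling CV: log of the posterior average of 1 / p(X_i | w), averaged over i.
iscv = float(np.mean(np.log(np.mean(1.0 / lik, axis=1))))

# Exact leave-one-out predictive probabilities for the Beta-Bernoulli model.
loo_prob = np.where(data == 1,
                    (a + k - 1) / (a + b + n - 1),
                    (b + n - k - 1) / (a + b + n - 1))
cv_exact = float(-np.mean(np.log(loo_prob)))
print(iscv, cv_exact)
```

The estimate relies on the posterior average of 1 / p(X_i | w) having finite variance, which holds in this example but can fail when a single observation is very influential.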
(3) Let us prove the third statement now.
By Cauchy Schwarz. Equality holds iff
We now introduce another measure, the widely applicable information criterion (WAIC), which is often better than the cross validation loss. There are also many cases where WAIC can be employed but cross validation cannot.
Definition: Let
Here is a result: If
Remark:
Just to summarize, we have introduced three instruments of measure:
Generalization Error:
Cross Validation Error:
WAIC Error:
In numerical experiments, we often care about minimizing the errors rather than the losses themselves, because the errors have lower variance.
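A minimal sketch of computing WAIC from posterior draws follows. It assumes the common form with inverse temperature 1: the training loss plus the average over data points of the posterior variance of log p(X_i | w). The model (normal with known unit variance and unknown mean), the wide normal prior, and the data are illustrative assumptions, chosen so the posterior can be sampled exactly.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
data = rng.normal(loc=0.5, scale=1.0, size=n)   # hypothetical true distribution N(0.5, 1)

# Model N(mu, 1) with prior mu ~ N(0, 10^2): the posterior of mu is Gaussian.
prior_var = 100.0
post_var = 1.0 / (n + 1.0 / prior_var)
post_mean = post_var * data.sum()
mu = rng.normal(post_mean, np.sqrt(post_var), size=20000)   # posterior draws

# log p(X_i | mu) for every data point (rows) and posterior draw (columns).
log_lik = -0.5 * np.log(2.0 * np.pi) - 0.5 * (data[:, None] - mu[None, :]) ** 2

# Training loss: minus the mean log predictive density at the observed points.
T_n = float(-np.mean(np.log(np.mean(np.exp(log_lik), axis=1))))
# Functional variance term: posterior variance of log p(X_i | mu), averaged over i.
V_n = float(np.mean(np.var(log_lik, axis=1)))
W_n = T_n + V_n                                  # WAIC with inverse temperature 1
print(T_n, V_n, W_n)
```

In this one-parameter regular model the correction term V_n comes out close to 1/n, so WAIC sits slightly above the training loss.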
Marginal likelihood or Partition Function
If a prior satisfies
We have quietly used Fubini’s theorem above. Thus
Definition: The free energy, or the minus log marginal likelihood, is defined by
$$F_n = -\log Z_n,$$
where $Z_n$ denotes the marginal likelihood.
We look at this quantity also as an estimate of
Using the notation
Thus,
Then,
Smaller
We now prove yet another important theorem.
Theorem: Let
. The average generalization loss is equal to the increase in free energy. Thus
$$\mathbb{E}[G_n] = \mathbb{E}[F_{n+1}] - \mathbb{E}[F_n].$$
Proof: For an arbitrary function,
Now,
Thus,
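The statement that the average generalization loss equals the increase in free energy can be verified exactly in a small conjugate example. The sketch assumes a Bernoulli model with a Beta(a, b) prior and a Bernoulli true distribution, all illustrative choices, and takes the free energy to be minus the log marginal likelihood. Both expectations are computed exactly by summing over the number of ones in the sample, so no sampling error is involved.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def expected_free_energy(n, q1, a, b):
    """E[F_n] for a Bernoulli model with a Beta(a, b) prior, summing over the count k."""
    total = 0.0
    for k in range(n + 1):
        prob_k = math.comb(n, k) * q1**k * (1 - q1) ** (n - k)
        log_marginal = log_beta(a + k, b + n - k) - log_beta(a, b)
        total += prob_k * (-log_marginal)
    return total

def expected_generalization_loss(n, q1, a, b):
    """E[G_n]: average over samples of minus the expected log predictive density."""
    total = 0.0
    for k in range(n + 1):
        prob_k = math.comb(n, k) * q1**k * (1 - q1) ** (n - k)
        p1 = (a + k) / (a + b + n)           # predictive probability of x = 1
        g = -(q1 * math.log(p1) + (1 - q1) * math.log(1 - p1))
        total += prob_k * g
    return total

q1, a, b, n = 0.7, 1.0, 1.0, 25              # hypothetical settings
lhs = expected_generalization_loss(n, q1, a, b)
rhs = expected_free_energy(n + 1, q1, a, b) - expected_free_energy(n, q1, a, b)
print(lhs, rhs)                              # agree up to floating point error
```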
Remark: As
We can illustrate the failure for the former: Let the marginal likelihood ratio for
Meaning of Marginal Likelihood
Assume that
By Bayes’ Theorem,
Thus if n is sufficiently large, maximizing
Conditional independent cases
We will make the definitions for the conditionally independent case. They are quite similar.
Let us assume
For an arbitrary function
which is a function of
Everything else is defined similarly. But in this case,
For Example:
Regression problem
for a fixed set are studied. Cross validation cannot be employed. Consider the time series expressed by the relation
This can be understood as a regression problem
Thus
Exercises
I now refer you to the first set of exercises given in the notebook.
Further Steps
In the next post, we will introduce the concepts of realizability and regularity. We will discuss the main theorems of the regular statistical models. We will discuss MCMC methods that are a key tool for calculations, and we may do some other things as well.