VC Theory Overview

In this post, I will give a brief overview of VC theory and computational learning theory. The reason for this is that I will later write a post about singular learning theory, in which I will refer back to this post. However, this post is self-contained, and may be interesting for anyone interested in theoretical machine learning.

What is Computational Learning Theory?

In short, computational learning theory (CLT) is what happens when you combine complexity theory with machine learning. Just like complexity theory is concerned with understanding what kinds of problems are hard to compute, or easy to compute, CLT is concerned with understanding what kinds of things are easy to learn, or hard to learn. This means that CLT seeks to identify classes of problems that you can learn with a small amount of data, and classes of problems that no algorithm can learn unless it gets a large amount of data. This also means that CLT produces generalisation bounds for different kinds of learning algorithms, and other such results.

Probably Approximately Correct Learning

In order to study what kinds of problems are hard to learn, and easy to learn, we first need a mathematical formalism that captures what it means to “learn a problem”. The most popular model in CLT is given by the probably approximately correct (PAC) learning framework. This is a model of supervised learning for classification. It uses the following components:

We have an instance space $X$ , which contains all the data that we might observe. For example, $X$ may be a vector space, or it may be the set of all bit strings, etc. If we are doing image classification, then $X$ may be the set of all RGB images of a certain size, and so on.

We have a set of concepts $C$ , where every element of $C$ is a function $c : X \to {0, 1}$ . This is the set of functions that we might be trying to learn. For example, if $X$ is a vector space, then $C$ could be the set of all linear classifiers on $X$ . Alternatively, if $X$ is ${0, 1}^{n}$ , then $C$ could be the set of all Boolean functions over $n$ variables with circuit complexity $\leq m$ . And so on.

We assume that we have a (potentially unknown) distribution $D$ over $X$ , and that there is an unknown function $c : X \to {0, 1}$ that we are trying to learn. To do this, our learning algorithm will observe a number of data points $(x, c (x))$ , where each $x$ is sampled iid from $D$ . Based on this dataset, it will guess a function $h : X \to {0, 1}$ . We want this function to be similar to $c$ with high probability.

Specifically, given a distribution $D$ over $X$ , a concept $c \in C$ , and a hypothesis $h : X \to {0, 1}$ , we say that the error of $h$ is $P (h (x) \neq c (x))$ , where $x$ is sampled from $D$ .

A learning algorithm $L$ is an algorithm that takes as input a dataset ${(x, c (x))}$ of points, and returns a function $h : X \to {0, 1}$ .

We say that $L$ is a PAC-learning algorithm for $C$ if there exists a polynomial $p$ , such that for all $ϵ$ and $δ$ , and all distributions $D$ over $X$ and all concepts $c \in C$ , if $L$ is given access to at least $p (1 / ϵ, 1 / δ)$ samples from $D$ , together with their labels according to $c$ , then $L$ will with probability at least $1 - δ$ learn a function whose error is at most $ϵ$ . Moreover, if there exists a PAC-learning algorithm for $C$ , then we say that $C$ is PAC-learnable.

In other words, PAC-learning is a mathematical formalism that describes supervised learning of (binary) classification algorithms. $C$ is a set of classification functions, that we may be trying to learn. We want to know if there exists a supervised learning algorithm $L$ , such that if $L$ is given access to a small (polynomial) amount of training data, then it will with high probability learn a function with low error. We want this bound to hold for all $c \in C$ , and all $D$ over $X$ , so that we can be sure that this bound holds even if we have no prior knowledge of $c$ or $D$ .

Note that, while this is required to hold for all $D$ , the error of $h$ is defined relative to the same distribution which the training data is drawn from. In other words, we assume that there is no distributional shift. This means that, while $D$ is unknown, it is the case that if a particular data point has a low probability under the training distribution, then it must also have a low probability under the test distribution, and vice versa.

Note also that we are not concerned with the question of whether or not $h = c$ . As far as this framework is concerned, it could be that $L$ never proposes the right hypothesis, even in the limit of infinite data. As long as $h$ has a low error, this is enough.

Ways To Generalise the PAC Formalism

There are some parts of the PAC-learning framework that seem somewhat strong, so before moving on, let me briefly make some comments on how this framework could be generalised.

First of all, we require that for all $ϵ$ and for all $δ$ , if $L$ is given access to at least $p (1 / ϵ, 1 / δ)$ samples from $D$ , together with their labels according to $c$ , then $L$ will with probability at least $1 - δ$ learn a function whose error is at most $ϵ$ . What if we only require that this holds for some $δ$ (smaller than 0.5) and some $ϵ$ (smaller than 0.5)?

This generalisation makes no difference. First of all, if we have an algorithm $L$ that with probability at least (say) 0.6 learns a function with error at most $ϵ$ , then we can get an algorithm that with probability at least 1- $δ$ learn a function whose error is at most $ϵ$ , by simply running $L$ multiple times and creating an ensemble of the functions that it learns. Similarly, if we have an algorithm that only is a PAC-learning algorithm for some $ϵ$ , then this can similarly be used to construct a full PAC-learning algorithm. However, this is a bit more difficult to show, because simply running the algorithm multiple times is not guaranteed to work. The proof instead relies on creating multiple artificial data distributions, and then training the algorithm on these different distributions (for an overview, see the theory behind AdaBoost and similar algorithms).

Another thing is that this formalism requires $L$ to work for any distribution $D$ over $X$ . Are there cases where there is a PAC-learning algorithm for some $D$ , but not for all $D$ ? Here, the answer is actually yes. Trivially, we may assume that $D$ only gives support to a single element of $X$ ; in that case, a single data point is always enough to learn a hypothesis $h$ with zero error. Less trivially, there are concept classes which are PAC-learnable when $D$ is uniform, but not in general. You could therefore create a version of the PAC-learning framework where you only require that everything works for one particular distribution. However, this generalisation ends up not being very insightful.

A third thing is that the PAC formalism assumes that we are doing binary classification. This can be straightforwardly extended to multi-label classification, since classification with $n$ labels is equivalent to learning $n$ binary classifiers. Extending the formalism to deal with regression is also possible, but a bit more involved.

Finally, the PAC formalism assumes that the true labeling function c is contained in C, and it assumes that there is no label noise. You can lift these assumptions, without changing the big picture very much. I will briefly discuss this at the end of this post.

Other Formalisms in CLT

Before we move on, let me briefly mention some of the other formalisms that are studied in CLT. While the PAC-learning framework probably is the most common one, there are also other frameworks. One option is to assume that there is label noise, which gets especially interesting (mathematically speaking) if the label noise is adversarial. Another option is to assume that the learning algorithm isn’t just doing supervised learning, but that it also is able to do active learning. In this setting, we assume that $L$ has access to a subroutine that it can query at unit cost, and which for an input $x$ returns $c (x)$ . Interestingly, there are concept classes which are efficiently learnable with active learning, but not with supervised learning. You can also assume that the data isn’t drawn i.i.d., but that it is instead correlated in some way. For example, it could come from a time series, or it could be selected intelligently (either helpfully or adversarially). For an overview of these things, see some standard textbook.

Vapnik-Chernovenkis Dimension

You may be surprised to learn that we have necessary and sufficient conditions that characterise when a concept class $C$ is PAC-learnable. These results rely on the Vapnik-Chernovenkis (VC) dimension of $C$ , which is a measure of the expressivity of $C$ . It is defined like this:

Given a finite set $S$ of points in $X$ , we say that $C$ “shatters” $S$ if, for every subset $S^{'}$ of $S$ , there is a function $c \in C$ such that, for $x \in S$ , we have that $c (x) = 1$ iff $x \in S^{'}$ . In other words, C is able to express every Boolean function over S. We then say that the VC dimension of $C$ is the size of the largest set that it can shatter. If it can shatter some arbitrarily large sets, then we say that the VC dimension of $C$ is infinite.

For example, if $C$ is the set of all linear classifiers in an n-dimensional vector space, then the VC dimension of $C$ is $n + 1$ . Alternatively, if $C$ is the set of all Boolean functions over n variables, then the VC dimension of $C$ is $2^{n}$ . If C is the set of all computable functions on ${0, 1}^{⋆}$ , then the VC dimension of $C$ is infinite. If C is the set of all classifiers for convex sets in $R^{n}$ , then the VC dimension of $C$ is infinite.

If $C$ has a finite cardinality, then the VC-dimension of $C$ is at most $l o g_{2} | C |$ .

Putting it Together

We say that $L$ is a consistent learner for $C$ if, for every dataset that is consistent with some $c \in C$ , $L$ will output a hypothesis $h \in C$ such that $h$ fits all training data.

We then have that $C$ is PAC-learnable if and only if $C$ has a finite VC dimension. Moreover, in that case, any consistent learner is a PAC-learning algorithm for $C$ .

If $L$ is a consistent learner for $C$ , where $C$ has VC-dimension $d$ , and $L$ is given access to at least $k ((1 / ϵ) l o g (1 / δ) + (d / ϵ) l o g (1 / ϵ))$ data points, then for any $c \in C$ and any $D$ over X, L will with probability at least $1 - δ$ learn a function with error at most $ϵ$ . Here k is a universal constant.

Moreover, for any learning algorithm L for C, if C has VC-dimension d, then we can find a $c \in C$ and a $D$ over $X$ such that if $L$ is given access to less than $(d - 1) / (32 * ϵ)$ data points, then the probability that it will learn a function $h$ with error greater than $ϵ$ , is at least $δ$ . We can also find a $c \in C$ and a D over X for which we need at least $(1 / 4 ϵ) l o g (1 / 4 δ)$ data points.

Thus, the VC-dimension of C gives us both an upper and a lower bound on how much data we need in order to probably learn an approximately correct function. Moreover, these bounds are nearly equal, and are attained by any consistent learning algorithm.

What About Computational Cost?

The version of PAC-learning that I have given above is only concerned with how much training data L needs, and does not consider the computational cost of L. This raises the question; are there concept classes C where we only need a polynomial amount of data to learn c, but where the cost of fitting a hypothesis to the training data is exponential (or worse)? The answer to this question is yes, given certain widely held assumptions about complexity theory and cryptography. Intuitively, there are cases where C can have a finite VC-dimension, but where fitting a hypothesis to the training data is NP-complete (for example). In this setting, I don’t think there are neat necessary and sufficient conditions yet. I won’t talk more about this here, but if you are interested, it may be worthwhile to read a textbook on CLT.

What About Neural Networks?

The VC dimension of a neural network depends on its architecture, parameter count, and activation function. However, unless you pick a very strange activation function, you will find that the VC dimension indeed is finite. Moreover, a neural network is of course also a consistent learner (if you train for long enough). This means that you can apply all the math above, and get both an upper and a lower bound on the amount of data that you need to ensure that a neural network will generalise well.

If you do this, you will find that the bound is very, very large—much larger than the amount of data that in practice is enough to get good generalisation. This is because neural networks have a very large VC dimension, so the above bounds say that a lot of data should be required. Moreover, the VC dimension grows as you add more parameters, so the more parameters you have, the more data is required for the above bounds to apply (whereas in practice, adding more parameters typically improves performance, rather than decrease it).

Some people take this to mean that VC theory somehow doesn’t apply to neural networks. This is not correct; the above results are mathematical theorems that apply to all learning machines, including neural networks. However, it is important to note that these bounds are worst-case bounds—they say that there exists a function and a data distribution for which neural networks require a large amount of data. And this is true; you can artificially construct a labeling function that a neural network can express, but which it takes a lot of data for it to learn.

From this, we can immediately see where the PAC-learning formalism must be relaxed, in order to provide bounds that are more informative in practice. The PAC-learning framework demands that L probably will learn an approximately correct hypothesis for all c in C. In reality, there are many functions that a neural network in principle could express, but which will never come up in reality. The right way to augment the formalism would therefore be to assume that we have some prior probability distribution P over C, and that c is sampled from P. We then want L to probably learn an approximately correct hypothesis, where the probability is given relative to both P and D, or something along those lines. We could perhaps also relax the data requirement, to allow p to be polynomial in $1 / P (c)$ , in addition to $1 / δ$ and $1 / ϵ$ .

(Note that this is also why we get the result that any consistent learner attains both the upper and the lower bound; if c and D can be selected arbitrarily, including adversarially, then the inductive bias of the learning algorithm doesn’t matter.)

Note that the VC theory framework still applies somewhat to this setting; we could assume that c with high probability is drawn from some subset C’ of C, where C’ has a much lower VC-dimension than C, and that L will attempt to find a hypothesis in C’ that fits the training data. However, spelling all of this out in detail is still somewhat involved.

What Does This Have To Do With Alignment?

Understanding generalisation and inductive bias is important for alignment. VC theory does not provide that (at least not when it comes to neural networks), but it is an important part of the story of theoretical ML as it stands right now. I wrote this because I think VC theory is interesting, and because I will want to reference this post in an upcoming post on singular learning theory.

Addendum: Agnostic PAC Learning

The definition of PAC-learning makes two assumptions that are somewhat unsatisfying; first of all, it assumes that the true label function $c$ is contained in $C$ , and secondly, it assumes that there is no label noise in the training data (or test data). Both of these assumptions can be lifted, without changing the overall picture very much.

To do this, we will allow the data to be sampled from an arbitrary distribution $D$ over $X \times {0, 1}$ (so that each data point is a tuple, where the first element is interpreted as the input, and the second element as the label). This covers the case where there is no function $c \in C$ which fits the data distribution perfectly. It also allows for label noise (and the noise isn’t even required to be e.g. i.i.d.). Given such a distribution $D$ , and a function $h : X \to {0, 1}$ , we say that the error of $h$ is $e r r_{D} (h) = P_{X, Y \sim D} (h (X) \neq Y)$ .

Now, we say that $L$ is an “agnostic” PAC-learning algorithm with hypothesis class $C$ if there exists a polynomial function $p$ such that for every $ϵ$ , every $δ$ , and every distribution $D$ over $X \times {0, 1}$ , if $L$ is given $p (1 / ϵ, 1 / δ)$ i.i.d. samples from $D$ , then with probability at least $1 - δ$ , $L$ will learn a hypothesis $h \in C$ such that $e r r_{D} (h) \leq {min}_{h^{⋆} \in C} e r r_{D} (h^{⋆}) + ϵ$ . Moreover, we say that $C$ is agnostically PAC-learnable if there exists an agnostic PAC-learning algorithm for $C$ .

In other words, in this setting, we allow label noise, and we allow data distributions which cannot be perfectly modeled by $C$ . We are concerned with whether or not it is possible to learn a hypothesis $h$ which is approximately optimal among all functions in $C$ , with high probability, using a reasonable amount of training data, while making very minimal assumptions about the underlying data distribution.

Say that an empirical risk minimiser is any learning algorithm which, for any dataset ${(x, y)}$ , returns a function $h \in C$ which minimises the empirical error on that dataset.

We then have that $C$ is agnostically PAC-learnable if and only if $C$ has a finite VC-dimension. Moreover, in that case, any empirical risk minimiser is an agnostic PAC-learning algorithm for $C$ . Thus, agnostic PAC-learning is characterised by the same conditions as PAC-learning.

We have that if $L$ is an empirical risk minimiser for $C$ , where $C$ has VC-dimension $d$ , then for any distribution $D$ over $X \times {0, 1}$ , if $L$ is given access to at least $u ((d / ϵ^{2}) + (1 / ϵ^{2}) l o g (1 / δ))$ samples from $D$ , then $L$ will with probability at least $1 - δ$ learn a function $h$ where $e r r_{D} (h) \leq {min}_{h^{⋆} \in C} e r r_{D} (h^{⋆}) + ϵ$ . Moreover, any agnostic PAC-learning algorithm for $C$ needs at least $l ((d / ϵ^{2}) + (1 / ϵ^{2}) l o g (1 / δ))$ samples. Here $u$ and $l$ are two constants.