Understanding Machine Learning (I)
Understanding Machine Learning is another book on Miri’s research guide. I’ve previously written about Topology and Linear Algebra Done Right. With this one, my write-up ended up looking a lot more like a self-contained introduction into the field and a lot less like a review. I’ve decided to go with that, and make this a sort of shortened version of the book, or at least the first part of the book, which covers the “fundamentals” of Machine Learning. This will take three posts of roughly similar size. My main reason for doing this is that it feels quite useful for myself, and in fact more useful than doing exercises.
One pattern I noticed with this book (in contrast to the pure math ones) is that both definitions and theorem statements tend to be longer. This may be because the research is more directed – it is actually trying to solve a particular problem, rather than always pursuing the most elegant looking direction, and it is trying to do so with the same level of rigor found in non-applied math. That is a difficult task. If you are not already familiar with Machine Learning, consider the problem of formalizing a theory around the idea of learning from data. This seems highly non-trivial – how would you even start? Consequently, the precise statements and definitions we can make end up inherently complex, and they make use of several strands of mathematics. Among other fields, this book contains probability theory and calculus, and a lot of linear algebra in later chapters. It seems to be aimed at graduate students.
Throughout this series, I will be using bold and italics to mean “this is the first official mention of a new Machine Learning term”, whereas normal italics means regular emphasis.
The Learning Model
The book begins with a sort of mission statement. Given a domain set and a target set , we wish to find a function , which we interchangeably call a hypothesis or a predictor or a classifier. In the example used in the first chapter, represents the set of all papayas, and represents a two-element set where one element is the label “tasty” and the other the label “not-tasty”. To compress this into something manageable, we represent papayas by a set of features, in this case color and softness, which we both map onto the interval , and we represent the two labels by 1 and 0. I.e. and . Now the idea is that, if we learn such a predictor, we can apply it to any new papaya we encounter, say on the market, and it will tell us whether or not it is tasty before we’ve spent any money on it. Of course, this will only be useful if the predictor we learn is actually accurate.
This model seems quite general. It clearly covers the case of chess: one could let be the set of all possible states of the chessboard and be the set of all moves; then we want to map each onto the best possible move. In fact, it is hard to see which solvable problem could not be modeled this way. Any time we think we can in principle solve some problem, it would consist of us looking at some stuff and then presenting the solution in some way. Now we just let the set of all possible instances of such stuff be and the set of all possible solutions be , and we have an instance of this model.
One would clearly have to put in a lot more work to fully formalize this problem – and this is exactly what the book does. Recall that is called the domain set. If is finite as it is for the papayas example, we also call it the label set.
We model the environment, which creates our labeled domain points by a probability distribution over as well as the i.i.d. assumption (not bolded since it’s not ML-specific), which states that all elements are independently and identically distributed according to . In other words, whenever a papaya is plucked in the real world, the environment throws some coins and chooses its color and softness, as well as tastiness, according to . Of course, we generally assume that the color and softness have a major impact on the tastiness of the papaya, but it is not reasonable to assume that they fully determine its tastiness – surely it is possible that two papayas that are equally colored and equally soft taste differently. Thus, given some particular , we might reasonably hope that either or is going to be a lot more probable than the other, but both are still going to be possible. On the other hand, in the chess example, one could argue that the domain set fully determines the value of the target set, i.e. the best possible move (although then the idea of having a probability distribution over becomes problematic). In such a case, one can alternatively model the environment as having a “true” function . Clearly, this approach is strictly less general than the former (it corresponds to a probability distribution where for every , either or has probability 0), but in both cases, the joint distribution / true function is not known, and it is exactly what we are trying to figure out.
We say that the algorithm which outputs a predictor is the learner. As its input (assuming we are doing what is called supervised learning here), the learner has access to a training sequence . This is a sequence of labeled domain points, i.e. which have been generated by the joint distribution . For this reason, one can consider the training sequence to be a glimpse into how the environment works, and the learner’s job to be to generalize from that sequence to all of , i.e. to output a predictor which does a good job on not just the training data but also on new elements that are generated by .
We also define a loss function to formalize what we mean by “good job”. The loss function takes a hypothesis and outputs a number in that is meant to be a rating of how good the hypothesis is, where small values are good and large values are bad. For the case of binary classification (where ), we will set , i.e. the probability that our predictor gets the label on a freshly sampled point wrong (although in practice, one might also be interested in loss functions that penalize false positives more strongly than false negatives, or vice versa). A perfect predictor will have loss 0, but in the setting of a joint distribution , this will generally be impossible. How good the best possible predictor will perform depends on how well the feature vector describes the domain points. With a poor choice of features, the learner is doomed to fail from the beginning (like if you want it to recognize spam emails, but you only show it aspects of the emails such that there is at best a statistically weak correlation between them and whether the email is spam or not-spam). Of course, we cannot actually evaluate our loss function since we don’t know , but it is nonetheless possible to make statements about the expected loss of the predictor which a learner outputs.
Finally, we define an empirical loss function which describes the error of a predictor on the training sequence, i.e. . Unlike the regular loss function, the empirical loss function is something we can evaluate. We use the term true error to refer to the output of , and empirical error to refer to the output of or .
This describes (almost) all objects of the learning model. Now the question is how to implement the learner . Formally, is described as simply an algorithm, so we have complete freedom to do whatever we want.
Generalization / Prior Knowledge / Overfitting
The learner takes a training sequence, does whatever it wants with it, and outputs a predictor. How would we go about defining ? Clearly, this is a difficult question, but a reasonable-sounding starting point is to choose a predictor which performs well on the training sequence . Given that is representative of the environment (recall that it was generated by ), the hope is that such a predictor would also perform well on the environment.
This alone does not answer the question, though. Suppose we are given the following training sequence (or -set as we will ignore order for now):
(Black points tasty instances, red points not-tasty instances.) There are many ways to construct a predictor with zero error on this training set. Here is one (namely the predictor that labels points as tasty iff they are in the blue area):
This does not seem like a good predictor. For example, the green point I added (which is meant to be a new instance not part of the training sequence) would almost certainly correspond to a tasty papaya, yet our hypothesis predicts that it is not tasty. Nonetheless, this predictor has zero error on the training sequence (by the definition of ). An even more extreme example of this is , which simply remembers all domain points and outputs a hypothesis that only labels those instances as positive that have occurred in the training sequence with a positive label. The process of outputting such a hypothesis is often called overfitting. It’s said that the hypothesis fits the data “too well,” as the book puts it. But I think this is not quite the right way to look at it. Consider this alternative hypothesis:
This hypothesis seems better, but not because it is a worse fit for the data – both hypotheses have zero empirical error. There is nothing wrong with fitting the data, there is only something wrong with fitting the data at the cost of outputting an “unreasonable looking” predictor. I think the way to arrive at a more accurate description is by looking at it from a Bayesian perspective. We have a prior on how a proper predictor will look like, and this prior says that the first predictor is worse (as illustrated by how it classifies the green point). However, suppose we had a new example right next to the green point, and it was not-tasty. We would probably still dislike that predictor – the one not-tasty papaya could be an outlier. On the other hand, if we had a million sample points right next to the green one, all of which not-tasty, we would eventually admit that something super weird is really going on here and a very particular softness-color combination actually leads to papayas being not-tasty.
I think the correct framing of the overfit/underfit dilemma is that it is about how far to update away from priors based on evidence, or alternatively, how to weigh priors vs evidence. It’s also worth noting that if we had a training sequence that included all possible instances (assuming there is a true labeling function for the sake of this example), we could disregard priors altogether – in that case, would work work perfectly. Analogously, there is always some amount of evidence that should overturn any prior. Of course, in our formal model, there may be infinitely many possible feature vectors, but in the real world, every instance description is bounded so the space of all instances must be finite.
One crude way of deciding how much to weigh priors vs evidence is by committing to a hypothesis class ahead of time – i.e. to a subset – and then finding a predictor that performs maximally well on the training sequence, only taking into considerations predictors in the class we committed to (i.e., a predictor in the set ). It is crude because it only allows the prior to give a binary “yes” or “no” to a hypothesis, rather than a “probably not, but maybe with enough evidence”. Suppose (for the papaya example), we pre-commit to the class of axis-aligned rectangles, as the book suggests (i.e. rectangles with two horizontal and two vertical sides). Consider the following training sequence:
This does not seem like a super unreasonable glimpse into reality – perhaps it is really the case that if green papayas are too soft, they become gross, but if they’re more orange or whatever, then it’s fine if they’re soft. You might not expect that to be the case, but you’d probably assign it some reasonable probability. This is not allowed by the class of axis-aligned rectangles: cording to this class, the only possible predictor is one where both a certain softness area and a certain color area have to be hit, and one factor does not influence the other. Therefore, if we applied a learning algorithm to this training sequence and restricted it to that class, it would have to fit an axis-aligned rectangle in there as best as possible – which would perform much better than chance, but still seems like a suboptimal choice.
A more sophisticated way of answering the “how strongly to weigh evidence vs priors” question is to choose a family of several hypothesis classes, as well as a weighting function that privileges how much we like each class. Then, perhaps the class of axis-aligned rectangles would have the highest initial weighting, but the class of all rectangles would still be rated decently, so that if it performs substantially better – as is the case here (although it would still have nonzero empirical error) – the learning algorithm would end up preferring a non-axis-aligned rectangle. We can also look at this as having uncertainty about the right model in addition to uncertainty about which predictor within the model to choose. More about this later.
Even though I’ve criticized the pre-commitment approach, it is a good place to begin a more formal analysis. In particular, we can now talk about a hypothesis class being “learnable”. Here, being learnable means that, with sufficiently many training instances, one can prove that, a predictor that performs optimally on the training sequence will probably perform almost as well on the real world as the hypothesis which has the lowest real error in the class. Crucially, this holds for every joint probability distribution .
So this definition is not about how well the hypothesis we end up with performs on the real world in absolute terms – that’s not something which is possible to bound, because the hypothesis class may just not contain any good predictor at all (as is the case in the example above if we use the class of axis-aligned rectangles, and of course other classes could be much worse). Rather, it is about how good of a choice the lowest-empirical-error hypothesis is given that we are stuck with this class – i.e. how large is the difference between the real error of [the hypothesis with minimal empirical error] and the real error of the best hypothesis, i.e. [the hypothesis with minimal real error]. Such a bound is important because, since we can’t measure the real error, we cannot output the best hypothesis directly.
There are also two caveats built into the definition. One is the “almost” part; one can never guarantee that the hypothesis with minimal empirical error is exactly as good as the hypothesis with minimal real error because there is randomness in the creation of the training sequence. The second is the “probably” part, one can never upper bound the difference with full certainty because there is always some chance that the training sequence just happens to be unrepresentative. For example, there is always some chance that the training sequence doesn’t contain any positively labeled points, in which case has no way of reliably outputting anything reasonable. However, as the training sequence gets larger, this becomes increasingly unlikely.
The complete formal definition can be stated thus (where we write to denote the predictor which outputs upon receiving the training sequence ):
A hypothesis class is called PAC learnable iff there exists a learner and a function such that, for any probability distribution over , any “gap” , and any failure probability , if the training sequence consists of at least i.i.d. examples sampled via , then with probability at least over the choice of , it holds that for all .
While this defintion is complicated, it is not unnecessarily complicated – the complexity is just an inherent feature of the thing we are trying to define. This kind of learnability is called PAC learnability because the definition states that is probably (hence the ) approximately (hence the correct (hence the “”).
One can prove that, whenever a class is PAC-learnable at all, then the algorithm which chooses its predictor such that is a successful learner. In other words, we can always just choose the predictor that performs best on the training sequence (or one of them if several are tied). The abbreviation ERM stands for empirical risk minimization.
One can also prove that every finite hypothesis class is PAC learnable. In particular, the sample complexity (this is a name for the output of the function which upper-bounds the number of training points needed to be probability approximately correct) is upper bounded by the term .
Let’s reflect on this result a bit. If we use as our learner, then learnability implies that the difference between the true error of and the true error of the best hypothesis in is probably very small. In particular, if consists of hypotheses, and we want a assurance that the difference between and the best predictor in is at most 0.1, then so we if we have a training sequence with 139 labeled points, we’re good. The only remaining error is then the error of the best hypothesis in . This is also called the approximation error, while the error we just bounded is called the estimation error.
Now consider declaring the class to be the set of all predictors such that there exists a C program, whose compiled code is at most 10000 bits long, that implements the predictor. It’s a safe bet that this class has a very low approximation error for almost all practically relevant cases (or if not then we can allow longer programs). Given that the dependence on the size of – in this case, – is only logarithmic, it increases the sample complexity by a factor smaller than . If we then demand an estimation error of at most 0.01 and some very high confidence (note that the dependence on the confidence is also logarithmic, so even demanding a 0.9999 certainty doesn’t change much), we would still end up with a sample complexity of only around a million. Gathering a million examples is probably doable in many cases (although as George points out, assuming that the i.i.d. assumption holds may not be realistic in practice). Thus, we have just derived a general algorithm which, provided there is a way to obtain sufficiently many i.i.d. data points, solves every binary classification task, namely
1. Perform an exhaustive search over all C programs of compiled length in bits at most 10000
2. Throw out all programs that don’t implement a predictor
3. Compute the empirical risk for each one that does
4. Output the one with the lowest empircal risk.
For the output of this algorithm, both the approximation error (i.e. the error of the best C program) and the estimation error (i.e. the difference between the error of the hypothesis outputs and that of the best C program) are going to be very low!
Back in the real world, step 1 does of course mean that this algorithm is utterly useless in practice, but only due to its runtime. With unbounded computing power, this approach would work wonderfully. In principle, everything (well, at least every binary classification task) is learnable from a relatively small set of examples, certainly fewer examples than what is often available in practice. It doesn’t seem completely unreasonable to consider this some (small) amount of evidence on the importance of talent vs data, and perhaps as decent evidence on the inference power of a really smart AI?
The next post will discuss some theoretical limits of learning and prove the No-Free-Lunch theorem.