RNNs and CNNs are both pretty simple conceptually, and to me they fall into the class of “things I would have invented if I had been working on that problem,” so I suspect that the original inventors knew what they were doing. (Random forests were not as intuitive to me, but then I saw a good explanation and realized what was going on, and again suspect that the inventor knew what they were doing.)
There is a lot of “we threw X at the problem, and maybe it worked?” throughout all of science, especially when it comes to ML (and statistics more broadly), because people don’t really see why the algorithms work.
I remember once learning that someone had discretized a continuous variable so that they could fit a Hidden Markov Model to it. “Why not use a Kalman filter?” I asked, and got back “well, why not use A, B, or C?”. At that point I realized that they didn’t know that a Kalman filter is basically the continuous equivalent of an HMM (and thus obviously more appropriate, especially since they didn’t have any strong reason to suspect non-Gaussianity), and so ended the conversation.
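To make the analogy concrete, here’s a minimal 1-D sketch (my own illustration; the noise parameters `q` and `r` and the observations are invented) of why the Kalman filter is the continuous analogue of the HMM forward algorithm: the belief over the hidden state is a Gaussian (mean, variance) updated by a predict step and an observation step, just as an HMM forward pass updates a discrete probability vector.

```python
# Minimal 1-D Kalman filter: the continuous-state analogue of the HMM
# forward algorithm. The belief over the hidden state stays Gaussian
# (mean, variance) instead of being a discrete probability vector.

def kalman_step(mean, var, z, q=0.1, r=0.5):
    """One predict/update cycle for x_t = x_{t-1} + noise(q),
    observation z_t = x_t + noise(r)."""
    # Predict: propagate the state forward, inflating uncertainty.
    pred_mean, pred_var = mean, var + q
    # Update: blend prediction and observation by their precisions
    # (the Kalman gain), just as the HMM forward step reweights states
    # by the observation likelihood.
    k = pred_var / (pred_var + r)
    new_mean = pred_mean + k * (z - pred_mean)
    new_var = (1 - k) * pred_var
    return new_mean, new_var

mean, var = 0.0, 1.0  # broad initial belief
for z in [1.2, 0.9, 1.1, 1.0]:  # noisy observations of a state near 1.0
    mean, var = kalman_step(mean, var, z)
print(round(mean, 3), round(var, 3))
```

The posterior mean drifts toward the observations while the variance shrinks, which is exactly the "filtering" behavior the discretized HMM was approximating.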
Can you give a link to that explanation of random forests?
Unfortunately I can’t easily find a link to the presentation: it was a talk on Mondrian random forests by Yee Whye Teh back in 2014. I don’t think it was necessarily anything special about the presentation, since I hadn’t put much thought into them before then.
The very short version is it would be nice if classifiers had fuzzy boundaries—if you look at the optimization underlying things like logistic regression, it turns out that if the underlying data is linearly separable it’ll make the boundary as sharp as possible, and put it in a basically arbitrary spot. Random forests will, by averaging many weak classifiers, create one ‘fuzzy’ classifier that gets the probabilities mostly right in a computationally cheap fashion.
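The sharp-boundary behavior can be seen in a tiny experiment (my own toy construction, not from the talk): fit a one-parameter logistic regression on linearly separable data by gradient ascent, and watch the weight grow without bound as training continues, pushing the predicted probabilities toward hard 0/1.

```python
import math

# Linearly separable 1-D data: class 0 at x < 0, class 1 at x > 0.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(steps, lr=0.5):
    # Gradient ascent on the log-likelihood of a no-intercept model p(y=1|x) = sigmoid(w*x).
    w = 0.0
    for _ in range(steps):
        grad = sum((y - sigmoid(w * x)) * x for x, y in data)
        w += lr * grad
    return w

# On separable data every gradient term stays positive, so w keeps growing:
# the longer you train, the sharper the boundary and the more extreme the
# predicted probabilities.
for steps in (10, 100, 1000):
    w = fit(steps)
    print(steps, round(w, 2), round(sigmoid(w * 1.0), 4))
```

A random forest averaged over many such hard classifiers would instead report intermediate probabilities near the boundary, which is the "fuzziness" being described.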
(This comment is way more opaque than I’d like, but most of the ways I’d want to elaborate on it require a chalkboard.)
This is related to making a strong learner (really accurate) out of weak learners (barely better than majority). It is actually somewhat non-obvious this should even be possible.
The famous example here is boosting, and in particular “AdaBoost.” The reason boosting et al. work well is actually kind of interesting and I think still not entirely understood.
I didn’t really get Vaniver’s explanation below, there are margin methods that draw the line in a sensible way that have nothing to do with weak learners at all.
Start with the base model, the decision tree. It’s simple and provides representations that may be actually understandable, which is rare in ML, but it has a problem: it sucks. Well, not always, but for many tasks it sucks. Its main limitation is that it can’t efficiently represent linear relations unless the underlying hyperplane is parallel to one of the input feature axes. And most practical tasks involve linear relations + a bit of non-linearity. Training a decision tree on these tasks tends to yield very large trees that overfit (essentially, you end up storing the training set in the tree which then acts like a lookup table).
Fortunately, it was discovered that if you take a linear combination of the outputs of a sizeable-but-not-exceptionally-large number of appropriately trained decision trees, then you can get good performance on real-world tasks. In fact it turns out that the coefficients of the linear combination aren’t terribly important; a simple average will do.
So the issue is how to appropriately train these decision trees.
You want these trees to be as independent from each other as possible, conditioned on the true relation. This means that ideally you would train each of them on a different training set sampled from the underlying true distribution, that is, you would have to have enough training data for each tree. But training data is expensive (ok, used to be expensive in the pre-big-data era) and we want to learn an effective model from as little data as possible.
The second requirement is that each decision tree must not overfit. In the tradeoff between overfitting and underfitting, you prefer underfitting the individual models, since the model averaging at the end can take care of it.
Random forests use two tricks to fulfill these requirements:
The first one is bootstrap aggregating, aka “bagging”: instead of gathering from the true distribution m training sets of n examples each, one for each of your m decision trees, you generate m-1 alternate sets by resampling your original training set of n examples with replacement. It turns out that, for reasons not entirely well understood, these m datasets behave in many ways as if they were independently sampled from the true distribution.
This is an application of a technique known in statistics as bootstrapping, which has some asymptotic theoretical guarantees under certain conditions that probably don’t apply here, but nevertheless it often works well empirically.
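The resampling step itself is one line with Python's standard library (the toy dataset here is my own invention):

```python
import random

random.seed(0)

# Original training set of n examples (hypothetical toy regression data).
data = [(x, 2 * x + random.gauss(0, 0.1)) for x in range(10)]
n = len(data)

# Bagging: keep the original set and generate m-1 alternates, each a
# resample of the same n examples drawn with replacement.
m = 5
bootstrap_sets = [data] + [random.choices(data, k=n) for _ in range(m - 1)]

# Each resample still has n examples, but typically only ~63% unique ones;
# the duplicates and omissions are what decorrelate the trees.
unique_fracs = [len(set(s)) / n for s in bootstrap_sets]
print(unique_fracs)
```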
The second one is the Random subspace method, which is just a fancy term for throwing features away at random, in a different way for each decision tree.
This makes it more difficult for each decision tree to overfit, since it has a reduced number of degrees of freedom, and specifically it makes it more difficult to get high training accuracy by relying on recognizing some spurious pattern that appears in the training set but is disrupted by throwing some features away. Note that you are not throwing away some features from the whole model. It’s only the internal decision trees that each train on limited information, but the model overall still trains, with high probability, on all the information contained in all features. The individual trees underfit compared to trees trained on all features, but the averaging at the end compensates for this.
Yes, there are some tasks where throwing features away is guaranteed to hurt the accuracy of the individual decision trees to the point of making the task impossible, e.g. the parity problem, but for practical tasks, again for reasons not entirely well understood, it works reasonably well.
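Putting the two tricks together, here is a deliberately minimal sketch (my own toy construction: depth-one “stumps” stand in for full decision trees, and the task, feature count, and candidate thresholds are all invented for illustration) of bagging + random subspaces + simple averaging:

```python
import random

random.seed(1)

# Toy task (assumption): label = 1 iff x0 + x1 > 1, on points in [0,1]^4
# with two informative features and two pure-noise features.
def make_data(n):
    pts = [[random.random() for _ in range(4)] for _ in range(n)]
    labels = [1 if p[0] + p[1] > 1 else 0 for p in pts]
    return pts, labels

def train_stump(X, y, feats):
    """Pick the best single axis-aligned split, restricted to the sampled features."""
    best, best_acc = (feats[0], 0.5, 0, 1), -1.0
    for f in feats:
        for thresh in (0.25, 0.5, 0.75):
            for left, right in ((0, 1), (1, 0)):
                preds = [left if x[f] <= thresh else right for x in X]
                acc = sum(p == t for p, t in zip(preds, y)) / len(y)
                if acc > best_acc:
                    best_acc, best = acc, (f, thresh, left, right)
    return best

def stump_predict(stump, x):
    f, thresh, left, right = stump
    return left if x[f] <= thresh else right

X, y = make_data(200)
m = 25
forest = []
for _ in range(m):
    idx = random.choices(range(len(X)), k=len(X))   # trick 1: bagging
    feats = random.sample(range(4), 2)              # trick 2: random subspace
    forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feats))

def forest_predict(x):
    # Simple averaging of the weak learners' votes.
    return 1 if sum(stump_predict(s, x) for s in forest) / m > 0.5 else 0

Xt, yt = make_data(200)
acc = sum(forest_predict(x) == t for x, t in zip(Xt, yt)) / len(Xt)
print(acc)
```

Any single stump here tops out around 75% accuracy (it can only cut along one axis), but the averaged ensemble does noticeably better, which is the whole point.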
With these tricks, random forests manage to be the state-of-the-art technique for a large class of ML tasks: any supervised task (in particular classification) that is difficult enough that simple linear methods won’t suffice, but not so difficult that you need a very big dataset, where neural networks would dominate (or that is that difficult, but the big dataset is not available). The exception is tasks with logical depth greater than three, like the aforementioned parity problem, although it’s not clear how common these are in practice.
A full understanding of why random forests work would require a Bayesian argument with an assumption about the prior on the data distribution (Solomonoff? Levin? something else?). This is not currently known for random forests and in fact AFAIK has been done only for very simple ML algorithms, using simplifying assumptions on the data distribution such as Gaussianity. If that’s the level of rigor you are looking for then I’m afraid you are not going to find it in the discussion of any practical ML algorithm, at least so far. If you enjoy the discussion of math/statistics-based methods even when some points are only justified by empirical evidence rather than proof, then you may find the field interesting.
I find CNNs a lot less intuitive than RNNs. In which context was training many filters, and successively applying pooling and then filters again to smaller versions of the output, an intuitive idea?
In the context of vision. Pooling is not strictly necessary but makes things go a bit faster—the real trick of CNNs is to lock the weights of different parts of the network together so that you go through the exact same process to recognize objects if they’re moved around (rather than having different processes for recognition for different parts of the image).
Ok, so the motivation is to learn templates to do correlation at each image location with. But where would you get the idea from to do the same with the correlation map again? That seems non-obvious to me. Or do you mean biological vision?
Nope, didn’t mean biological vision. Not totally sure I understand your comment, so let me know if I’m rambling.
You can think of lower layers (the ones closer to the input pixels) as “smaller” or “more local,” and higher layers as “bigger,” or “more global,” or “composed of nonlinear combinations of lower-level features.” (EDIT: In fact, this restricted connectivity of neurons is an important insight of CNNs, compared to full NNs.)
So if you want to recognize horizontal lines, the lowest layer of a CNN might have a “short horizontal line” feature that is big when it sees a small, local horizontal line. And of course there is a copy of this feature for every place you could put it in the image, so you can think of its activation as a map of where there are short horizontal lines in your image.
But if you wanted to recognize longer horizontal lines, you’d need to combine several short-horizontal-line detectors together, with a specific spatial orientation (horizontal!). To do this you’d use a feature detector that looked at the map of where there were short horizontal lines, and found short horizontal lines of short horizontal lines, i.e. longer horizontal lines. And of course you’d need to have a copy of this higher-level feature detector for every place you could put it in the map of where there are short lines, so that if you moved the longer horizontal line around, a different copy of this feature detector would light up—the activation of these copies would form a map of where there were longer horizontal lines in your image.
If you think about the logistics of this, you’ll find that I’ve been lying to you a little bit, and you might also see where pooling comes from. In order for “short horizontal lines of short horizontal lines” to actually correspond to longer horizontal lines, you need to zoom out in spatial dimensions as you go up layers, i.e. pooling or something similar. You can zoom out without pooling by connecting higher-level feature detectors to complete (in terms of the patch of pixels) sets of separated lower-level feature detectors, but this is both conceptually and computationally more complicated.
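The layered-detector story can be made concrete in one dimension (a hand-set toy of my own, not a trained network: a “short segment” filter shared across every position, max-pooling to zoom out, then the same kind of filter applied to the pooled map):

```python
# 1-D toy analogue of stacked detectors. The filters are hand-set so the
# mechanics are visible; a real CNN would learn them.
signal = [0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0]  # 1s = "ink"

def conv(xs, kernel, thresh):
    # The same kernel slides over every position: this weight sharing is
    # what makes detection translation-invariant.
    k = len(kernel)
    return [1 if sum(a * b for a, b in zip(xs[i:i + k], kernel)) >= thresh else 0
            for i in range(len(xs) - k + 1)]

# Layer 1: detect short segments (two adjacent 1s), at every position.
short_map = conv(signal, [1, 1], 2)

# Pooling: zoom out by taking the max over non-overlapping pairs.
pooled = [max(short_map[i:i + 2]) for i in range(0, len(short_map) - 1, 2)]

# Layer 2: a "short run of short runs" in the pooled map corresponds to a
# long segment in the original signal.
long_map = conv(pooled, [1, 1], 2)

print(short_map, pooled, long_map, sep="\n")
```

Only the long run at the start of the signal registers in `long_map`; the isolated two-pixel segment near the end does not, because it never forms a run of short-run detections after pooling.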