I’ve been hearing about all this amazing stuff done with recurrent neural networks, convolutional neural networks, random forests, etc. The problem is that it feels like voodoo to me. “I’ve trained my program to generate convincing looking C code! It gets the indentation right, but the variable use is a bit off. Isn’t that cool?” I’m not sure, it sounds like you don’t understand what your program is doing. That’s pretty much why I’m not studying machine learning right now. What do you think?
ML is search. If you have more parameters, you can do more, but the search problem is harder. Deep NN is a way to parallelize the search problem with # of grad students (by tweaks, etc.), also a general template to guide local-search-via-gradient (e.g. make it look for “interesting” features in the data).
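To make the “local-search-via-gradient” point concrete, here is a minimal sketch (toy data and made-up numbers, not anything from the original comment): it fits a small linear model by repeatedly stepping against the gradient of a squared-error loss. Deep nets do essentially the same thing, just with vastly more parameters and automatic differentiation.

```python
import numpy as np

# Toy illustration of "local search via gradient": minimize a loss by
# repeatedly stepping downhill from an arbitrary starting point.

def loss(w, X, y):
    return np.mean((X @ w - y) ** 2)         # mean squared error

def grad(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)    # gradient of the loss

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])          # made-up "true" parameters
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)                              # start somewhere arbitrary
for step in range(500):
    w -= 0.1 * grad(w, X, y)                 # take a small step downhill

print(loss(w, X, y), w)                      # w ends up close to true_w
```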
I don’t mean to be disparaging, btw. I think it is an important innovation to use human AND computer time intelligently to solve bigger problems.
In some sense it is voodoo (not very interpretable) but so what? Lots of other solutions to problems are, too. Do you really understand how your computer hardware or your OS work? So what if you don’t?
There is research in that direction, particularly on convolutional networks for visual object recognition. It is possible to interpret what a neural net is looking for.
http://yosinski.com/deepvis
I guess the difference is that an RNN might not be understandable even by the person who created and trained it.
There is an interesting angle to this—I think it maps to the difference between (traditional) statistics and data science.
In traditional stats you are used to small, parsimonious models. In these small models each coefficient, each part of the model is separable in a way, it is meaningful and interpretable by itself. The big thing to avoid is overfitting.
In data science (and/or ML) a lot of models are of the sprawling black-box kind where coefficients are not separable and make no sense outside of the context of the whole model. These models aren’t traditionally parsimonious either. Also, because many usual metrics scale badly to large datasets, overfitting has to be managed differently.
Keep in mind that traditional stats also includes semi-parametric and non-parametric methods. These give you models which basically manage overfitting by making complexity scale with the amount of data, i.e. they’re by no means “small” or “parsimonious” in the general case. And yes, they’re more similar to the ML stuff but you still get a lot more guarantees.
I get the impression that ML folks have to be way more careful about overfitting because their methods are not going to find the ‘best’ fit—they’re heavily non-deterministic. This means that an overfitted model has basically no real chance of successfully extrapolating from the training set. This is a problem that traditional stats doesn’t have—in that case, your model will still be optimal in some appropriate sense, no matter how low your measures of fit are.
I think I am giving up on correcting “google/wikipedia experts,” it’s just a waste of time, and a losing battle anyways. (I mean the GP here).
That said, this does not make sense to me. Bias-variance tradeoffs are fundamental everywhere.
I don’t think any one person understands the Linux kernel anymore. It’s just too big. Same with modern CPUs.
An RNN is something that one person can create and then fail to understand. That’s not like the Linux kernel at all.
Correction: An RNN is something that a person working with a powerful general optimizer can create and then fail to understand.
A human without the optimizer can create RNNs by hand—but only of the small and simple variety.
Although the Linux kernel and modern CPUs are piecewise-understandable, whereas neural networks are not.
At the level of an individual vertex, lots of neural networks are just a logistic regression model, or something similar -- I think I understand those pretty well. Similarly: “I think I understand 16-bit adders pretty well.”
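For the record, this is all that “a neural network vertex is basically logistic regression” amounts to in code; the weights and input here are arbitrary placeholders.

```python
import numpy as np

# A single "neuron" with a sigmoid activation is exactly a logistic
# regression model: a weighted sum of inputs pushed through a sigmoid.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    return sigmoid(np.dot(w, x) + b)   # P(y = 1 | x) in logistic regression

x = np.array([0.5, -1.2, 3.0])         # one input example (placeholder values)
w = np.array([0.8, 0.1, -0.4])         # learned weights (placeholder values)
b = 0.2                                # bias term
print(neuron(x, w, b))
```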
I did my PhD thesis on a machine learning problem. I initially used deep learning but after a while I became frustrated with how opaque it was so I switched to using a graphical model where I had explicitly defined the variables and their statistical relationships. My new model worked but it required several months of trying out different models and tweaking parameters, not to mention a whole lot of programming things from scratch. Deep learning is opaque but it has the advantage that you can get good results rapidly without thinking a lot about the problem. That’s probably the main reason that it’s used.
RNNs and CNNs are both pretty simple conceptually, and to me they fall into the class of “things I would have invented if I had been working on that problem,” so I suspect that the original inventors knew what they were doing. (Random forests were not as intuitive to me, but then I saw a good explanation and realized what was going on, and again suspect that the inventor knew what they were doing.)
There is a lot of “we threw X at the problem, and maybe it worked?” throughout all of science, especially when it comes to ML (and statistics more broadly), because people don’t really see why the algorithms work.
I remember once learning that someone had discretized a continuous variable so that they could fit a Hidden Markov Model to it. “Why not use a Kalman filter?” I asked, and got back “well, why not use A, B, or C?”. At that point I realized that they didn’t know that a Kalman filter is basically the continuous equivalent of a HMM (and thus obviously more appropriate, especially since they didn’t have any strong reason to suspect non-Gaussianity), and so ended the conversation.
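For readers who haven’t seen the correspondence: a Kalman filter keeps a Gaussian belief over a continuous hidden state and updates it with each observation, the same predict/update pattern as an HMM’s forward algorithm. A minimal 1-D sketch (all model parameters here are made up for illustration):

```python
# Minimal 1-D Kalman filter: the continuous-state, Gaussian analogue of an
# HMM's forward pass. Each step does a predict (state transition) and an
# update (condition on the new observation), keeping a Gaussian belief.

def kalman_step(mean, var, obs, a=1.0, q=0.1, r=0.5):
    # Predict: x_t = a * x_{t-1} + process noise with variance q
    mean_pred = a * mean
    var_pred = a * a * var + q
    # Update: observe y_t = x_t + measurement noise with variance r
    k = var_pred / (var_pred + r)            # Kalman gain
    mean_new = mean_pred + k * (obs - mean_pred)
    var_new = (1 - k) * var_pred
    return mean_new, var_new

mean, var = 0.0, 1.0                         # initial Gaussian belief
for obs in [0.9, 1.1, 1.4, 1.2]:             # a short observation sequence
    mean, var = kalman_step(mean, var, obs)
    print(mean, var)
```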
Can you give a link to that explanation of random forests?
Unfortunately I can’t easily find a link to the presentation: it was a talk on Mondrian random forests by Yee Whye Teh back in 2014. I don’t think it was necessarily anything special about the presentation, since I hadn’t put much thought into them before then.
The very short version is it would be nice if classifiers had fuzzy boundaries—if you look at the optimization underlying things like logistic regression, it turns out that if the underlying data is linearly separable it’ll make the boundary as sharp as possible, and put it in a basically arbitrary spot. Random forests will, by averaging many weak classifiers, create one ‘fuzzy’ classifier that gets the probabilities mostly right in a computationally cheap fashion.
(This comment is way more opaque than I’d like, but most of the ways I’d want to elaborate on it require a chalkboard.)
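A rough way to see the point without a chalkboard: compare predicted probabilities near the boundary on a separable toy dataset. This is only an illustrative sketch (made-up data, scikit-learn defaults), not a claim about any particular problem.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# On linearly separable data, a (weakly regularized) logistic regression
# pushes predicted probabilities toward 0 and 1 near an essentially
# arbitrary sharp boundary, while averaging many trees tends to give
# softer probabilities for points near the boundary.

rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2.0, size=(200, 2))     # class 0, well separated
X1 = rng.normal(loc=+2.0, size=(200, 2))     # class 1
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

logreg = LogisticRegression(C=1e6).fit(X, y)          # ~no regularization
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

x_near_boundary = np.array([[0.2, 0.0]])
print(logreg.predict_proba(x_near_boundary))          # typically close to 0 or 1
print(forest.predict_proba(x_near_boundary))          # typically softer
```

The exact numbers will vary, but the qualitative contrast between a hard, saturated boundary and a softer averaged one is the point being made above.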
This is related to making a strong learner (really accurate) out of weak learners (barely better than majority). It is actually somewhat non-obvious that this should even be possible.
The famous example here is boosting, and in particular “AdaBoost.” The reason boosting et al. work well is actually kind of interesting and I think still not entirely understood.
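For anyone who wants to poke at it, here is roughly what the AdaBoost recipe looks like with scikit-learn (synthetic data and default settings; the dataset and numbers are placeholders, not from any experiment mentioned here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Boosting in one picture: fit a sequence of very weak learners (depth-1
# "decision stumps"), each paying extra attention to the examples the
# previous ones got wrong, and combine them by weighted vote.

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)            # one weak learner on its own
boosted = AdaBoostClassifier(n_estimators=200, random_state=0)  # default base learner is a stump

print(cross_val_score(stump, X, y, cv=5).mean())       # the stump alone: weak
print(cross_val_score(boosted, X, y, cv=5).mean())     # boosted stumps: usually much better
```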
I didn’t really get Vaniver’s explanation below, there are margin methods that draw the line in a sensible way that have nothing to do with weak learners at all.
Start with the base model, the decision tree. It’s simple and provides representations that may actually be understandable, which is rare in ML, but it has a problem: it sucks. Well, not always, but for many tasks it sucks. Its main limitation is that it can’t efficiently represent linear relations unless the underlying hyperplane is parallel to one of the input feature axes. And most practical tasks involve linear relations + a bit of non-linearity. Training a decision tree on these tasks tends to yield very large trees that overfit (essentially, you end up storing the training set in the tree, which then acts like a lookup table).
Fortunately, it was discovered that if you take a linear combination of the outputs of a sizeable-but-not-exceptionally-large number of appropriately trained decision trees, then you can get good performance on real-world tasks. In fact it turns out that the coefficients of the linear combination aren’t terribly important: a simple average will do.
So the issue is how to appropriately train these decision trees.
You want these trees to be as independent from each other as possible, conditional on the true relation. This means that ideally you would have to train each of them on a different training set sampled from the underlying true distribution, that is, you would have to have enough training data for each tree. But training data is expensive (ok, it used to be expensive in the pre-big-data era) and we want to learn an effective model from as little data as possible.
The second requirement is that each decision tree must not overfit. In the tradeoff between overfitting and underfitting, you prefer underfitting the individual models, since the model averaging at the end can take care of it.
Random forests use two tricks to fulfill these requirements:
The first one is bootstrap aggregating, aka “bagging”: instead of gathering m training sets of n examples each from the true distribution (one for each of your m decision trees), you generate m-1 alternate sets by resampling with replacement from your original training set of n examples. It turns out that, for reasons not entirely well understood, these m datasets behave in many ways as if they were independently sampled from the true distribution.
This is an application of a technique known in statistics as bootstrapping, which has some asymptotic theoretical guarantees under certain conditions that probably don’t apply here, but nevertheless empirically it often works well.
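A bare-bones illustration of bagging (toy synthetic data and scikit-learn trees; everything here is a placeholder, not part of the original explanation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Bagging in miniature: from one training set of n examples, draw bootstrap
# resamples (sample n indices with replacement), fit one tree per resample,
# then average the trees' predicted probabilities.

rng = np.random.default_rng(0)
n, m = 1000, 50
X = rng.normal(size=(n, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy target

trees = []
for _ in range(m):
    idx = rng.integers(0, n, size=n)         # bootstrap resample
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

x_new = rng.normal(size=(1, 5))
avg_prob = np.mean([t.predict_proba(x_new)[0, 1] for t in trees])
print(avg_prob)                              # the bagged ensemble's estimate
```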
The second one is the Random subspace method, which is just a fancy term for throwing features away at random, in a different way for each decision tree.
This makes it more difficult for each decision tree to overfit, since it has a reduced number of degrees of freedom, and specifically it makes it more difficult to get high training accuracy by relying on recognizing some spurious pattern that appears in the training set but is disrupted by throwing some features away. Note that you are not throwing away some features from the whole model. It’s only the internal decision trees that each train on limited information, but the model overall still trains, with high probability, on all the information contained in all features. The individual trees underfit compared to trees trained on all features, but the averaging at the end compensates for this.
Yes, there are some tasks where throwing features away is guaranteed to hurt the accuracy of the individual decision trees to the point of making the task impossible, e.g. the parity problem, but for practical tasks, again for reasons not entirely well understood, it works reasonably well.
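And the same sketch with the random subspace trick added, i.e. each tree also sees only a random subset of the features. Note that library implementations such as scikit-learn’s RandomForestClassifier resample features at each split rather than once per tree, but the idea is the same. Again, all data and numbers below are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Bagging + random subspace: each tree gets a bootstrap resample of the
# examples AND a random subset of the features; predictions are averaged.

rng = np.random.default_rng(0)
n, m, d, k = 1000, 50, 10, 4                 # k features kept per tree
X = rng.normal(size=(n, d))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy target

trees = []
for _ in range(m):
    idx = rng.integers(0, n, size=n)              # bootstrap resample (bagging)
    feats = rng.choice(d, size=k, replace=False)  # random feature subset
    tree = DecisionTreeClassifier().fit(X[idx][:, feats], y[idx])
    trees.append((tree, feats))

x_new = rng.normal(size=(1, d))
avg_prob = np.mean([t.predict_proba(x_new[:, f])[0, 1] for t, f in trees])
print(avg_prob)                              # the forest's averaged estimate
```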
With these tricks random forests manage to be the state-of-the-art technique for a large class of ML tasks: any supervised task (in particular classification) that is difficult enough that simple linear methods won’t suffice, but not so difficult that you need a very big dataset (or that difficult, but with the big dataset unavailable), where neural networks would dominate (or a task that has logical depth greater than three, like the aforementioned parity problem, although it’s not clear how common these are in practice).
A full understanding of why random forests work would require a Bayesian argument with an assumption about the prior on the data distribution (Solomonoff? Levin? something else?). This is not currently known for random forests and in fact AFAIK has been done only for very simple ML algorithms using simplifying assumptions on the data distribution such as Gaussianity. If that’s the level of rigor you are looking for then I’m afraid that you are not going to find it in the discussion of any practical ML algorithm, at least so far. If you enjoy the discussion of math/statistics based methods even if there are some points that are only justified by empirical evidence rather than proof, then you may find the field interesting.
I find CNNs a lot less intuitive than RNNs. In which context was training many filters and successively applying pooling and then filters again to smaller versions of the output an intuitive idea?
In the context of vision. Pooling is not strictly necessary but makes things go a bit faster—the real trick of CNNs is to lock the weights of different parts of the network together so that you go through the exact same process to recognize objects if they’re moved around (rather than having different processes for recognition for different parts of the image).
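A tiny 1-D sketch of the weight-sharing idea: the same small filter is applied at every position, so the detector responds identically wherever the pattern appears. (The filter here is hand-written for illustration; a CNN would learn it, and the 2-D case is the same idea with a 2-D filter.)

```python
import numpy as np

# Weight sharing in one line: slide the SAME small filter across every
# position of the input (a "valid" cross-correlation).

def conv1d_valid(signal, kernel):
    k = len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel)
                     for i in range(len(signal) - k + 1)])

edge_filter = np.array([-1.0, 1.0])          # a tiny "step up" detector
signal = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])
print(conv1d_valid(signal, edge_filter))     # responds wherever the step occurs
```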
Ok, so the motivation is to learn templates with which to do correlation at each image location. But where would you get the idea from to do the same with the correlation map again? That seems non-obvious to me. Or do you mean biological vision?
Nope, didn’t mean biological vision. Not totally sure I understand your comment, so let me know if I’m rambling.
You can think of lower layers (the ones closer to the input pixels) as “smaller” or “more local,” and higher layers as “bigger,” or “more global,” or “composed of nonlinear combinations of lower-level features.” (EDIT: In fact, this restricted connectivity of neurons is an important insight of CNNs, compared to full NNs.)
So if you want to recognize horizontal lines, the lowest layer of a CNN might have a “short horizontal line” feature that is big when it sees a small, local horizontal line. And of course there is a copy of this feature for every place you could put it in the image, so you can think of its activation as a map of where there are short horizontal lines in your image.
But if you wanted to recognize longer horizontal lines, you’d need to combine several short-horizontal-line detectors together, with a specific spatial orientation (horizontal!). To do this you’d use a feature detector that looked at the map of where there were short horizontal lines, and found short horizontal lines of short horizontal lines, i.e. longer horizontal lines. And of course you’d need to have a copy of this higher-level feature detector for every place you could put it in the map of where there are short lines, so that if you moved the longer horizontal line around, a different copy of this feature detector would light up—the activation of these copies would form a map of where there were longer horizontal lines in your image.
If you think about the logistics of this, you’ll find that I’ve been lying to you a little bit, and you might also see where pooling comes from. In order for “short horizontal lines of short horizontal lines” to actually correspond to longer horizontal lines, you need to zoom out in spatial dimensions as you go up layers, i.e. pooling or something similar. You can zoom out without pooling by connecting higher-level feature detectors to complete (in terms of the patch of pixels) sets of separated lower-level feature detectors, but this is both conceptually and computationally more complicated.
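Here is a hand-written 1-D toy of the “features of features” story, including the zoom-out step: detect short runs, pool to zoom out, then detect runs of runs. Real CNNs learn the filters; these are chosen by hand purely for illustration.

```python
import numpy as np

# Layer 1 detects short runs of ones, pooling zooms out by a factor of 2,
# and layer 2 (looking at the pooled map) detects runs of runs, i.e. longer
# runs in the original input.

def correlate(signal, kernel):
    k = len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel)
                     for i in range(len(signal) - k + 1)])

def max_pool(signal, size=2):
    n = len(signal) // size * size
    return signal[:n].reshape(-1, size).max(axis=1)

image_row = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0], dtype=float)

short_line = correlate(image_row, np.ones(2) / 2)   # layer 1: short-run detector
pooled = max_pool(short_line)                       # zoom out spatially
long_line = correlate(pooled, np.ones(3) / 3)       # layer 2: runs of runs
print(long_line)                                    # largest where the long run was
```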
Clarke’s Third Law :-)
Anyway, you gain understanding of complicated techniques by studying them and practicing them. You won’t understand them unless you study them—so I’m not sure why you are complaining about lack of understanding before even trying.
That is ambiguous. Do you mean the final output program or the ML program?
Most ML programs seem pretty straightforward to me (search, as Ilya said); the black magic is the choice of hyperparameters. How do people know how many layers they need? Also, I think time to learn is a bit opaque, but probably easy to measure. In particular, by mentioning both CNN and RNN, you imply that the C and R are mysterious, while they seem to me the most comprehensible part of the choices.
But your further comments suggest that you mean the program generated by the ML algorithms. This isn’t new. Genetic algorithms and neural nets have been producing incomprehensible results for decades. What has changed is that new learning algorithms have pushed neural nets further and judicious choice of hyperparameters have allowed them to exploit more data and more computer power, while genetic algorithms seem to have run out of steam. The bigger the network or algorithm that is the output, the more room for it to be incomprehensible.
What this is really saying is: “Hey, convincing-looking C code can be modeled by a RNN, i.e. a state-transition version (“recurrent”) of a complex non-linear model which is ultimately a generalization of logistic regression (“neural network”)! And the model can be practically ‘learned’, i.e. fitted empirically, albeit with no optimality or accuracy guarantees of any kind. The variable use is a bit off, though. Isn’t this cool/Does this tell us anything important?”
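For concreteness, this is the kind of machinery being described: a character-level RNN cell whose output layer is essentially multiclass logistic regression over the next character. The vocabulary and weights below are arbitrary placeholders; in the C-code demo they would be fitted to a corpus by gradient descent.

```python
import numpy as np

# A bare-bones character-level RNN cell: a recurrent hidden state plus a
# softmax readout (i.e. multiclass logistic regression) over next characters.
# Weights are random placeholders, NOT a trained model.

rng = np.random.default_rng(0)
vocab = list("abc{};\n ")                      # toy character vocabulary
V, H = len(vocab), 16                          # vocab size, hidden size

W_xh = rng.normal(scale=0.1, size=(H, V))      # input -> hidden
W_hh = rng.normal(scale=0.1, size=(H, H))      # hidden -> hidden (the "recurrent" part)
W_hy = rng.normal(scale=0.1, size=(V, H))      # hidden -> next-character logits

def step(h_prev, char_index):
    x = np.zeros(V); x[char_index] = 1.0       # one-hot input character
    h = np.tanh(W_xh @ x + W_hh @ h_prev)      # new hidden state
    logits = W_hy @ h
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax readout
    return h, probs

h = np.zeros(H)
for ch in "ab{":                               # feed a few characters
    h, probs = step(h, vocab.index(ch))
print(probs)                                   # distribution over the next character
```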
Is it for reasons similar to the Strawman Chomsky view in this essay by Peter Norvig?
Yeah. Maybe Norvig is right and it’s much easier to implement Google Translate with what I call “voodoo” than without it. That’s a good point, I need to think some more.
-- Letter from James Clerk Maxwell to Michael Faraday, in the setup of a Steam Punk universe I just now invented
Here’s how I read your question.
(1) Many machine learning techniques work, but in ways we don’t really understand.
(2) If (1), I shouldn’t study machine learning.
I agree with (1). Could you explain (2)? Is it that you would want to use neural networks etc. to gain insight about other concrete problems, and question their usefulness as a tool in that regard? Is it that you would not like to use a magical black box as part of a production system?
EDIT: I’m using “machine learning” here to mean the sort of fuzzy black-box techniques that don’t have easy interpretations, not techniques like logistic regression where it is clearer what they do.
I agree that this is a huge problem, but RNNs and CNNs aren’t the whole of ML (random forests are a different category of algorithm). You should study the ML that has the prettiest math. Try VC theory, Pearl’s work on graphical models, AIT, and MaxEnt as developed by Jaynes and applied by della Pietra to statistical machine translation. Hinton’s early work on topics like Boltzmann machines and Wake-Sleep algorithm is also quite “deep”.
Yeah, I suppose our instincts agree, because I’ve already studied all these things except the last two :-)
Have fun with generative models such as variational Bayesian neural networks, generative adversarial networks, applications of Fokker–Planck/Langevin/Hamiltonian dynamics to ML and NNs in particular, and so on. There are certainly lots of open problems for the mathematically inclined which are much more interesting than “Look ma, my neural networks made psychedelic artwork and C-looking code with more or less matched parentheses”.
For instance, this paper provides pointers to some of these methods and describes a class of failure modes that are still difficult to address.
There are some pretty amazing, actually useful applications for larger and larger feasible ML spaces. Everyone studying CS or seriously undertaking any computer engineering should at least learn the fundamentals (I’d recommend the Coursera ML class).
And most should not spend a huge fraction of their study time on it unless it really catches their fancy. But rather than saying “that’s why I’m not studying ML right now”, I’d like to hear the X in “that’s why I’m focusing on X over ML right now”.
The trippy pictures and the vaguely C-looking code are just cool stunts, not serious experiments. People may be tempted to fall for the hype; sometimes a reality check is helpful.
That said, neural networks really do well on difficult tasks such as visual object recognition and machine translation, indeed for reasons that are not fully understood.
Sounds like a good reason to study the field in order to understand why they can do what they do, and why they can’t do what they can’t do, doesn’t it?
Might want to take a look at the library Google just open-sourced: http://tensorflow.org/