Valiant is not talking specifically about AdaBoost, although AdaBoost was the first of these algorithms and is well known due to its wide proliferation. See this, which succinctly describes the differences among some of the boosters out there. In particular, the linked paper by Philip Long at Google is really nice for showing the limitations of boosters and for understanding that boosters are really nothing more than a specialized gradient descent if you recast them in the right way.
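To make the "specialized gradient descent" recasting concrete, here is a minimal sketch (my own illustrative choices, not taken from the linked paper: squared loss and a hand-rolled stump learner) of boosting viewed as gradient descent in function space. Each round fits a weak learner to the negative gradient of the loss at the current model and takes a small step along it; AdaBoost pops out of the same recipe with exponential loss in place of squared loss.

```python
import numpy as np

def fit_stump(X, r):
    """Least-squares decision stump: one feature, one threshold,
    predicting the mean residual on each side of the split."""
    best, best_sse = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            lv, rv = r[left].mean(), r[~left].mean()
            sse = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
            if sse < best_sse:
                best_sse, best = sse, (j, t, lv, rv)
    return best

def stump_value(stump, X):
    j, t, lv, rv = stump
    return np.where(X[:, j] <= t, lv, rv)

def gradient_boost(X, y, rounds=100, lr=0.1):
    """Boosting as gradient descent in function space: each round fits a
    weak learner to the negative gradient of the loss at the current model
    (for squared loss, just the residual) and takes a small step along it."""
    F = np.full(len(y), y.mean())           # start from a constant model
    stumps = []
    for _ in range(rounds):
        residual = y - F                    # -dL/dF for squared loss
        stump = fit_stump(X, residual)
        F = F + lr * stump_value(stump, X)  # the "gradient step"
        stumps.append(stump)
    return y.mean(), stumps

def gb_predict(base, stumps, X, lr=0.1):
    return base + lr * sum(stump_value(s, X) for s in stumps)

# toy usage: fit y = sin(x) on a 1-d grid
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X[:, 0])
base, stumps = gradient_boost(X, y)
print(np.abs(gb_predict(base, stumps, X) - y).mean())
```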
I’m not sure boosting is the most powerful general learning method now known; the support vector machine seems more powerful.
This is not what is meant here. SVM is a classification rule itself, whereas boosting is a metarule that operates on classification rules and attempts to make coherent use of multiple decision rules, each with a different degree of confidence and error. It makes no sense to compare the usefulness of SVMs to the usefulness of boosting; boosting operates on SVMs. To boot, generalized kernel learning methods, sparse dictionary coding, bag-of-words, and Reproducing Kernel Hilbert Space methods all have many cases where they are vastly superior to SVM. For that matter, even simpler methods like Fisher Linear Discriminant can outperform SVM in a lot of practical cases. And SVM has little extension to fully unsupervised learning.
I think Valiant, whose office sits down the hall from my adviser’s and with whom I have frequent conversations, is on the money with this stuff.
It makes no sense to compare the usefulness of SVMs to the usefulness of boosting
If an SVM outperforms a boosted-whatever, then it does make sense to compare them.
boosting operates on SVMs
Except that in practice no one uses SVMs as the base learners for boosting (as far as I know). I don’t think it would work very well, since basic SVMs are linear models, and a weighted sum of linear models is still just a linear model. Boosting is usually done with decision trees or decision stumps.
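For instance, here is a minimal sketch of the usual setup (scikit-learn assumed; the dataset is a synthetic toy, purely for illustration), where the weak learner is a depth-1 decision tree, i.e. a decision stump:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# toy data, purely for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# AdaBoostClassifier's default weak learner is a depth-1 decision tree,
# i.e. a decision stump; boosting then builds a weighted vote over many
# such non-linear weak rules.
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())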
bag-of-words
That is a feature representation, and it has little to do with the learning method. You could encode a text as bag-of-words, and train an SVM on these features.
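A minimal sketch of that split between representation and learner (scikit-learn assumed; the four-document corpus and its labels are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# toy corpus and labels, purely for illustration
texts = ["boosting combines weak rules",
         "boosting reweights hard examples",
         "kernels map data implicitly",
         "kernel methods use inner products"]
labels = [0, 0, 1, 1]

# CountVectorizer produces the bag-of-words features; the classifier on
# top (here a linear SVM) is an entirely separate choice.
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["weak rules get reweighted"]))
```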
Reproducing Kernel Hilbert Space methods
Kernel SVM *is* an RKHS method; in fact, it is basically the prototypical one.
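To spell that out (a standard fact, not specific to this thread): by the representer theorem, the function a kernel SVM learns is a finite kernel expansion, which is precisely an element of the RKHS induced by the kernel $k$:

$$ f(x) = \sum_{i=1}^{n} \alpha_i y_i \, k(x_i, x) + b, $$

with $\alpha_i \neq 0$ only for the support vectors.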
bag-of-words
That is a feature representation, and it has little to do with the learning method. You could encode a text as bag-of-words, and train an SVM on these features.
Yes, sure, but the most generic way is just to look at a histogram distance between word occurrences. I guess that would generically fall under k-means or similar methods, but that’s what I was referring to by citing bag-of-words as a method on its own. Of course you can mix and match and cascade all of these to produce different methods.
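Roughly what I mean, as a minimal sketch (scikit-learn and numpy assumed; the corpus is the same toy one as above, and the L1 distance is just one possible histogram distance):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# toy corpus, purely for illustration
texts = ["boosting combines weak rules",
         "boosting reweights hard examples",
         "kernels map data implicitly",
         "kernel methods use inner products"]

counts = CountVectorizer().fit_transform(texts).toarray().astype(float)
hists = counts / counts.sum(axis=1, keepdims=True)  # word-occurrence histograms

# an L1 (total-variation-style) histogram distance between two documents
print(np.abs(hists[0] - hists[2]).sum())

# or cluster the histograms directly, e.g. with k-means
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(hists))
```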