Human performance on image-recognition surpassed by MSR? “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, He et al 2015 (Reddit; emphasis added):

Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.
(Surprised it wasn’t a Baidu team who won.) I suppose now we’ll need even harder problem sets for deep learning… Maybe video? Doesn’t seem like a lot of work on that yet compared to static image recognition.
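To make the abstract’s two ingredients concrete, here is a minimal NumPy sketch (my own illustration, not the authors’ code) of a PReLU activation and of the rectifier-aware initialization it describes; the 0.25 initial slope follows the paper’s choice, while the layer sizes and names are arbitrary:

```python
import numpy as np

def prelu(x, a):
    """Parametric ReLU: f(x) = x for x > 0 and a*x otherwise, where the
    negative-side slope `a` is a learned parameter (one per channel).
    a = 0 recovers plain ReLU; a small fixed constant gives Leaky ReLU."""
    return np.where(x > 0, x, a * x)

def he_init(fan_in, fan_out, rng):
    """Rectifier-aware initialization: zero-mean Gaussian weights with
    std sqrt(2 / fan_in), chosen so activation variance neither shrinks
    nor blows up with depth when each layer feeds a rectifier."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# Toy forward pass: one fully connected layer followed by PReLU.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 256))   # a mini-batch of 4 inputs
W = he_init(256, 128, rng)
a = np.full(128, 0.25)          # initial slope; learned jointly with W in the paper
h = prelu(x @ W, a)
print(h.shape)                  # (4, 128)
```

The sketch shows only the forward computation; in the paper the slopes are updated by backpropagation like any other weight.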
The record has apparently been broken again: “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” (HN, Reddit), Ioffe & Szegedy 2015:

Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
...The current reported best result on the ImageNet Large Scale Visual Recognition Competition is reached by the Deep Image ensemble of traditional models (Wu et al., 2015). Here we report a top-5 validation error of 4.9% (and 4.82% on the test set), which improves upon the previous best result despite using 15X fewer parameters and lower resolution receptive field. Our system exceeds the estimated accuracy of human raters according to Russakovsky et al. (2014).
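As a concrete picture of what “normalizing layer inputs for each mini-batch” means, here is a minimal NumPy sketch of the batch-normalization transform (training-mode forward pass only; the function name and toy shapes are my own):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch Normalization over a mini-batch `x` of shape (batch, features):
    standardize each feature using the mini-batch mean and variance, then
    apply a learned scale (gamma) and shift (beta) so the layer can still
    represent the identity transform if training prefers it."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Toy usage: whiten badly scaled pre-activations of a 128-unit layer.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(32, 128))
gamma, beta = np.ones(128), np.zeros(128)   # learned parameters, at their usual init
y = batchnorm_forward(x, gamma, beta)
# At test time the per-batch mean/variance are replaced by running
# (population) estimates accumulated during training.
```

Because the statistics are recomputed per mini-batch, each layer sees inputs with a stable distribution regardless of how the preceding layers’ weights drift during training, which is what lets the authors use much larger learning rates.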
On the human-level accuracy rate:

… About ~3% is an optimistic estimate without my “silly errors”.
...I don’t at all intend this post to somehow take away from any of the recent results: I’m very impressed with how quickly multiple groups have improved from 6.6% down to ~5% and now also below! I did not expect to see such rapid progress. It seems that we’re now surpassing a dedicated human labeler. And imo, when we are down to 3%, we’d be matching the performance of a hypothetical super-dedicated fine-grained expert human ensemble of labelers.