Method | Accuracy (%)
---|---

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (Nov 2018, arXiv 2018) | 91.3%

GPipe is a scalable pipeline parallelism library that enables learning of giant deep neural networks. It partitions network layers across accelerators and pipelines execution to achieve high hardware utilization. It leverages recomputation to minimize activation memory usage. For example, using partitions over 8 accelerators, it is able to train networks that are 25x larger, demonstrating its scalability. It also guarantees that the computed gradients remain consistent regardless of the number of partitions. It achieves an almost linear speedup without any changes in the model parameters: when using 4x more accelerators, training the same model is up to 3.5x faster. We train a 557-million-parameter AmoebaNet model on ImageNet and achieve a new state-of-the-art 84.3% top-1 / 97.0% top-5 accuracy on the ImageNet 2012 dataset. Finally, we use this learned model to fine-tune on multiple popular image classification datasets and obtain competitive results, including pushing the CIFAR-10 accuracy to 99% and CIFAR-100 accuracy to 91.3%.
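As a toy illustration of the micro-batching idea in the abstract above (not the GPipe library itself, and with a trivial two-layer linear "network" standing in for a real model), the sketch below splits a mini-batch into micro-batches across two hypothetical pipeline stages and checks that the accumulated micro-batch gradients equal the full-batch gradients:

```python
import numpy as np

# Toy sketch: a two-stage linear model whose mini-batch is split into
# micro-batches, mimicking GPipe's pipeline schedule. Gradients are
# accumulated across micro-batches, illustrating the claim that the
# computed gradients stay consistent regardless of the partitioning.

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))        # stage 1, placed on accelerator 1
W2 = rng.normal(size=(8, 3))        # stage 2, placed on accelerator 2
X = rng.normal(size=(16, 4))
Y = rng.normal(size=(16, 3))

def grads(x, y):
    """Forward and backward through both stages for one (micro-)batch.
    The loss is the *summed* squared error, so micro-batch gradients
    add up to exactly the full-batch gradients."""
    h = x @ W1                       # stage 1 forward
    out = h @ W2                     # stage 2 forward
    d_out = 2.0 * (out - y)          # d(sum((out - y)^2)) / d(out)
    dW2 = h.T @ d_out                # stage 2 backward
    dW1 = x.T @ (d_out @ W2.T)       # stage 1 backward
    return dW1, dW2

# Full-batch gradients computed in one shot.
full_dW1, full_dW2 = grads(X, Y)

# The same batch split into 4 micro-batches, gradients accumulated.
acc_dW1, acc_dW2 = np.zeros_like(W1), np.zeros_like(W2)
for xm, ym in zip(np.split(X, 4), np.split(Y, 4)):
    dW1, dW2 = grads(xm, ym)
    acc_dW1 += dW1
    acc_dW2 += dW2
```

Because the loss is summed (not averaged) per micro-batch, the accumulation is exact; with a mean loss one would instead weight each micro-batch gradient by its share of the batch.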

AutoAugment: Learning Augmentation Policies from Data (May 2018, arXiv 2018) | 89.33%

In this paper, we take a closer look at data augmentation for images, and describe a simple procedure called AutoAugment to search for improved data augmentation policies. Our key insight is to create a search space of data augmentation policies, evaluating the quality of a particular policy directly on the dataset of interest. In our implementation, we have designed a search space where a policy consists of many sub-policies, one of which is randomly chosen for each image in each mini-batch. A sub-policy consists of two operations, each operation being an image processing function such as translation, rotation, or shearing, and the probabilities and magnitudes with which the functions are applied. We use a search algorithm to find the best policy such that the neural network yields the highest validation accuracy on a target dataset. Our method achieves state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, and ImageNet (without additional data). On ImageNet, we attain a Top-1 accuracy of 83.54%. On CIFAR-10, we achieve an error rate of 1.48%, which is 0.65% better than the previous state-of-the-art. Finally, policies learned from one dataset can be transferred to work well on other similar datasets. For example, the policy learned on ImageNet allows us to achieve state-of-the-art accuracy on the fine-grained visual classification dataset Stanford Cars, without fine-tuning weights pre-trained on additional data. Code to train Wide-ResNet, Shake-Shake and ShakeDrop models with AutoAugment policies can be found at https://github.com/tensorflow/models/tree/master/research/autoaugment
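The policy structure described above (policy → sub-policies → two (operation, probability, magnitude) triples, one sub-policy drawn at random per image) can be sketched as follows. This is an illustrative skeleton, not the searched policies from the paper: the operations here are simplified numpy stand-ins for the paper's PIL-based image ops, and the example policy is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simplified stand-in operations (the paper uses translation, rotation,
# shearing, color ops, etc.); each takes (image, magnitude).
OPS = {
    "translate_x": lambda img, mag: np.roll(img, mag, axis=1),
    "translate_y": lambda img, mag: np.roll(img, mag, axis=0),
    "invert":      lambda img, mag: 255 - img,   # magnitude unused
}

# A hypothetical learned policy: each sub-policy is two
# (operation, probability, magnitude) triples.
policy = [
    [("translate_x", 0.8, 4), ("invert", 0.3, 0)],
    [("translate_y", 0.6, 2), ("translate_x", 0.5, 6)],
]

def autoaugment(img, policy):
    # One sub-policy is chosen at random per image per mini-batch.
    sub_policy = policy[rng.integers(len(policy))]
    for name, prob, mag in sub_policy:
        if rng.random() < prob:          # apply each op with its probability
            img = OPS[name](img, mag)
    return img

img = rng.integers(0, 256, size=(32, 32, 3))   # fake CIFAR-sized image
aug = autoaugment(img, policy)
```

The search algorithm in the paper then scores candidate policies by the validation accuracy of a child network trained with them.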

ShakeDrop Regularization (Feb 2018, ICLR 2018) | 87.81%

This paper proposes a powerful regularization method named ShakeDrop regularization. ShakeDrop is inspired by Shake-Shake regularization, which decreases error rates by disturbing learning. While Shake-Shake can be applied only to ResNeXt, which has multiple branches, ShakeDrop can be applied not only to ResNeXt but also to ResNet, Wide ResNet, and PyramidNet in a memory-efficient way. An important and interesting feature of ShakeDrop is that it strongly disturbs learning by multiplying the output of a convolutional layer by even a negative factor in the forward training pass. The effectiveness of ShakeDrop is confirmed by experiments on the CIFAR-10/100 and Tiny ImageNet datasets.
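A minimal numpy sketch of the ShakeDrop forward rule, assuming a single per-batch scalar gate (the paper also considers per-sample and per-channel variants and an expected-keep probability that decays with depth): a residual block outputs `x + gate * F(x)`, where during training `gate = b + alpha - b*alpha` with `b ~ Bernoulli(p)` and `alpha ~ U[-1, 1]`. When `b = 0` the branch is scaled by `alpha`, which can be negative, the strong disturbance the abstract describes; at test time the gate is replaced by its expectation, which is `p` since `E[alpha] = 0`.

```python
import numpy as np

rng = np.random.default_rng(0)

def shakedrop(x, branch_out, p=0.5, training=True):
    """Sketch of a ShakeDrop residual connection: x + gate * F(x)."""
    if training:
        b = float(rng.random() < p)       # b = 1 keeps the branch unchanged
        alpha = rng.uniform(-1.0, 1.0)    # b = 0 scales it by alpha (can be < 0)
        gate = b + alpha - b * alpha
    else:
        gate = p                          # E[b + alpha - b*alpha] = p
    return x + gate * branch_out

x = rng.normal(size=(8,))
f_x = rng.normal(size=(8,))               # stand-in for a conv branch output
train_out = shakedrop(x, f_x, training=True)
test_out = shakedrop(x, f_x, training=False)
```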

Improved Regularization of Convolutional Neural Networks with Cutout (Aug 2017, arXiv 2017) | 84.80%

Convolutional neural networks are capable of learning powerful representational spaces, which are necessary for tackling complex learning tasks. However, due to the model capacity required to capture such representations, they are often susceptible to overfitting and therefore require proper regularization in order to generalize well. In this paper, we show that the simple regularization technique of randomly masking out square regions of input during training, which we call cutout, can be used to improve the robustness and overall performance of convolutional neural networks. Not only is this method extremely easy to implement, but we also demonstrate that it can be used in conjunction with existing forms of data augmentation and other regularizers to further improve model performance. We evaluate this method by applying it to current state-of-the-art architectures on the CIFAR-10, CIFAR-100, and SVHN datasets, yielding new state-of-the-art results with almost no additional computational cost. We also show improved performance in the low-data regime on the STL-10 dataset.
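The masking step is simple enough to show directly. A minimal numpy sketch of cutout: zero out a square patch at a random location during training. As in the paper, the patch center may fall near the image border, so the visible masked area can be smaller than `size * size`.

```python
import numpy as np

rng = np.random.default_rng(0)

def cutout(img, size):
    """Return a copy of img with a random size x size square zeroed out."""
    h, w = img.shape[:2]
    cy, cx = int(rng.integers(h)), int(rng.integers(w))   # random patch center
    y0, y1 = max(cy - size // 2, 0), min(cy + size // 2, h)
    x0, x1 = max(cx - size // 2, 0), min(cx + size // 2, w)
    out = img.copy()
    out[y0:y1, x0:x1] = 0          # mask the square region with zeros
    return out

img = np.ones((32, 32, 3))         # fake CIFAR-sized image
masked = cutout(img, size=8)
```

Here the patch is filled with zeros; filling with the dataset mean is an equivalent choice when inputs are normalized.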

Drop-Activation: Implicit Parameter Reduction and Harmonic Regularization (Nov 2018, arXiv 2018) | 83.80%

Overfitting frequently occurs in deep learning. In this paper, we propose a novel regularization method called Drop-Activation to reduce overfitting and improve generalization. The key idea is to *drop* nonlinear activation functions by setting them to be identity functions randomly during training time. During testing, we use a deterministic network with a new activation function to encode the average effect of dropping activations randomly. Experimental results on CIFAR-10, CIFAR-100, SVHN, and EMNIST show that Drop-Activation generally improves the performance of popular neural network architectures. Furthermore, unlike dropout, as a regularizer Drop-Activation can be used in harmony with standard training and regularization techniques such as Batch Normalization and AutoAugment. Our theoretical analyses support the regularization effect of Drop-Activation as implicit parameter reduction and its capability to be used together with Batch Normalization.
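A sketch of Drop-Activation for a ReLU layer (the per-element keep decision is an assumption of this sketch): during training each nonlinearity is kept with probability `p` and replaced by the identity otherwise; at test time a deterministic mixture `p * ReLU(x) + (1 - p) * x` encodes the average effect of the random training behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_activation(x, p=0.95, training=True):
    """Sketch of Drop-Activation applied to ReLU."""
    relu = np.maximum(x, 0.0)
    if training:
        keep = rng.random(x.shape) < p      # keep the ReLU, else identity
        return np.where(keep, relu, x)
    return p * relu + (1.0 - p) * x         # deterministic test-time activation

x = rng.normal(size=(4, 4))
train_out = drop_activation(x, training=True)
test_out = drop_activation(x, training=False)
```

The test-time function is a leaky-ReLU-like map whose negative slope `1 - p` matches the expected fraction of dropped activations.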

Densely Connected Convolutional Networks (Aug 2016, arXiv 2016) | 82.82%

Random Erasing Data Augmentation (Aug 2017, arXiv 2017) | 82.35%

Wide Residual Networks (May 2016, arXiv 2017) | 81.70%

Residual Networks of Residual Networks: Multilevel Residual Networks (Aug 2016, arXiv 2017) | 80.27%

Residual Attention Network for Image Classification (Apr 2017, arXiv 2017) | 79.55%

Fast and Accurate Deep Network Learning by Exponential Linear Units (Nov 2015) | 75.72%

Spatially-sparse convolutional neural networks (Sep 2014) | 75.7%

Fractional Max-Pooling (Dec 2014) | 73.61%

Scalable Bayesian Optimization Using Deep Neural Networks (Feb 2015, ICML 2015) | 72.60%

Competitive Multi-scale Convolution (Nov 2015) | 72.44%

All you need is a good init (Nov 2015, ICLR 2016) | 72.34%

Batch-normalized Maxout Network in Network (Nov 2015) | 71.14%

On the Importance of Normalisation Layers in Deep Learning with Piecewise Linear Activation Units (Aug 2015) | 70.80%

Learning Activation Functions to Improve Deep Neural Networks (Dec 2014, ICLR 2015) | 69.17%

Stacked What-Where Auto-encoders (Jun 2015) | 69.12%

Multi-Loss Regularized Deep Neural Network (CSVT 2015) | 68.53%

Spectral Representations for Convolutional Neural Networks (Jun 2015, NIPS 2015) | 68.40%

Recurrent Convolutional Neural Network for Object Recognition (CVPR 2015) | 68.25%

Training Very Deep Networks (Jul 2015, NIPS 2015) | 67.76%

Deep Convolutional Neural Networks as Generic Feature Extractors (IJCNN 2015) | 67.68%

Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree (Sep 2015, AISTATS 2016) | 67.63%

HD-CNN: Hierarchical Deep Convolutional Neural Network for Large Scale Visual Recognition (ICCV 2015) | 67.38%

Universum Prescription: Regularization using Unlabeled Data (Nov 2015) | 67.16%

Striving for Simplicity: The All Convolutional Net (Dec 2014, ICLR 2015) | 66.29%

Deep Networks with Internal Selective Attention through Feedback Connections (Jul 2014, NIPS 2014) | 66.22%

Deeply-Supervised Nets (Sep 2014) | 65.43%

Deep Representation Learning with Target Coding (AAAI 2015) | 64.77%

Network in Network (Dec 2013, ICLR 2014) | 64.32%

Discriminative Transfer Learning with Tree-based Priors (NIPS 2013) | 63.15%

Improving Deep Neural Networks with Probabilistic Maxout Units (Dec 2013, ICLR 2014) | 61.86%

Maxout Networks (Feb 2013, ICML 2013) | 61.43%

Stable and Efficient Representation Learning with Nonnegativity Constraints (ICML 2014) | 60.8%

Empirical Evaluation of Rectified Activations in Convolution Network (May 2015, ICML workshop 2015) | 59.75%

Stochastic Pooling for Regularization of Deep Convolutional Neural Networks (Jan 2013) | 57.49%

Learning Smooth Pooling Regions for Visual Recognition (BMVC 2013) | 56.29%

Beyond Spatial Pyramids: Receptive Field Learning for Pooled Image Features (CVPR 2012) | 54.23%