Problems of Large Deep Neural Networks
⇒ Solution: mixup
Shared commonalities of successful large deep neural networks:
⇒ they are trained to minimize their average error over the training data (the ERM principle)
⇒ their size scales linearly with the number of training examples
⇒ Contradiction!
Classical result in learning theory (Vapnik & Chervonenkis, 1971):
the convergence of ERM is guaranteed as long as the size of the learning machine does not increase with the number of training data.
Challenges to the suitability of ERM:
ERM allows large neural networks to memorize the training data even in the presence of strong regularization (Zhang et al., 2017)
Neural networks trained with ERM change their predictions drastically when evaluated on adversarial examples (Szegedy et al., 2014)
→ ERM might be unable to explain or provide generalization on testing distributions that differ only slightly from the training data
How can we train on examples similar to, but different from, the training data? → Data Augmentation
Additional virtual examples can be drawn from the vicinity distribution of the training examples to enlarge the support of the training distribution
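As a concrete instance of sampling from a vicinity distribution, here is a minimal sketch of a Gaussian vicinity (additive-noise augmentation, as considered by Chapelle et al., 2000); the function name and parameters are illustrative:

```python
import numpy as np

def gaussian_vicinity(x, y, sigma=0.1, k=4, rng=None):
    """Draw k virtual examples from a Gaussian vicinity of (x, y):
    perturb the features with additive noise and keep the label fixed,
    so the vicinity of an example stays within its own class."""
    if rng is None:
        rng = np.random.default_rng()
    x_virtual = x + sigma * rng.normal(size=(k,) + np.shape(x))
    y_virtual = np.broadcast_to(y, (k,) + np.shape(y)).copy()
    return x_virtual, y_virtual
```

Note that the noise scale `sigma` must be tuned per dataset, and the label is simply copied; this foreshadows the two shortcomings listed next.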
Shortcomings of conventional data augmentation methods:
⇒ the procedure is dataset-dependent, and thus requires the use of expert knowledge
⇒ does not model the vicinity relation across examples of different classes
Simple and Data-agnostic Data Augmentation routine: mixup
mixup constructs virtual training examples as convex combinations of pairs of examples and their labels:

$$\tilde{x} = \lambda\, x_i + (1 - \lambda)\, x_j, \qquad \tilde{y} = \lambda\, y_i + (1 - \lambda)\, y_j$$

where $(x_i, y_i)$ and $(x_j, y_j)$ are two examples drawn at random from the training data, and $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ with $\alpha \in (0, \infty)$
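A minimal numpy sketch of this construction, assuming one-hot labels and a single $\lambda$ shared across the minibatch; `mixup_batch` and its defaults are illustrative, not the authors' reference implementation:

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Convexly combine a minibatch with a shuffled copy of itself:
    x_mix = λ·x + (1-λ)·x[perm], and likewise for the one-hot labels."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)      # λ ~ Beta(α, α)
    perm = rng.permutation(len(x))    # random partner j for each example i
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    return x_mix, y_mix

# Example: mix a batch of 4 two-dimensional inputs with 3-class one-hot labels.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))
y = np.eye(3)[rng.integers(0, 3, size=4)]
x_mix, y_mix = mixup_batch(x, y, alpha=0.4, rng=rng)
```

Pairing each example with another one from the same shuffled minibatch, rather than loading a second minibatch, follows the strategy described in the mixup paper.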
Effects of applying mixup: the training distribution is extended with the prior knowledge that linear interpolations of feature vectors should lead to linear interpolations of the associated targets, regularizing the network to favour simple linear behaviour in-between training examples
In supervised learning, we minimize the average of the loss function $\ell$ over the data distribution $P$, also known as the expected risk:

$$R(f) = \int \ell(f(x), y)\, dP(x, y)$$
The distribution $P$ is unknown in most practical situations. Using the training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, we may approximate $P$ by the empirical distribution:

$$P_\delta(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta(x = x_i, y = y_i)$$

where $\delta(x = x_i, y = y_i)$ is a Dirac mass centered at $(x_i, y_i)$
Using the empirical distribution $P_\delta$, we can now approximate the expected risk by the empirical risk:

$$R_\delta(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$$
→ Learning the function $f$ by minimizing $R_\delta(f)$ is known as the Empirical Risk Minimization (ERM) principle (Vapnik, 1998)
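To make the definition concrete, a self-contained sketch that evaluates $R_\delta(f)$ for a toy linear predictor under the squared loss (all names here are illustrative):

```python
import numpy as np

def empirical_risk(f, xs, ys, loss):
    """R_delta(f): the average of loss(f(x_i), y_i) over the n training pairs."""
    return np.mean([loss(f(x), y) for x, y in zip(xs, ys)])

# Toy setup: noisy linear data, a linear predictor, and the squared loss.
rng = np.random.default_rng(0)
xs = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
ys = xs @ w_true + 0.1 * rng.normal(size=100)

f = lambda x: x @ w_true                      # candidate predictor f
sq_loss = lambda y_hat, y: (y_hat - y) ** 2   # loss(f(x), y)

print(empirical_risk(f, xs, ys, sq_loss))     # ≈ 0.01, the noise variance
```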
While efficient to compute, the empirical risk $R_\delta(f)$ monitors the behaviour of $f$ only at a finite set of $n$ examples.
When considering functions with a number of parameters comparable to $n$ (e.g. large neural networks),
→ minimizing $R_\delta(f)$ can be achieved simply by memorizing the training data (Zhang et al., 2017) ⇒ leads to undesirable behaviour of $f$ outside the training data (Szegedy et al., 2014)
There are other options to approximate the true distribution $P$
In the Vicinal Risk Minimization (VRM) principle (Chapelle et al., 2000), the distribution $P$ is approximated by

$$P_\nu(\tilde{x}, \tilde{y}) = \frac{1}{n} \sum_{i=1}^{n} \nu(\tilde{x}, \tilde{y} \mid x_i, y_i)$$

where $\nu$ is a vicinity distribution that measures the probability of finding the virtual feature-target pair $(\tilde{x}, \tilde{y})$ in the vicinity of the training feature-target pair $(x_i, y_i)$
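For reference, the mixup paper notes that the Gaussian vicinity $\nu(\tilde{x}, \tilde{y} \mid x_i, y_i) = \mathcal{N}(\tilde{x} - x_i, \sigma^2)\,\delta(\tilde{y} = y_i)$ recovers additive Gaussian-noise augmentation, while mixup itself corresponds to the generic vicinal distribution

$$\mu(\tilde{x}, \tilde{y} \mid x_i, y_i) = \frac{1}{n} \sum_{j=1}^{n} \mathbb{E}_{\lambda}\!\left[\delta\!\left(\tilde{x} = \lambda\, x_i + (1 - \lambda)\, x_j,\ \tilde{y} = \lambda\, y_i + (1 - \lambda)\, y_j\right)\right], \qquad \lambda \sim \mathrm{Beta}(\alpha, \alpha)$$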