Problems of Large Deep Neural Networks
⇒ Solution: mixup
Shared commonalities of successful large deep neural networks:
⇒ they are trained to minimize their average error over the training data (the ERM principle)
⇒ their size scales linearly with the number of training examples
⇒ Contradiction!
Classical result in learning theory (Vapnik & Chervonenkis, 1971):
the convergence of ERM is guaranteed as long as the size of the learning machine does not increase with the number of training data.
Challenges to the suitability of ERM:
ERM allows large neural networks to memorize the training data even in the presence of strong regularization (Zhang et al., 2017)
Neural networks trained with ERM change their predictions drastically when evaluated on adversarial examples (Szegedy et al., 2014)
→ ERM might be unable to explain or provide generalization on testing distributions that differ only slightly from the training data
How can we train on examples similar to, but different from, the training data? → Data Augmentation
Additional virtual examples can be drawn from the vicinity distribution of the training examples to enlarge the support of the training distribution
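As a concrete instance of sampling from a vicinity distribution, here is a minimal sketch of a Gaussian vicinity (additive-noise augmentation, as considered by Chapelle et al., 2000); the function name and parameters are illustrative:

```python
import numpy as np

def gaussian_vicinity(x, y, sigma=0.1, k=4, rng=None):
    """Draw k virtual examples from a Gaussian vicinity of (x, y):
    perturb the features with additive noise and keep the label fixed,
    so the vicinity of an example stays within its own class."""
    if rng is None:
        rng = np.random.default_rng()
    x_virtual = x + sigma * rng.normal(size=(k,) + np.shape(x))
    y_virtual = np.broadcast_to(y, (k,) + np.shape(y)).copy()
    return x_virtual, y_virtual
```

Note that the noise scale `sigma` must be tuned per dataset, and the label is simply copied; this foreshadows the two shortcomings listed next.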
Shortcomings of conventional data augmentation methods:
⇒ the procedure is dataset-dependent, and thus requires the use of expert knowledge
⇒ does not model the vicinity relation across examples of different classes
Simple and Data-agnostic Data Augmentation routine: mixup
mixup constructs virtual training examples as convex combinations of pairs of examples and their labels:

$$\tilde{x} = \lambda\, x_i + (1 - \lambda)\, x_j, \qquad \tilde{y} = \lambda\, y_i + (1 - \lambda)\, y_j$$

where $(x_i, y_i)$ and $(x_j, y_j)$ are two examples drawn at random from the training data, and $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ with $\alpha \in (0, \infty)$
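A minimal numpy sketch of this construction, assuming one-hot labels and a single $\lambda$ shared across the minibatch; `mixup_batch` and its defaults are illustrative, not the authors' reference implementation:

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Convexly combine a minibatch with a shuffled copy of itself:
    x_mix = λ·x + (1-λ)·x[perm], and likewise for the one-hot labels."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)      # λ ~ Beta(α, α)
    perm = rng.permutation(len(x))    # random partner j for each example i
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    return x_mix, y_mix

# Example: mix a batch of 4 two-dimensional inputs with 3-class one-hot labels.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))
y = np.eye(3)[rng.integers(0, 3, size=4)]
x_mix, y_mix = mixup_batch(x, y, alpha=0.4, rng=rng)
```

Pairing each example with another one from the same shuffled minibatch, rather than loading a second minibatch, follows the strategy described in the mixup paper.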
Effects of applying mixup: the training distribution is extended with the prior knowledge that linear interpolations of feature vectors should lead to linear interpolations of the associated targets, regularizing the network to favour simple linear behaviour in-between training examples
In supervised learning, we minimize the average of the loss function $\ell$ over the data distribution $P$, also known as the expected risk:

$$R(f) = \int \ell(f(x), y)\, dP(x, y)$$
The distribution $P$ is unknown in most practical situations. Using the training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, we may approximate $P$ by the empirical distribution:

$$P_\delta(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta(x = x_i, y = y_i)$$

where $\delta(x = x_i, y = y_i)$ is a Dirac mass centered at $(x_i, y_i)$
Using the empirical distribution $P_\delta$, we can now approximate the expected risk by the empirical risk:

$$R_\delta(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$$
→ Learning the function $f$ by minimizing $R_\delta(f)$ is known as the Empirical Risk Minimization (ERM) principle (Vapnik, 1998)
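To make the definition concrete, a self-contained sketch that evaluates $R_\delta(f)$ for a toy linear predictor under the squared loss (all names here are illustrative):

```python
import numpy as np

def empirical_risk(f, xs, ys, loss):
    """R_delta(f): the average of loss(f(x_i), y_i) over the n training pairs."""
    return np.mean([loss(f(x), y) for x, y in zip(xs, ys)])

# Toy setup: noisy linear data, a linear predictor, and the squared loss.
rng = np.random.default_rng(0)
xs = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
ys = xs @ w_true + 0.1 * rng.normal(size=100)

f = lambda x: x @ w_true                      # candidate predictor f
sq_loss = lambda y_hat, y: (y_hat - y) ** 2   # loss(f(x), y)

print(empirical_risk(f, xs, ys, sq_loss))     # ≈ 0.01, the noise variance
```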
While efficient to compute, the empirical risk $R_\delta(f)$ monitors the behaviour of $f$ only at a finite set of $n$ examples.
When considering functions with a number of parameters comparable to $n$ (e.g. large neural networks),
→ minimizing $R_\delta(f)$ can be achieved simply by memorizing the training data (Zhang et al., 2017) ⇒ leads to undesirable behaviour of $f$ outside the training data (Szegedy et al., 2014)
There are other options to approximate the true distribution $P$
In the Vicinal Risk Minimization (VRM) principle (Chapelle et al., 2000), the distribution $P$ is approximated by

$$P_\nu(\tilde{x}, \tilde{y}) = \frac{1}{n} \sum_{i=1}^{n} \nu(\tilde{x}, \tilde{y} \mid x_i, y_i)$$

where $\nu$ is a vicinity distribution that measures the probability of finding the virtual feature-target pair $(\tilde{x}, \tilde{y})$ in the vicinity of the training feature-target pair $(x_i, y_i)$
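For reference, the mixup paper notes that the Gaussian vicinity $\nu(\tilde{x}, \tilde{y} \mid x_i, y_i) = \mathcal{N}(\tilde{x} - x_i, \sigma^2)\,\delta(\tilde{y} = y_i)$ recovers additive Gaussian-noise augmentation, while mixup itself corresponds to the generic vicinal distribution

$$\mu(\tilde{x}, \tilde{y} \mid x_i, y_i) = \frac{1}{n} \sum_{j=1}^{n} \mathbb{E}_{\lambda}\!\left[\delta\!\left(\tilde{x} = \lambda\, x_i + (1 - \lambda)\, x_j,\ \tilde{y} = \lambda\, y_i + (1 - \lambda)\, y_j\right)\right], \qquad \lambda \sim \mathrm{Beta}(\alpha, \alpha)$$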