
Olena Shmahalo/Quanta Magazine
In the machine learning world, the sizes of artificial neural networks — and their outsize successes — are creating conceptual conundrums. When a network named AlexNet won an annual image recognition competition in 2012, it had about 60 million parameters. These parameters, fine-tuned during training, allowed AlexNet to recognize images that it had never seen before. Two years later, a network named VGG wowed the competition with more than 130 million such parameters. Some artificial neural networks, or ANNs, now have billions of parameters.
These massive networks — astoundingly successful at tasks such as classifying images, recognizing speech and translating text from one language to another — have begun to dominate machine learning and artificial intelligence. Yet they remain enigmatic. The reason behind their amazing power remains elusive.
But a number of researchers are showing that idealized versions of these powerful networks are mathematically equivalent to older, simpler machine learning models called kernel machines. If this equivalence can be extended beyond idealized neural networks, it may explain how practical ANNs achieve their astonishing results.
Part of the mystique of artificial neural networks is that they seem to subvert traditional machine learning theory, which leans heavily on ideas from statistics and probability theory. In the usual way of thinking, machine learning models — including neural networks, trained to learn about patterns in sample data in order to make predictions about new data — work best when they have just the right number of parameters.
If the parameters are too few, the learned model can be too simple and fail to capture all the nuances of the data it’s trained on. Too many and the model becomes overly complex, learning the patterns in the training data with such fine granularity that it cannot generalize when asked to classify new data, a phenomenon called overfitting. “It’s a balance between somehow fitting your data too well and not fitting it well at all. You want to be in the middle,” said Mikhail Belkin, a machine learning researcher at the University of California, San Diego.
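The trade-off Belkin describes can be seen in a toy experiment. The sketch below is only an illustration under stated assumptions: the data points are made up, and the three models (a constant, a straight line and a polynomial forced through every training point) are stand-ins chosen here for "too few," "about right" and "too many" parameters.

```python
def fit_constant(xs, ys):
    """One parameter: always predict the average training value."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Two parameters: the least-squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return lambda x: mean_y + slope * (x - mean_x)

def fit_interpolant(xs, ys):
    """len(xs) parameters: the polynomial that passes through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)  # Lagrange basis factor
            total += term
        return total
    return predict

def mean_abs_error(model, xs, ys):
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)

# Made-up training data: the underlying trend is y = x, plus alternating noise of 0.5.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.5, 0.5, 2.5, 2.5, 4.5, 4.5]
# New data from the same trend, without noise.
test_x = [0.5, 2.5, 4.5]
test_y = [0.5, 2.5, 4.5]

const_model = fit_constant(train_x, train_y)
line_model = fit_line(train_x, train_y)
interp_model = fit_interpolant(train_x, train_y)

for name, model in [("constant", const_model), ("line", line_model),
                    ("interpolant", interp_model)]:
    print(name,
          round(mean_abs_error(model, train_x, train_y), 3),
          round(mean_abs_error(model, test_x, test_y), 3))
```

In this toy setup the constant model misses the trend altogether, while the interpolating polynomial matches the training data exactly and then swings away from the trend on the new points. The line lands in the middle.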
Mikhail Belkin of the University of California, San Diego, is excited about the potential of kernel machines to help explain the remarkable success of artificial neural networks.

By all accounts, deep neural networks like VGG have way too many parameters and should overfit. But they don’t. Instead, such networks generalize astoundingly well to new data — and until recently, no one knew why. It wasn’t for lack of trying. For example, Naftali Tishby, a computer scientist and neuroscientist at the Hebrew University of Jerusalem who died in August, argued that deep neural networks first fit the training data and then discard irrelevant information (by going through an information bottleneck), which helps them generalize. But others have argued that this doesn’t happen in all types of deep neural networks, and the idea remains controversial.
Now, the mathematical equivalence of kernel machines and idealized neural networks is providing clues to why or how these over-parameterized networks arrive at (or converge to) their solutions. Kernel machines are algorithms that find patterns in data by projecting the data into extremely high dimensions. By studying the mathematically tractable kernel equivalents of idealized neural networks, researchers are learning why deep nets, despite their shocking complexity, converge during training to solutions that generalize well to unseen data.
“A neural network is a little bit like a Rube Goldberg machine. You don’t know which part of it is really important,” said Belkin. “I think reducing [them] to kernel methods — because kernel methods don’t have all this complexity — somehow allows us to isolate the engine of what’s going on.”
Kernel methods, or kernel machines, rely on an area of mathematics with a long history. It goes back to the 19th-century German mathematician Carl Friedrich Gauss, who came up with the eponymous Gaussian kernel, which maps a variable x to a function with the familiar shape of a bell curve. The modern usage of kernels took off in the early 20th century, when the English mathematician James Mercer used them for solving integral equations. By the 1960s, kernels were being used in machine learning to tackle data that was not amenable to simple techniques of classification.
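For readers who want to see the bell curve concretely, here is a minimal sketch of a Gaussian kernel in one dimension. The text above gives no formula, so the unnormalized form used here, along with the `center` and `sigma` (width) parameters, is an assumption chosen for illustration.

```python
import math

def gaussian_kernel(x, center=0.0, sigma=1.0):
    """Bell-curve similarity between x and center; sigma sets the width of the bell."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

# The value peaks at 1 when x equals the center and falls off symmetrically on both sides.
values = [gaussian_kernel(x) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```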
Understanding kernel methods requires starting with algorithms in machine learning called linear classifiers. Let’s say that cats and dogs can be classified using data in only two dimensions, meaning that you need two features (say the size of the snout, which we can plot on the x-axis, and the size of the ears, which goes on the y-axis) to tell the two types of animals apart. Plot this labeled data on the xy-plane, and cats should be in one cluster and dogs in another.
One can then train a linear classifier using the labeled data to find a straight line that separates the two clusters. This involves finding the coefficients of the equation representing the line. Now, given new unlabeled data, it’s easy to classify it as a dog or a cat by seeing which side of the line it falls on.
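As a rough sketch of what "finding the coefficients" can look like, the code below trains a perceptron, one of the simplest linear classifiers. The perceptron is a choice made here for illustration, not a method named in the text, and the snout and ear measurements are invented so that the two clusters can be split by a straight line.

```python
def train_perceptron(points, labels, epochs=100, lr=1.0):
    """Find w1, w2, b so that the sign of w1*x + w2*y + b matches the labels (+1 or -1)."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x, y), label in zip(points, labels):
            if label * (w1 * x + w2 * y + b) <= 0:  # wrong side of the line, or on it
                w1 += lr * label * x
                w2 += lr * label * y
                b += lr * label
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return w1, w2, b

def classify(point, w1, w2, b):
    """Label a new point by the side of the line it falls on."""
    x, y = point
    return "dog" if w1 * x + w2 * y + b > 0 else "cat"

# Hypothetical (snout size, ear size) measurements; the numbers are made up.
cats = [(2.0, 5.0), (2.5, 6.0), (3.0, 5.5), (1.5, 4.5)]
dogs = [(6.0, 3.0), (7.0, 2.5), (8.0, 4.0), (6.5, 2.0)]
points = cats + dogs
labels = [-1] * len(cats) + [+1] * len(dogs)

w1, w2, b = train_perceptron(points, labels)
print(classify((7.2, 3.0), w1, w2, b))  # a new, unlabeled animal
```

The three numbers `w1`, `w2` and `b` are the coefficients of the line w1·x + w2·y + b = 0. Classifying a new animal only requires checking the sign of that expression.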
Dog and cat lovers, however, would be aghast at such oversimplification. Actual data about the snouts and ears of the many types of cats and dogs almost certainly can’t be divided by a linear separator. In such situations, when the data is linearly inseparable, it can be transformed or projected into a higher-dimensional space. (One simple way to do this would be to multiply the values of two features to create a third; maybe there is something about the correlation between the sizes of the snouts and ears that separates dogs from cats.)
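The trick of multiplying two features to create a third can be sketched in a few lines. The data below is invented and arranged in the classic "XOR" pattern, where no straight line in the plane can separate the two groups. The function name `lift_product` is hypothetical.

```python
def lift_product(point):
    """Map (x, y) to (x, y, x*y): the third feature is the product of the first two."""
    x, y = point
    return (x, y, x * y)

# Hypothetical centered measurements (deviation from the average snout and ear size).
# Group A has both features above average or both below; group B has one of each.
group_a = [(1.0, 1.0), (-1.0, -1.0), (2.0, 1.5), (-1.5, -2.0)]
group_b = [(1.0, -1.0), (-1.0, 1.0), (2.0, -1.5), (-1.5, 2.0)]

# In three dimensions, the flat plane z = 0 separates the groups:
# every lifted point of group A sits above it, every lifted point of group B below.
a_above = [lift_product(p)[2] > 0 for p in group_a]
b_below = [lift_product(p)[2] < 0 for p in group_b]
print(all(a_above), all(b_below))
```

In the original two dimensions the groups sit in diagonally opposite quadrants, so any line that puts both corners of one group on the same side also captures part of the other group. The extra product feature pulls the groups apart along a new axis.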
More generally, looking at the data in higher-dimensional space makes it easier to find a linear separator, known as a hyperplane when the space has more than three dimensions. When this hyperplane is projected back to the lower dimensions, it’ll take the shape of a nonlinear function with curves and wiggles that separates the original lower-dimensional data into two clusters.
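To see how a flat separator in the higher-dimensional space turns into a curve in the original one, consider this small sketch. The lifting map (adding x² + y² as a third feature) and the particular plane are assumptions picked for illustration. Any other nonlinear feature would give a different curve.

```python
import math

def lift_radial(point):
    """Map (x, y) to (x, y, x^2 + y^2)."""
    x, y = point
    return (x, y, x * x + y * y)

# A hyperplane in the lifted space, w . (x, y, z) + b = 0, with w = (0, 0, 1) and b = -4.
# In other words, the flat plane z = 4.
W = (0.0, 0.0, 1.0)
B = -4.0

def side(point):
    """Return +1 or -1 depending on which side of the plane the lifted point falls."""
    score = sum(w_i * f_i for w_i, f_i in zip(W, lift_radial(point))) + B
    return 1 if score > 0 else -1

# Projected back to two dimensions, the boundary is x^2 + y^2 = 4: a circle of radius 2.
boundary = [(2 * math.cos(t), 2 * math.sin(t)) for t in (0.0, 1.0, 2.0, 3.0)]

print(side((0.5, 0.5)), side((3.0, 0.0)))  # one point inside the circle, one outside
```

The classifier itself never stops being linear. It only checks which side of a plane a lifted point falls on, yet in the original coordinates that test amounts to asking whether the point is inside or outside a circle.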
When we’re working with real data, though, it’s often computationally inefficient — and sometimes impossible — to find the coefficients of the hyperplane in high dimensions. But it isn’t for kernel machines.