SVM (also called a Maximum Margin Classifier) is an algorithm that takes labeled data as input and outputs a line/hyperplane that separates the classes, if possible.
Suppose that we need to separate two classes of a dataset. The task is to find a line that separates them. However, there are infinitely many lines that can do that. How can we choose the best one?
*Figure: the idea of support vectors (samples on the margin) and SVM (finding the optimal hyperplane).*
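A minimal sketch of this idea (assuming scikit-learn is installed, and using a tiny made-up 2-D dataset): fit a linear SVM and inspect the support vectors, i.e. the samples on the margin that define the optimal hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# A tiny linearly separable toy dataset (illustrative values only).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # class 0
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin (no misclassified training points).
clf = SVC(kernel='linear', C=1e6)
clf.fit(X, y)

# Only the support vectors determine the separating hyperplane w.x + b = 0.
print("support vectors:\n", clf.support_vectors_)
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
```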
Most of the time, the classes in a dataset cannot be separated easily because the data is not linearly separable. In that case, we use the kernel trick first (transform the data from the current dimension to a higher dimension) and then apply SVM.
*Figure: the idea of kernel and SVM, transforming from 1D to 2D. The data is not linearly separable in the input space, but it is linearly separable in the feature space obtained by a kernel.*
*Figure: the idea of kernel and SVM, transforming from 2D to 3D. The data is not linearly separable in the input space, but it is linearly separable in the feature space obtained by a kernel.*
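The 1D-to-2D transformation can be sketched with plain numpy (the points and the explicit map $\Phi(x) = (x, x^2)$ are illustrative choices, not from the original): four points on a line whose labels depend on $|x|$ cannot be split by a single threshold in 1D, but after lifting to 2D a horizontal line separates them.

```python
import numpy as np

# 1-D points: the outer points (|x| = 2) are one class, the inner ones the other.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([1, 0, 0, 1])  # not separable by any single threshold on x

# Lift to 2-D with the explicit feature map phi(x) = (x, x^2).
phi = np.column_stack([x, x ** 2])

# In the feature space, the horizontal line x2 = 2.5 separates the classes.
pred = (phi[:, 1] > 2.5).astype(int)
print(pred)
```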
A kernel is a dot product in some feature space:
$$ K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i)^\top \Phi(\mathbf{x}_j), $$
where $\Phi$ maps the input space into the feature space. A kernel also measures the similarity between two points $\mathbf{x}_i$ and $\mathbf{x}_j$.
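This identity can be checked numerically for the homogeneous quadratic kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z})^2$ in 2D, whose explicit feature map is $\Phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ (a standard textbook example; the sample points are made up):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the 2-D homogeneous quadratic kernel.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, 4.0])

k_direct = np.dot(x_i, x_j) ** 2        # kernel evaluated in the input space
k_feature = np.dot(phi(x_i), phi(x_j))  # dot product in the feature space

# Both give the same value: the kernel computes the feature-space dot
# product without ever constructing phi explicitly (the "kernel trick").
print(k_direct, k_feature)  # → 121.0 121.0
```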
Some popular kernels:

- `kernel='linear'` in `sklearn.svm.SVC`. Linear kernels are rarely used in practice.
- `kernel='rbf'` (the default) with keyword `gamma` for $\gamma$ (must be greater than 0) in `sklearn.svm.SVC`.
- `kernel='poly'` with keywords `degree` for $d$ and `coef0` for $r$ in `sklearn.svm.SVC`. It's more popular than RBF in NLP. The most common degree is $d = 2$ (quadratic), since larger degrees tend to overfit on NLP problems. (ref)
- `kernel='sigmoid'` with keyword `coef0` for $r$ in `sklearn.svm.SVC`.
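A minimal sketch comparing these kernels (assuming scikit-learn is installed) on data that is not linearly separable, here two concentric circles from `sklearn.datasets.make_circles`: the linear kernel fails, while the default RBF kernel separates the classes.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the 2-D input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

scores = {}
for kernel in ['linear', 'rbf', 'poly', 'sigmoid']:
    # degree and coef0 only affect 'poly' / 'sigmoid'; other kernels ignore them.
    clf = SVC(kernel=kernel, degree=2, coef0=0.0)
    clf.fit(X, y)
    scores[kernel] = clf.score(X, y)
    print(kernel, scores[kernel])
```

The RBF kernel reaches near-perfect training accuracy on this data, while the linear kernel stays close to chance level.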