SVM (also called a Maximum Margin Classifier) is an algorithm that takes labeled data as input and outputs a line/hyperplane that separates the classes, if possible.
Suppose that we need to separate two classes of a dataset. The task is to find a line that separates them. However, there are infinitely many lines that can do that. How can we choose the best one?
*Figure: the idea of support vectors (samples on the margin) and SVM (finding the optimal hyperplane).*
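A minimal sketch of this idea (assuming scikit-learn is installed, and using a tiny made-up 2-D dataset): fit a linear SVM and inspect the support vectors, i.e. the samples on the margin that define the optimal hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# A tiny linearly separable toy dataset (illustrative values only).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # class 0
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin (no misclassified training points).
clf = SVC(kernel='linear', C=1e6)
clf.fit(X, y)

# Only the support vectors determine the separating hyperplane w.x + b = 0.
print("support vectors:\n", clf.support_vectors_)
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
```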
Most of the time, the classes in a dataset cannot be separated easily because the data is not linearly separable. In that case, we use the kernel trick first (transform the data from the current dimension to a higher dimension) and then apply SVM.
*Figure: the idea of kernel and SVM, transforming from 1D to 2D. The data is not linearly separable in the input space, but it is linearly separable in the feature space obtained by a kernel.*
*Figure: the idea of kernel and SVM, transforming from 2D to 3D. The data is not linearly separable in the input space, but it is linearly separable in the feature space obtained by a kernel.*
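The 1D-to-2D transformation can be sketched with plain numpy (the points and the explicit map $\Phi(x) = (x, x^2)$ are illustrative choices, not from the original): four points on a line whose labels depend on $|x|$ cannot be split by a single threshold in 1D, but after lifting to 2D a horizontal line separates them.

```python
import numpy as np

# 1-D points: the outer points (|x| = 2) are one class, the inner ones the other.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([1, 0, 0, 1])  # not separable by any single threshold on x

# Lift to 2-D with the explicit feature map phi(x) = (x, x^2).
phi = np.column_stack([x, x ** 2])

# In the feature space, the horizontal line x2 = 2.5 separates the classes.
pred = (phi[:, 1] > 2.5).astype(int)
print(pred)
```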
A kernel is a dot product in some feature space:
$$ K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i)^\top \Phi(\mathbf{x}_j), $$
where $\Phi$ maps the input space into the feature space. A kernel also measures the similarity between two points $\mathbf{x}_i$ and $\mathbf{x}_j$.
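This identity can be checked numerically for the homogeneous quadratic kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z})^2$ in 2D, whose explicit feature map is $\Phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ (a standard textbook example; the sample points are made up):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the 2-D homogeneous quadratic kernel.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, 4.0])

k_direct = np.dot(x_i, x_j) ** 2        # kernel evaluated in the input space
k_feature = np.dot(phi(x_i), phi(x_j))  # dot product in the feature space

# Both give the same value: the kernel computes the feature-space dot
# product without ever constructing phi explicitly (the "kernel trick").
print(k_direct, k_feature)  # → 121.0 121.0
```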
Some popular kernels:

- `kernel='linear'` in `sklearn.svm.SVC`. Linear kernels are rarely used in practice.
- `kernel='rbf'` (the default) with keyword `gamma` for $\gamma$ (must be greater than 0) in `sklearn.svm.SVC`.
- `kernel='poly'` with keywords `degree` for $d$ and `coef0` for $r$ in `sklearn.svm.SVC`. It's more popular than RBF in NLP. The most common degree is $d = 2$ (quadratic), since larger degrees tend to overfit on NLP problems. (ref)
- `kernel='sigmoid'` with keyword `coef0` for $r$ in `sklearn.svm.SVC`.
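A minimal sketch comparing these kernels (assuming scikit-learn is installed) on data that is not linearly separable, here two concentric circles from `sklearn.datasets.make_circles`: the linear kernel fails, while the default RBF kernel separates the classes.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the 2-D input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

scores = {}
for kernel in ['linear', 'rbf', 'poly', 'sigmoid']:
    # degree and coef0 only affect 'poly' / 'sigmoid'; other kernels ignore them.
    clf = SVC(kernel=kernel, degree=2, coef0=0.0)
    clf.fit(X, y)
    scores[kernel] = clf.score(X, y)
    print(kernel, scores[kernel])
```

The RBF kernel reaches near-perfect training accuracy on this data, while the linear kernel stays close to chance level.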