Mathematics behind Support Vector Machine

Rana singh
6 min read · Oct 3, 2019


The support vector machine works well for both regression and classification problems and was very popular in the late 1990s.

There are many hyperplanes that separate the positive and negative points, but we need to choose the optimal one. From logistic regression, we know that a point very close to the hyperplane has a class probability near 0.5, while points far away from the plane have probabilities close to 1. So we want the hyperplane that separates the positive points from the negative points as widely as possible; the optimal hyperplane lies midway between the closest points of the two classes, like a perpendicular bisector of the gap between them.

W can be any vector; it need not be a unit vector.

Margin maximizing hyperplane:

Let's draw two planes parallel to the separating hyperplane π, passing through the nearest positive and negative points on either side of it; call them π+ and π−. π+ and π− are parallel to π and to each other. The distance between π+ and π− is called the margin. The aim of SVM is to find the hyperplane that maximizes the margin; the larger the margin, the better the accuracy tends to be on unseen data.
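In standard notation (assuming the usual scaling of w and b so that π+ and π− pass through the closest points), the three planes and the margin can be written as:

```
\pi:\ w^{\top}x + b = 0, \qquad
\pi^{+}:\ w^{\top}x + b = +1, \qquad
\pi^{-}:\ w^{\top}x + b = -1, \qquad
\text{margin} = \frac{2}{\lVert w \rVert}
```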

Support vector:

The points through which π+ and π− pass are called support vectors.

Mathematical formulation:

Hard margin classification

We want to find W and b by maximizing the margin.
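The original figure with the formula is not reproduced here; the standard hard-margin program (equivalent to maximizing the margin 2/||w||) is:

```
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w^{\top}x_i + b) \ge 1 \quad \forall i
```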

The above formulation is called hard-margin SVM, and it requires the data to be linearly separable; no misclassification errors are allowed.

Soft margin classification:

Here we trade off maximizing the margin against minimizing the average slack (ζ), where n is the number of points. The slack ζ_i measures how far a point lies on the wrong side of its margin plane; whenever ζ_i > 0, the point violates the margin (or is misclassified).
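Written out in the standard soft-margin form (with slack variables ζ_i, matching the description above):

```
\min_{w,\,b,\,\zeta}\ \frac{1}{2}\lVert w \rVert^{2} + \frac{C}{n}\sum_{i=1}^{n}\zeta_i
\quad \text{subject to} \quad
y_i\,(w^{\top}x_i + b) \ge 1 - \zeta_i,\quad \zeta_i \ge 0 \quad \forall i
```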

  1. C is a hyperparameter and is always positive. If C increases, you give more importance to not making mistakes, so there is a tendency to overfit.
  2. If C decreases, you give more importance to the margin width, so there is a tendency to underfit.

C behaves exactly opposite to λ in logistic regression: C roughly plays the role of 1/λ.

Loss:

Here z = y·f(x), with f(x) = Wᵀx + b, decides the loss value. If y > 0 and f(x) > 0 then z > 0, meaning the point is correctly classified and the loss is zero (once z ≥ 1). If y < 0 and f(x) > 0 then z < 0, meaning the point is incorrectly classified and it incurs a loss (a count of 1 under 0–1 loss).

Hinge loss can be written as

max(0, 1 − z), which is exactly the slack ζ (zeta) introduced above.
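As a small illustrative sketch (NumPy, not from the original post; the toy data and weights are made up), the hinge loss over a dataset is:

```
import numpy as np

def hinge_loss(w, b, X, y):
    """Average hinge loss max(0, 1 - z) with z = y * (X @ w + b)."""
    z = y * (X @ w + b)                  # signed margin for each point
    return np.mean(np.maximum(0.0, 1.0 - z))

# Toy example: one point well inside the correct side, one violating the margin.
X = np.array([[2.0, 0.0], [0.1, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])
print(hinge_loss(w, b=0.0, X=X, y=y))    # losses 0 and 1.1 -> mean 0.55
```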

The dual form of SVM formulation
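The figure with the dual is not reproduced here; the standard soft-margin dual (where the kernel trick replaces x_iᵀx_j with K(x_i, x_j)) is:

```
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
- \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\, y_i y_j\, x_i^{\top}x_j
\quad \text{subject to} \quad
\sum_{i=1}^{n}\alpha_i y_i = 0,\quad 0 \le \alpha_i \le C
```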

In the dual solution, α_i > 0 for the support vectors and α_i = 0 for the non-support vectors.

Kernal trick:

If you do not apply the kernel trick, the model is called a linear SVM; a linear SVM finds the margin-maximizing hyperplane directly in the space of the x_i.

In the case of kernelization, we take data in a d-dimensional space and the kernel implicitly performs a feature transform, converting it into a d′-dimensional space where d′ > d.

Finding the right kernel is the important factor in SVM, since the feature transform is done internally during kernelization rather than by hand. With the quadratic polynomial kernel, after expanding (1 + aᵀb)², we get a particular feature transform internally, as shown below. So the task is to find the right kernel.
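For example, for a, b ∈ R² the standard expansion makes the implicit feature transform explicit:

```
K(a, b) = (1 + a^{\top}b)^{2}
        = 1 + 2a_1 b_1 + 2a_2 b_2 + a_1^{2}b_1^{2} + a_2^{2}b_2^{2} + 2a_1 a_2\, b_1 b_2
        = \phi(a)^{\top}\phi(b),
\qquad
\phi(a) = \bigl(1,\ \sqrt{2}\,a_1,\ \sqrt{2}\,a_2,\ a_1^{2},\ a_2^{2},\ \sqrt{2}\,a_1 a_2\bigr)
```

So a 2-dimensional x is implicitly mapped to a 6-dimensional φ(x) without ever computing φ(x) explicitly.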

RBF-Kernel

The RBF kernel is K(a, b) = exp(−‖a − b‖² / (2σ²)), where σ is a hyperparameter. As the distance between a and b increases, K(a, b) decreases. For example, K(x1, x2) > K(x1, x3) when x3 is farther from x1 than x2 is.

A kernel is like a similarity function, just as a distance is a dissimilarity function.

In the plot (not reproduced here), the x-axis is the distance and the y-axis is the kernel value. With σ = 1, only points within a narrow band of distances have a similar (high) kernel value; the rest look dissimilar. For a fixed σ, the kernel value decreases as the distance increases, as the sketch below illustrates.
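A tiny NumPy sketch (my own illustration, with made-up points) of how the distance and σ affect the kernel value:

```
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """RBF kernel K(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

x1, x2, x3 = np.array([0.0]), np.array([1.0]), np.array([3.0])
print(rbf_kernel(x1, x2))             # closer pair  -> higher similarity (~0.61)
print(rbf_kernel(x1, x3))             # farther pair -> lower similarity  (~0.01)
print(rbf_kernel(x1, x3, sigma=5.0))  # larger sigma -> distant points still look similar (~0.84)
```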

Note: There are many other kernels: string kernels for text problems, genome kernels for bioinformatics problems, and graph kernels for graph problems. Always choose the right kernel in place of an explicit feature transform.

Training SVM

SVM can be trained using SGD, but there is a special algorithm usually used instead: Sequential Minimal Optimization (SMO). The libsvm library, which implements SMO, is commonly used to train SVMs.
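A minimal scikit-learn sketch (SVC is backed by libsvm, which uses an SMO-style solver); the synthetic dataset and parameter values are only illustrative:

```
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative synthetic data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # RBF-kernel SVM
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("number of support vectors:", clf.n_support_.sum())
```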

The training time complexity is roughly O(n²) for kernel SVM, which means that if n is large, we usually don't use kernel SVM.

Another concept used in SVM is nu-SVM, where the value of nu lies between 0 and 1. nu is always greater than or equal to the fraction of errors and less than or equal to the fraction of support vectors:

the fraction of errors≤ nu ≤ fraction of Support vectors

Example: nu = 0.01 and n = 100,000 means the fraction of errors is at most 1% and the number of support vectors is at least 1,000.
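In scikit-learn this corresponds to NuSVC; a minimal sketch (synthetic data, and nu = 0.05 chosen only for illustration):

```
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# nu upper-bounds the fraction of margin errors and lower-bounds the fraction of SVs.
clf = NuSVC(nu=0.05, kernel="rbf", gamma="scale").fit(X, y)
print("fraction of support vectors:", clf.n_support_.sum() / len(X))
```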

Support vector regression:

The key hyperparameter is epsilon (ε). We want the error on each point to be within ε, i.e. |y − y′| < ε. You can also use a kernel here; RBF-SVR behaves much like k-NN regression.

  1. If ε is low, the error on training data decreases and the chance of overfitting increases.
  2. If ε is high, the error on training data increases and the chance of underfitting increases.

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html
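A minimal usage sketch of the SVR class linked above (the data is synthetic and the C/epsilon values are illustrative):

```
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

# Larger epsilon -> wider "tube" around the fit, fewer support vectors, risk of underfitting.
# Smaller epsilon -> tighter tube, more support vectors, risk of overfitting.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)
svr.fit(X, y)
print("support vectors used:", len(svr.support_))
```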

Critical points in SVM

Feature engineering and feature transformation: it is all about finding the right kernel for the given problem. If no problem-specific kernel is available, use the RBF kernel.

Decision surface:

  1. Linear SVM: the decision surface is a linear hyperplane.
  2. Kernel SVM: in the original d-dimensional space the classes are separated by a non-linear surface, but after kernelization into the d′-dimensional space the decision surface is again a linear hyperplane.

Interpretability and feature importance:

For linear SVM we can easily get feature importance, but for kernel SVM, we use a forward feature selection method to get feature importance.

Outliers: they have minimal impact on SVM, because the decision boundary depends only on the support vectors.

Bias variance:

  1. Increasing C tends to overfit the model (high variance).
  2. Decreasing C tends to underfit the model (high bias).

In the case of the RBF kernel we have two hyperparameters: 1) C and 2) σ (exposed as gamma in scikit-learn). Use grid search or random search to find them, as sketched below.
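A grid-search sketch over C and gamma (scikit-learn's gamma plays the role of 1/(2σ²); the data and grid values are only illustrative):

```
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```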

Large d: this is fine for SVM, because SVM itself implicitly projects d into a higher-dimensional d′. Use the RBF kernel if you don't have a problem-specific one.

Large n: the training time is typically very high; this is the worst case for SVM.

Large k (number of support vectors): not good for SVM, because prediction cost grows with k; a small number of support vectors is preferable.

===============Thanks=================

Reference:

Google image

Applied AI

http://cs229.stanford.edu/notes/cs229-notes3.pdf

https://www.youtube.com/watch?v=kgu53eRFERc&feature=youtu.be
