While many classifiers, such as logistic regression, can handle linearly separable data, Support Vector Machines can also tackle highly non-linear problems using the kernel trick, which implicitly maps the input vectors to higher-dimensional feature spaces. The transformation rearranges the dataset so that it becomes linearly separable. In this article we will look at how SVM works, learn about kernel functions and hyperparameters, weigh the pros and cons of SVM, and go through some of its real-life applications.
Support Vector Machines (SVMs), also known as support vector networks, are a family of powerful models which use kernel method based learning and can be applied to both classification and regression problems. They aim at finding decision boundaries that separate observations with differing class memberships. In other words, an SVM is a discriminative classifier formally defined by a separating hyperplane.
Method Based Learning
There are several learning models namely:
- Association rules based
- Ensemble method based
- Deep Learning based
- Clustering method based
- Regression Analysis based
- Bayesian method based
- Dimensionality reduction based
- Instance based
- Kernel method based
Let us understand what Kernel method based learning is all about.
In simple terms, a kernel is a similarity function that is fed into a machine learning algorithm: it takes two inputs and returns how similar they are. For example, suppose we want to classify images and the input data is a set of (image, label) pairs. In the standard approach, features are computed from each image and the resulting feature vector is fed into the machine learning algorithm. With similarity functions, instead, a kernel function is defined that internally computes the similarity between images, and this is fed into the learning algorithm along with the label data. The outcome is again a classifier.
Perceptrons and Support Vector Machines can work with kernels because these algorithms can be expressed purely in terms of dot products between vectors, which is exactly where kernel functions can be substituted in.
Kernels are often preferred over explicit feature vectors: they are cheap to compute, and feature vectors need more storage space than the dot products do. A machine learning algorithm can be written to use dot products and later mapped to use kernels, which avoids explicit feature vectors entirely. This allows us to work with highly complex, efficient-to-compute, and yet high-performing kernels without ever constructing high-dimensional feature vectors.
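To make this concrete, here is a minimal sketch (written for this article with NumPy; the feature map phi and the example values are illustrative, not from the original text) showing that a degree-2 polynomial kernel computes exactly the dot product in the mapped feature space, without ever constructing the mapped vectors:
import numpy as np
# explicit feature map for the degree-2 polynomial kernel (no bias term):
# (v1, v2) -> (v1^2, v2^2, sqrt(2)*v1*v2)
def phi(v):
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])
x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
explicit = np.dot(phi(x), phi(y))   # dot product in the mapped feature space
kernel = np.dot(x, y) ** 2          # kernel trick: (x . y)^2 computed in input space
print(explicit, kernel)             # both print 121.0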
Kernel functions
Let us understand what kernel functions are with a simple 1-dimensional example. Assume the given points are arranged along a line as in the figure below: whichever threshold we pick appears as a vertical line, and no vertical line will cleanly separate the dataset.
Now, if we consider a 2-dimensional representation, as shown in the figure below, there is a hyperplane (in 2 dimensions, simply a line) that separates the red and blue points, and this is exactly the kind of separation a Support Vector Machine finds.
As we keep increasing the dimensionality of the space, it becomes easier to separate the data. The mapping used here, x -> (x, x²), is the feature map associated with the kernel function. As the dimensionality grows, the computations become more expensive, and the kernel trick is what lets us perform them cheaply.
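As a small illustrative sketch (added here; the specific points are made up for illustration), the following code shows 1-D points that no single threshold can separate becoming linearly separable after the mapping x -> (x, x²):
import numpy as np
import matplotlib.pyplot as plt
# class 0 sits between the two groups of class 1,
# so no single threshold on x separates them in 1-D
x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = np.array([1, 1, 0, 0, 0, 1, 1])
# mapping each point to (x, x^2) lifts it into 2-D,
# where the horizontal line x2 = 2 separates the two classes
X_mapped = np.c_[x, x ** 2]
plt.scatter(X_mapped[:, 0], X_mapped[:, 1], c=y, s=50, cmap='spring')
plt.axhline(2, linestyle='--', color='k')
plt.show()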
What is Support Vector Machine?
Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges. However, it is mostly used in classification problems. In this algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. We then perform classification by finding the hyperplane that best differentiates the two classes.
Let us create a dataset to understand support vector classification:
# importing the required libraries and make_blobs from scikit-learn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# creating dataset X containing n_samples
# Y containing two classes
X, Y = make_blobs(n_samples=500, centers=2,
                  random_state=0, cluster_std=0.40)
# plotting the scatter plot
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')
plt.show()
Support vector machine is based on the concept of decision planes that define decision boundaries. A decision plane is one that separates between a set of objects with different class memberships. For example, in the figure mentioned below, there are objects which belong to either class Green or Red. The separating line defines a boundary on the right side of which all objects are Green and to the left of which all objects are Red. Any new object (white circle) falling to the right is labeled, i.e., classified, as Green (or classified as Red should it fall to the left of the separating line).
Support vector machines not only draw a line between two classes, but consider a region about the line of some given width. Here’s an example of what it can look like:
# creating line space between -1 to 3.5
xfit = np.linspace(-1, 3.5)
# plotting scatter
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')
# plot a line between the different sets of data
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
                     color='#AAAAAA', alpha=0.4)
plt.xlim(-1, 3.5);
plt.show()
Consider another scenario, where it is clear that a full separation of the Green and Red objects would require a curve, which is more complex than a line. Classification tasks based on drawing separating lines to distinguish between objects of different class memberships are known as hyperplane classification tasks, and Support Vector Machines are particularly suited to handle them.
The figure below shows the basic idea behind Support Vector Machines. Here the original objects (left side of the schematic) are rearranged, or mapped, using a set of mathematical functions called kernels. This process of rearranging objects is known as mapping or transformation. Notice that on the right side of the schematic the objects are linearly separable: all that remains is to find an optimal line that separates the red and green objects.
What is a hyperplane?
The goal of a Support Vector Machine is to find the hyperplane that separates these two classes of objects. Consider another figure, which shows some of the possible hyperplanes that could divide the dataset; choosing the best of them is the real goal. The best hyperplane is the one that leaves the maximum margin for both classes, where the margin is the distance between the hyperplane and the closest data point of either class.
Let us consider two of these hyperplanes and compare their margins, M1 and M2. You will notice that M1 > M2, so the hyperplane that separates the classes best is the one lying between the green and blue planes.
How do we find the right hyperplane?
Now, let us represent the new plane by a linear equation as:
f(x) = ax + b
Let us require that this equation returns values ≥ 1 for all points of the green triangle class and ≤ -1 for all points of the gold star class. The distance of the plane from the closest points of both classes is then at least one, and for those closest points the modulus is exactly one:
f(x) ≥ 1 for triangles and f(x) ≤ -1 for stars, i.e. |f(x)| ≥ 1, with |f(x)| = 1 for the closest points.
The distance between the hyperplane and the point can be computed using the following equation.
M1 = |f(x)| / ||a|| = 1 / ||a||
The total margin is 1 / ||a|| + 1 / ||a|| = 2 / ||a||.
In order to maximize the separability, that is, to maximize the margin 2 / ||a||, we have to minimize ||a||, the norm of the weight vector a. Minimizing the weight vector under these constraints is a constrained (non-linear) optimization task. One way to solve it is to use the Karush-Kuhn-Tucker (KKT) conditions with Lagrange multipliers λi.
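For completeness, the standard constrained formulation (a textbook statement, added here for reference rather than taken from the original figures) is: minimize (1/2)||a||² subject to yi(a·xi + b) ≥ 1 for every training point (xi, yi). The corresponding Lagrangian is L(a, b, λ) = (1/2)||a||² − Σi λi [yi(a·xi + b) − 1] with λi ≥ 0, and the KKT conditions give a = Σi λi yi xi, so only the points with λi > 0 (the support vectors) determine the hyperplane.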
What is a support vector in SVM?
Let's take an example of two points in the two attributes X and Y. We need to find a boundary between these two points that has the maximum distance from both of them. This requirement is represented in the graph depicted next; the optimal points are marked with red circles.
The maximum-margin weight vector is parallel to the line from (1, 1) to (2, 3), i.e. proportional to (1, 2), and the decision boundary is perpendicular to that line, passing through its midpoint (1.5, 2).
So the boundary is x1 + 2x2 - 5.5 = 0, and the geometric margin is computed as √5.
Following are the steps to compute SVMs:
With w = (a, 2a), the margin conditions for the points (1,1) and (2,3) can be written as shown here:
a + 2a + ω0 = -1 for the point (1,1)
2a + 6a + ω0 = 1 for the point (2,3)
The weights can be computed as follows: subtracting the first equation from the second gives 5a = 2, so a = 2/5 and ω0 = -1 - 3a = -11/5, giving w = (2/5, 4/5).
These are the support vectors: (1, 1) and (2, 3), the two points that lie exactly on the margin.
Lastly, the final equation of the decision boundary is as follows: (2/5)x1 + (4/5)x2 - 11/5 = 0, or equivalently x1 + 2x2 - 5.5 = 0.
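As a sanity check (a small sketch added here, not part of the original walkthrough), fitting scikit-learn's linear SVC on just these two points recovers approximately the same hyperplane:
import numpy as np
from sklearn.svm import SVC
X = np.array([[1, 1], [2, 3]])
y = np.array([-1, 1])
clf = SVC(kernel='linear', C=1000)   # a large C approximates a hard margin
clf.fit(X, y)
print(clf.coef_, clf.intercept_)
# roughly [[0.4, 0.8]] and [-2.2], i.e. 0.4*x1 + 0.8*x2 - 2.2 = 0,
# which is the same hyperplane as x1 + 2*x2 - 5.5 = 0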
Large Margin Intuition
In logistic regression, the output of the linear function is taken and squashed into the range [0,1] using the sigmoid function. If the value is greater than a threshold, say 0.5, label 1 is assigned, else label 0.
In the case of support vector machines, the output of the linear function is used directly: if it is greater than 1 we identify the point with one class, and if it is less than -1 we identify it with the other class. Since the threshold values are changed to 1 and -1 in SVM, we obtain a reinforcement range of values ([-1, 1]) which acts as the margin.
Cost Function and Gradient Updates
In the SVM algorithm, we maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is called the hinge loss.
Hinge loss function: for a training sample (xi, yi) with prediction f(xi) = w·xi, the hinge loss is
c(xi, yi, f(xi)) = 0 if yi·f(xi) ≥ 1, and 1 - yi·f(xi) otherwise,
which can be written compactly as max(0, 1 - yi·f(xi)). If the predicted value and the actual value are of the same sign (and the point lies outside the margin), the cost is 0; if not, we calculate the loss value. We also add a regularization parameter to the cost function, whose objective is to balance margin maximization and loss. After adding the regularization parameter, the cost function looks as below.
Loss function for SVM: minimize over w the quantity λ||w||² + Σi max(0, 1 - yi(w·xi)).
Now that we have the loss function, we take partial derivatives with respect to the weights to find the gradients. Using the gradients, we can update our weights.
Gradients:
∂/∂wk (λ||w||²) = 2λwk
∂/∂wk max(0, 1 - yi(w·xi)) = 0 if yi(w·xi) ≥ 1, and -yi·xik otherwise
Gradient update, no misclassification: when the model predicts the class of a data point correctly (yi(w·xi) ≥ 1), we only have to update the weights using the gradient of the regularization term: w = w - α·(2λw).
Gradient update, misclassification: when the model makes a mistake on the prediction of the class of a data point, we include the hinge-loss gradient along with the regularization term: w = w + α·(yi·xi - 2λw).
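To illustrate these update rules, here is a minimal sketch of a linear SVM trained with the hinge-loss gradient updates described above (written for this article; the learning rate, the regularization strength and the added bias term are illustrative choices, not from the original text):
import numpy as np

def train_linear_svm(X, y, lr=0.001, lam=0.01, epochs=1000):
    # y must contain labels -1 and +1
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) >= 1:
                # correct side of the margin: only the regularization gradient
                w -= lr * (2 * lam * w)
            else:
                # inside the margin or misclassified: include the hinge-loss gradient
                w -= lr * (2 * lam * w - yi * xi)
                b += lr * yi
    return w, b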
Let us now start with the code and import the necessary libraries:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler
pd.set_option('display.max_columns', None)
Read the Wisconsin Breast Cancer dataset from the current directory into an object 'data' using the pandas read_csv function:
data = pd.read_csv('wisconsin.csv')
After reading the data, we prepare it as required. Feature scaling is a method used to standardize the range of independent variables or features of the data. Min-max scaling (or min-max normalization) shrinks the range of each feature so that it lies between 0 and 1 (or -1 and 1 if there are negative values).
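The preparation step itself is not shown in the original code, so here is one possible version, assuming the common Kaggle layout of this file (an 'id' column, a 'diagnosis' column with values 'M'/'B', and 30 numeric feature columns); adjust the column names to match your copy of the file:
# column names below are assumptions about the CSV layout
target = data['diagnosis'].map({'M': 1, 'B': 0})
predictor = data.drop(columns=['id', 'diagnosis', 'Unnamed: 32'], errors='ignore')
print(predictor.shape)   # expected to be (569, 30) for this dataset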
sclr = MinMaxScaler()
predictor_sc = sclr.fit_transform(predictor)
predictor_sc.shape
Split the scaled data into train-test split:
x_train_sc,x_test_sc, y_train, y_test = train_test_split(predictor_sc, target, test_size = 0.30, random_state=101)
print("Scaled train and test split")
print("x_train ",x_train_sc.shape)
print("x_test ",x_test_sc.shape)
print("y_train ",y_train.shape)
print("y_test ",y_test.shape)
Scaled train and test split
x_train (398, 30)
x_test (171, 30)
y_train (398,)
y_test (171,)
But what happens when there is no clear hyperplane?
Support Vector Machines can help you find a separating hyperplane, but only if one exists. There are cases where it is not possible to define such a hyperplane, which can happen due to noise in the data; another possible reason is a non-linear boundary. The first graph below depicts noise and the second one shows a non-linear boundary.
For the problems that arise due to noise in the data, the best way is to relax the margin constraints by introducing slack variables, i.e. a soft margin.
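For reference, the soft-margin objective that formalizes this idea (a standard formulation, stated here for completeness) introduces one slack variable ξi ≥ 0 per training point and solves: minimize (1/2)||w||² + C Σi ξi subject to yi(w·xi + b) ≥ 1 - ξi and ξi ≥ 0 for all i; the parameter C controls how heavily margin violations are penalized.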
The non-linear boundary problem can be solved if we introduce a kernel. Some of the kernel functions that can be introduced are mentioned below:
A radial basis function is a real-valued function whose value is dependent on the distance between the input and some fixed point. In machine learning, the radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms.
The RBF kernel on two samples x and x', represented as feature vectors in some input space, is defined as K(x, x') = exp(-||x - x'||² / (2σ²)), which is often written as exp(-γ||x - x'||²) with γ = 1/(2σ²).
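As a quick check (a sketch added here; the sample points and gamma value are arbitrary), the formula above can be computed by hand and compared with scikit-learn's rbf_kernel helper:
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
x = np.array([[1.0, 2.0]])
x_prime = np.array([[2.0, 4.0]])
gamma = 0.5
# manual computation of exp(-gamma * ||x - x'||^2)
manual = np.exp(-gamma * np.sum((x - x_prime) ** 2))
print(manual)                                # approx 0.0821
print(rbf_kernel(x, x_prime, gamma=gamma))   # should match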
Applying SVM with default hyperparameters
Let us get back to the example and apply SVM after data pre-processing, using the default hyperparameters.
Linear Kernel
from sklearn import svm
svm2 = svm.SVC(kernel='linear')
svm2
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
model2 = svm2.fit(x_train_sc, y_train)
y_pred2 = svm2.predict(x_test_sc)
print('Accuracy Score')
print(metrics.accuracy_score(y_test,y_pred2))
Accuracy Score:0.9707602339181286
Gaussian Kernel
svm3 = svm.SVC(kernel='rbf')
svm3
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
model3 = svm3.fit(x_train_sc, y_train)
y_pred3 = svm3.predict(x_test_sc)
print('Accuracy Score')
print(metrics.accuracy_score(y_test, y_pred3))
Accuracy Score:0.935672514619883
Polynomial Kernel
svm4 = svm.SVC(kernel='poly')
svm4
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
model4 = svm4.fit(x_train_sc, y_train)
y_pred4 = svm4.predict(x_test_sc)
print('Accuracy Score')
print(metrics.accuracy_score(y_test,y_pred4))
Accuracy Score:0.6198830409356725
How to tune Parameters of SVM?
Kernel: The kernel in a support vector machine is responsible for transforming the input data into the required form. Some of the kernels used in support vector machines are linear, polynomial and radial basis function (RBF). To create a non-linear decision boundary, we use the RBF or polynomial kernel, and for complex applications you may need more specialized kernels to separate classes that are non-linear in nature. With this transformation, you can obtain more accurate classifiers.
Regularization: Regularization is controlled through Scikit-learn's C parameter. C is a penalty parameter for errors, i.e. misclassifications: it tells the optimizer how much misclassification is bearable and thereby governs the trade-off between the misclassified terms and the width of the decision boundary. With a larger C value, you obtain a hyperplane with a smaller margin that tries to classify every training point correctly, while with a smaller C value you obtain a hyperplane with a larger margin, at the cost of some misclassifications.
Gamma: A lower value of gamma gives a looser fit of the training dataset, while a high value of gamma can make the model fit the training data too closely. A high value of gamma means only the points near a sample influence the separating surface, producing a wiggly boundary, whereas a low value of gamma means points far away are also taken into account, producing a smoother decision boundary.
Do we always need to tune the parameters?
You do not need to tune the parameters in every case: the default values built into the scikit-learn toolkit often work well enough.
Tuning Hyperparameters
The 'C' and 'gamma' hyperparameters
C is the parameter of the soft-margin cost function, which controls the influence of each individual support vector; it trades error penalty for stability. A small C emphasizes the margin while ignoring outliers in the training data (soft margin), while a large C may overfit the training data (hard margin). Thus a very large value of C can cause overfitting and a very small value can cause underfitting, so C must be chosen so that the model generalizes well to unseen data.
The gamma parameter is the inverse of the standard deviation of the RBF kernel (a Gaussian function), which is used as a similarity measure between two points. A small gamma value defines a Gaussian with a large variance: two points can be considered similar even if they are far from each other. A large gamma value defines a Gaussian with a small variance: two points are considered similar only if they are close to each other.
Taking kernel as linear and tuning C hyperparameter
C_range = list(range(1, 26))
acc_score = []
for c in C_range:
    svc = svm.SVC(kernel='linear', C=c)
    scores = cross_val_score(svc, predictor_sc, target, cv=10,
                             scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)
[0.9772210699161695, 0.9772210699161695, 0.9806995938121164,
0.9824539797770286, 0.9789754558810818, 0.9789452078472042,
0.9806995938121164, 0.9789452078472041, 0.9789452078472041,
0.9789452078472041, 0.9806995938121164, 0.9789452078472041,
0.9789452078472041, 0.9772210699161695, 0.9772210699161695,
0.9772210699161695, 0.9772210699161695, 0.9754666839512574,
0.9754666839512574, 0.9754666839512574, 0.9754666839512574,
0.9754666839512574, 0.9754666839512574, 0.9754666839512574,
0.9754666839512574]
Let us visualize the above points:
import matplotlib.pyplot as plt
%matplotlib inline
C_Val_list = list(range(1,26))
plt.plot(C_Val_list,acc_score)
plt.xticks(np.arange(0,27,2))
plt.xlabel('Value of C for SVC')
plt.ylabel('Cross-Validated Accuracy')
From the plot we can see that the accuracy peaks close to 98% somewhere between C=4 and C=5 and then drops.
#Taking a close look at the cross-validation accuracy in the range C(4,5)
C_range = list(np.arange(4, 5, 0.2))
acc_score = []
for c in C_range:
    svc = svm.SVC(kernel='linear', C=c)
    scores = cross_val_score(svc, predictor_sc, target, cv=10,
                             scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)
[0.9824539797770286, 0.9806995938121164, 0.9789754558810818,
0.9789754558810818, 0.9789754558810818]
Accuracy score is highest for C=4
Taking kernel as gaussian and tuning gamma hyperparameter
gamma_range = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
acc_score = []
for g in gamma_range:
    svc = svm.SVC(kernel='rbf', gamma=g)
    scores = cross_val_score(svc, predictor_sc, target, cv=10,
                             scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)
[0.6274274047186933, 0.6274274047186933, 0.9195035001296346,
0.9561651974764496, 0.9806995938121164, 0.9420026359000951,
0.6274274047186933]
Let us visualize the above points:
gamma_range=[0.0001,0.001,0.01,0.1,1,10,100]
# plotting the value of gamma for SVM versus the cross-validated accuracy
plt.plot(gamma_range,acc_score)
plt.xlabel('Value of gamma for SVC ')
plt.xticks(np.arange(0.0001,100,5))
plt.ylabel('Cross-Validated Accuracy')
For gamma between 5 and 100 the kernel performs very poorly.
Let us take a closer look at the cross-validated accuracy for gamma value in between 0 and 5.
gamma_range = list(np.arange(0.1, 5, 0.1))
acc_score = []
for g in gamma_range:
    svc = svm.SVC(kernel='rbf', gamma=g)
    scores = cross_val_score(svc, predictor_sc, target, cv=10,
                             scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)
[0.9561651974764496, 0.9718952553798289, 0.9754051075965776,
0.9737122979863452, 0.9806995938121164, 0.9806995938121164,
0.9806995938121164, 0.9806995938121164, 0.9806995938121164,
0.9806995938121164, 0.9789754558810818, 0.9754969319851352,
0.9754969319851352, 0.9754969319851352, 0.9754969319851352,
0.9737727940541007, 0.9737727940541007, 0.9737727940541007,
0.9737727940541007, 0.9720184080891883, 0.9720184080891883,
0.9720184080891883, 0.9720184080891883, 0.9720184080891883,
0.9720184080891883, 0.9702326938034741, 0.9702326938034741,
0.9702326938034741, 0.9702326938034741, 0.9702326938034741,
0.9702326938034741, 0.9702326938034741, 0.9702326938034741,
0.9666925935528475, 0.9666925935528475, 0.9684167314838821,
0.9684167314838821, 0.9684167314838821, 0.9701711174487941,
0.9701711174487941, 0.96838540316308, 0.9649068792671333,
0.9649068792671333, 0.9649068792671333, 0.9649068792671333,
0.9649068792671333, 0.9649068792671333, 0.963152493302221,
0.963152493302221]
gamma_range=list(np.arange(0.1,5,0.1))
plt.plot(gamma_range,acc_score)
plt.xlabel('Value of gamma for SVC ')
#plt.xticks(np.arange(0.0001,5,5))
plt.ylabel('Cross-Validated Accuracy')
The highest cross-validated accuracy for the rbf kernel remains constant between gamma=0.5 and gamma=1.
Taking polynomial kernel and tuning degree hyperparameter
degree = [2, 3, 4, 5, 6]
acc_score = []
for d in degree:
    svc = svm.SVC(kernel='poly', degree=d)
    scores = cross_val_score(svc, predictor_sc, target, cv=10,
                             scoring='accuracy')
    acc_score.append(scores.mean())
print(acc_score)
[0.8350974418805635, 0.6450652493302222, 0.6274274047186933,
0.6274274047186933, 0.6274274047186933]
plt.plot(degree,acc_score)
plt.xlabel('degrees for SVC ')
plt.ylabel('Cross-Validated Accuracy')
The score is highest for the second-degree polynomial, and there is a drop in the accuracy score as the degree of the polynomial increases. Thus, increasing the polynomial degree results in a highly complex model that overfits here.
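Rather than tuning each hyperparameter separately as above, the GridSearchCV class imported earlier can search over C, gamma and the kernel jointly; here is a minimal sketch (the grid values below are illustrative choices, not tuned recommendations):
param_grid = {
    'kernel': ['linear', 'rbf', 'poly'],
    'C': [0.1, 1, 4, 10],
    'gamma': [0.01, 0.1, 0.5, 1]
}
grid = GridSearchCV(svm.SVC(), param_grid, cv=10, scoring='accuracy')
grid.fit(predictor_sc, target)
print(grid.best_params_)
print(grid.best_score_)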
Advantages and Disadvantages of Support Vector Machine
Advantages of SVM
- SVM Classifiers offer good accuracy and perform faster prediction compared to Naïve Bayes algorithm.
- SVM guarantees optimality: because the underlying optimization problem is convex, the solution is always a global minimum, not a local one.
- SVM can be accessed conveniently from common environments such as Python or MATLAB.
- SVM can be used for both linearly separable and non-linearly separable data: linearly separable data can be handled with a hard margin, while non-linearly separable data calls for a soft margin.
- SVM also lends itself to semi-supervised learning, where it can be applied to both labelled and unlabelled data; this only requires adding a condition to the minimization problem, and is known as the Transductive SVM.
- Feature mapping used to be a heavy computational burden on the overall training of the model. With the help of the kernel trick, SVM can carry out the feature mapping implicitly using simple dot products.
- SVM works well with a clear margin of separation and with high dimensional space.
Disadvantages of SVM
- SVM does not handle text structures well: treating text as plain feature vectors loses sequential information, which leads to poor performance on such data.
- SVM is not suitable for large datasets because of its high training time and it also takes more time in training compared to Naïve Bayes.
- SVM works poorly with overlapping classes and is also sensitive to the type of kernel used.
- In cases where the number of features for each data point exceeds the number of training samples, SVM tends to underperform.
Applications of SVM in Real World
Support vector machines depend on supervised learning algorithms. The main goal of using SVM is to classify unseen data correctly. SVMs can be used to solve various real-world problems:
- Face detection – SVM can be used to classify parts of the image as a face and non-face and create a square boundary around the face.
- Text and hypertext categorization – SVM allows text and hypertext categorization for both inductive and transductive models. It uses training data for classification of documents into different categories. It categorizes based on the score generated and then compares with the threshold value.
- Classification of images – SVMs enhance search accuracy for image classification, providing better accuracy than traditional query-based searching techniques.
- Bioinformatics – It includes classification of proteins and classification of cancer. SVM is used for identifying the classification of genes, patients on the basis of genes and other biological problems.
- Protein fold and remote homology detection – SVM algorithms are applied for protein remote homology detection.
- Handwriting recognition – SVMs are used widely to recognize handwritten characters.
- Generalized predictive control (GPC) – SVM-based GPC can be used to control chaotic dynamics with useful parameters.
Summary
In this article, we looked at the Support Vector Machine algorithm in detail. We discussed the concept behind support vector machines, how they work, and how to implement them in Python. We also looked into how to tune the hyperparameters to build efficient models. Lastly, we went through the advantages and disadvantages of SVM along with various real-world applications of support vector machines.
We have covered most of the topics related to algorithms in our series of machine learning blogs. If you are inspired by the opportunities provided by machine learning, enrol in our Data Science and Machine Learning Courses for more lucrative career options in this landscape.