While many classifiers exist that can classify linearly separable data such as logistic regression, Support Vector Machines can handle highly non-linear problems using a kernel trick which implicitly maps the input vectors to higher-dimensional feature spaces. The transformation rearranges the dataset in such a way that it is then linearly solvable. In this article we are going to look at how SVM works, learn about kernel functions, hyperparameters and pros and cons of SVM along with some of the real life applications of SVM.
Support Vector Machines (SVMs), also known as support vector networks, are a family of extremely powerful models which use method based learning and can be used in classification and regression problems. They aim at finding decision boundaries that separate observations with differing class memberships. In other words, SVM is a discriminative classifier formally defined by a separating hyperplane.
There are several learning models namely:
Let us understand what Kernel method based learning is all about.
In simple terms, a kernel is a similarity function which is fed into a machine learning algorithm. It accepts two inputs and suggests the similarity. For example, suppose we want to classify images, the input data is a key-value pair (image, label). The image data is taken into consideration, features are computed, and a vector of features are fed into the Machine learning algorithm. But in the case of similarity functions, a kernel function can be defined which internally computes the similarity between images, and then feeds into the learning algorithm along with the images and label data. The outcome of this is a classifier.
Perceptron frameworks or Support vector machines work with kernels and use vectors only. Here, the machine learning algorithms are expressed as dot products so that kernel functions can be used.
Feature vectors generally prefer kernels. Its ease of computing makes it one of the key reasons, also, feature vectors need more storage space in comparison to dot products. You can write
Machine learning algorithms to use dot products and later map them to use kernels. This completely avoids the usage of feature vectors. This allows us to work with highly complex, efficient-to-compute, and yet high performing kernels effortlessly, without really developing multi-dimensional vectors.
Let us understand what kernel functions are: The figure shown below represents a 1D function using a simple 1-Dimensional example. Assume that given points are as follows, it will depict a vertical line and no other vertical lines will separate the dataset.
Now, if we consider a 2-Dimensional representation, as shown in the figure below, there is a hyperplane (an arbitrary line in 2-Dimensions) which separates red and blue points, which can be separated using Support Vector Machines.
As we keep increasing dimensional space, the need to be able to separate data will eventually decrease. This mapping, x -> (x, x2), is called the kernel function. In case of growing dimensional space, the computations become more complex and kernel trick needs to be applied to address these computations cheaply.
Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification or regression challenges. However, it is mostly used in classification problems. In this algorithm, each data is plotted in n-dimensional space (where n is the number of features you have) with the value of each feature being the value of a particular coordinate. After that, we perform classification by locating the hyperplane which differentiates both the classes.
Let us create a dataset to understand support vector classification:
# importing scikit learn with make_blobs from sklearn.datasets.samples_generator import make_blobs
# creating datasets X containing n_samples # Y containing two classes X, Y = make_blobs(n_samples=500, centers=2, random_state=0, cluster_std=0.40)
# plotting scatters plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring'); plt.show()
Support vector machine is based on the concept of decision planes that define decision boundaries. A decision plane is one that separates between a set of objects with different class memberships. For example, in the figure mentioned below, there are objects which belong to either class Green or Red. The separating line defines a boundary on the right side of which all objects are Green and to the left of which all objects are Red. Any new object (white circle) falling to the right is labeled, i.e., classified, as Green (or classified as Red should it fall to the left of the separating line).
Support vector machines not only draw a line between two classes, but consider a region about the line of some given width. Here’s an example of what it can look like:
# creating line space between -1 to 3.5 xfit = np.linspace(-1, 3.5) # plotting scatter plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring') # plot a line between the different sets of data for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]: yfit = m * xfit + b plt.plot(xfit, yfit, '-k') plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none', color='#AAAAAA', alpha=0.4)
plt.xlim(-1, 3.5); plt.show()
Another scenario, where it is clear that a full separation of the Green and Red objects would require a curve (which is more complex than a line). Classification tasks based on drawing separating lines to distinguish between objects of different class memberships are known as hyperplane classifiers. Support Vector Machines are particularly suited to handle such tasks.
The figure below shows the basic idea behind Support Vector Machines. Here you will see that the original objects (left side of the schematic) mapped, are rearranged using a set of mathematical functions called kernels. This process of rearranging objects is known as mapping or transformation. You will notice that the right side of the schematic is linearly separable. All we can do is find an optimal line that will separate red and green objects.
The goal of Support Vector Machine is to find the hyperplane which separates these two objects or classes. Let us consider another figure which shows some of the possible hyperplanes which can help in separating or dividing the dataset. It is the choice of the best hyperplane which is also the goal. The best hyperplane is defined by the extent to which a maximum margin is left for both classes. The margin is the distance between the hyperplane and the closest point in the classification.
Let us consider two hyperplanes among all and then check the margins represented by M1 and M2. You will notice that margin M1 > M2, so the choice of the hyperplane which separates the best one is the new plane between the green and blue planes.
Now, let us represent the new plane by a linear equation as:
f(x) = ax + b
Let us consider that this equation delivers all values ≥ 1 from the green triangle class and ≤ -1 for the gold star class. The distance of this plane from the closest points in both the classes is at least one; the modulus is one.
f(x) ≥ 1 for triangles and f(x) ≤ 1 or |f(x)| = 1 for star
The distance between the hyperplane and the point can be computed using the following equation.
M1 = |f(x)| / ||a|| = 1 / ||a||
The total margin is 1 / ||a|| + 1 / ||a|| = 2 / ||a|.
In order to maximize the separability, we will have to maximize the ||a|| value. This particular value is known as a weight vector. We can minimize the weight value which is a non-linear optimization task. One of the methods is to use the Karush-Kuhn-Tucker (KKT) condition, using the Lagrange multiplier λi.
Let's take an example of two points between the two attributes X and Y. We need to find a point between these two points that has a maximum distance between these points. This requirement is represented in the graph depicted next. The optimal point is depicted using the red circle.
The maximum margin weight vector is parallel to the line from (1, 1) to (2, 3). The weight vector is at (1,2), and this becomes a decision boundary that is halfway between and in perpendicular, that passes through (1.5, 2).
So, y = x1 +2x2 − 5.5 and the geometric margin is computed as √5.
Following are the steps to compute SVMs:
With w = (a, 2a) for the functions of the points (1,1) and (2,3) can be represented as shown here:
a + 2a + ω0 = -1 for the point (1,1)
2a + 6a + ω0 = 1 for the point (2,3)
The weights can be computed as follows:
These are the support vectors:
Lastly, the final equation is as follows:
In logistic regression, the output of linear function is taken and the value is squashed within the range of [0,1] using the sigmoid function. If the value is greater than a threshold value, say 0.5, label 1 is assigned else label 0.
In case of support vector machines, the linear function is taken and if the output is greater than 1 and we identify it with one class and if the output is -1, it is identified with another class. Since the threshold values are changed to 1 and -1 in SVM, we obtain this reinforcement range of values([-1,1]) which acts as margin.
In the SVM algorithm, we maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is called the hinge loss.
cost function. The objective of the regularization parameter is to balance the margin maximization and loss. After adding the regularization parameter, the cost functions looks as below.If the predicted value and the actual value are of the same sign, the cost is 0 . If not, we calculate the loss value. We also add a regularization parameter the
Now that we have the loss function, we take partial derivatives with respect to the weights to find the gradients. Using gradients, we can update our weights.
When there is no misclassification, i.e our model correctly predicts the class of our data point, we only have to update the gradient from the regularization parameter.
When there is a misclassification, i.e our model makes a mistake on the prediction of the class of our data point, we include the loss along with the regularization parameter to perform gradient update.
Let us start with a code and import the necessary libraries:
import pandas as pd import numpy as np from sklearn.model_selection import train_test_split from sklearn.model_selection import cross_val_score, GridSearchCV from sklearn import metrics from sklearn.preprocessing import MinMaxScaler pd.set_option('display.max_columns', None)
Read the Wisconsin Breast Cancer dataset using pandas.read_csv function into an object 'data' from the current directory
data = pd.read_csv('wisconsin.csv')
After reading the data, we have prepared the data as per requirement. Feature scaling is a method used to standardize the range of independent variables or features of data. The min-max scaling (or min-max normalization) shrinks the range of feature such that the range is in between 0 and 1 (or -1 to 1 if there are negative values).
sclr = MinMaxScaler() predictor_sc = sclr.fit_transform(predictor)predictor_sc.shape
Split the scaled data into train-test split:
x_train_sc,x_test_sc, y_train, y_test = train_test_split(predictor_sc, target, test_size = 0.30, random_state=101) print("Scaled train and test split") print("x_train ",x_train_sc.shape) print("x_test ",x_test_sc.shape) print("y_train ",y_train.shape) print("y_test ",y_test.shape)
Scaled train and test split x_train (398, 30) x_test (171, 30) y_train (398,) y_test (171,)
Support Vector Machines can probably help you to find a separating hyperplane but only if it exists. There are certain cases when it is not possible to define a hyperplane, this happens due to noise in the data. Another possible reason can be a non-linear boundary. The first graph below depicts noise and the next one shows a non-linear boundary.
There might be cases where there is no possibility to define a hyperplane, which can happen due to noise in the data. In fact, another reason can be a non-linear boundary as well. The following first graph depicts noise and the second one shows a non-linear boundary.
For such problems which arise due to noise in the data, the best way is to reduce the margin itself and introduce slack.
The non-linear boundary problem can be solved if we introduce a kernel. Some of the kernel functions that can be introduced are mentioned below:
A radial basis function is a real-valued function whose value is dependent on the distance between the input and some fixed point. In machine learning, the radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms.
The RBF kernel on two samples x and x', represented as feature vectors in some input space, is defined as:
Let us get back to the example and apply SVM after data pre-processsing with default hyperparameters.
from sklearn import svm svm2 = svm.SVC(kernel='linear') svm2
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
model2 = svm2.fit(x_train_sc, y_train) y_pred2 = svm2.predict(x_test_sc) print('Accuracy Score’) print(metrics.accuracy_score(y_test,y_pred2))
svm3 = svm.SVC(kernel='rbf') svm3
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
model3 = svm3.fit(x_train_sc, y_train) y_pred3 = svm3.predict(x_test_sc) print('Accuracy Score’) print(metrics.accuracy_score(y_test, y_pred3))
svm4 = svm.SVC(kernel='poly') svm4
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
model4 = svm4.fit(x_train_sc, y_train) y_pred4 = svm4.predict(x_test_sc) print('Accuracy Score’) print(metrics.accuracy_score(y_test,y_pred4))
Kernel: Kernel in support vector machine is responsible for the transformation of the input data into the required format. Some of the kernels used in support vector machines are linear, polynomial and radial basis function (RBF). In order to create a non-linear hyperplane, we use RBF and Polynomial function, and for complex applications, you should use more advanced kernels to separate classes that are nonlinear in nature. With this transformation, you can obtain accurate classifiers.
Regularization: Using the Scikit-learn’s C parameters and adjusting we can maintain regularization. C denotes a penalty parameter representing an error or any form of misclassification. This misclassification allows you to understand how much of the error is actually bearable. This helps you nullify the compensation between the misclassified term and the decision boundary. With a smaller C value, you obtain hyperplane of small margin and with a larger C value, hyperplane of larger value is obtained.
Gamma: Lower value of Gamma creates a loose fit of the training dataset. On the other hand, a high value of gamma allows the model to get fit more appropriately. A low value of gamma will only provide consideration to the nearby points for the calculation of a separate plane. However, the high value of gamma will consider all the data-points to calculate the final separation line.
You do not need to tune parameter in all cases. There are inbuilt functions in sklearn tool kit which can be used.
C is the parameter for the soft margin cost function, which controls the influence of each individual support vector. This process involves trading error penalty for stability. Small C tends to emphasize the margin while ignoring the outliers in the training data(Soft Margin), while large C may tend to overfit the training data(Hard Margin). Thus for a very large values we can cause overfitting of the model and for a very small value of C we can cause underfitting.Thus the value of C must be chosen in such a manner that it generalises the unseen data well.
The gamma parameter is the inverse of the standard deviation of the RBF kernel (Gaussian function), which is used as a similarity measure between two points. A small gamma value define a Gaussian function with a large variance. In this case, two points can be considered similar even if are far from each other. On the other hand, a large gamma value define a Gaussian function with a small variance and in this case, two points are considered similar just if they are close to each other.
C_range=list(range(1,26)) acc_score= for c in C_range: svc = svm.SVC(kernel='linear', C=c) scores = cross_val_score(svc, predictor_sc, target, cv=10, scoring='accuracy') acc_score.append(scores.mean()) print(acc_score)
[0.9772210699161695, 0.9772210699161695, 0.9806995938121164, 0.9824539797770286, 0.9789754558810818, 0.9789452078472042, 0.9806995938121164, 0.9789452078472041, 0.9789452078472041, 0.9789452078472041, 0.9806995938121164, 0.9789452078472041, 0.9789452078472041, 0.9772210699161695, 0.9772210699161695, 0.9772210699161695, 0.9772210699161695, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574]
Let us visualize the above points:
import matplotlib.pyplot as plt %matplotlib inline C_Val_list = list(range(1,26)) plt.plot(C_Val_list,acc_score) plt.xticks(np.arange(0,27,2)) plt.xlabel('Value of C for SVC') plt.ylabel('Cross-Validated Accuracy')
From the plot we can see that accuracy has been close to 98% somewhere in between C=4 and C=5 and then it drops.
#Taking a close look at the cross-validation accuracy in the range C(4,5) C_range=list(np.arange(4,5,0.2)) acc_score= for c in C_range: svc = svm.SVC(kernel='linear', C=c) scores = cross_val_score(svc, predictor_sc, target, cv=10, scoring='accuracy') acc_score.append(scores.mean()) print(acc_score)
[0.9824539797770286, 0.9806995938121164, 0.9789754558810818, 0.9789754558810818, 0.9789754558810818] Accuracy score is highest for C=4
gamma_range=[0.0001,0.001,0.01,0.1,1,10,100] acc_score= for g in gamma_range: svc = svm.SVC(kernel='rbf', gamma=g) scores = cross_val_score(svc, predictor_sc, target, cv=10, scoring='accuracy') acc_score.append(scores.mean()) print(acc_score)
[0.6274274047186933, 0.6274274047186933, 0.9195035001296346, 0.9561651974764496, 0.9806995938121164, 0.9420026359000951, 0.6274274047186933]
Let us visualize the above points: gamma_range=[0.0001,0.001,0.01,0.1,1,10,100]
# plotting the value of gamma for SVM versus the cross-validated accuracy plt.plot(gamma_range,acc_score) plt.xlabel('Value of gamma for SVC ') plt.xticks(np.arange(0.0001,100,5)) plt.ylabel('Cross-Validated Accuracy')
For gamma between 5 and 100 the kernel performs very poorly.
Let us take a closer look at the cross-validated accuracy for gamma value in between 0 and 5.
gamma_range=list(np.arange(0.1,5,0.1)) acc_score= for g in gamma_range: svc = svm.SVC(kernel='rbf', gamma=g) scores = cross_val_score(svc, predictor_sc, target, cv=10, scoring='accuracy') acc_score.append(scores.mean()) print(acc_score)
[0.9561651974764496, 0.9718952553798289, 0.9754051075965776, 0.9737122979863452, 0.9806995938121164, 0.9806995938121164, 0.9806995938121164, 0.9806995938121164, 0.9806995938121164, 0.9806995938121164, 0.9789754558810818, 0.9754969319851352, 0.9754969319851352, 0.9754969319851352, 0.9754969319851352, 0.9737727940541007, 0.9737727940541007, 0.9737727940541007, 0.9737727940541007, 0.9720184080891883, 0.9720184080891883, 0.9720184080891883, 0.9720184080891883, 0.9720184080891883, 0.9720184080891883, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9666925935528475, 0.9666925935528475, 0.9684167314838821, 0.9684167314838821, 0.9684167314838821, 0.9701711174487941, 0.9701711174487941, 0.96838540316308, 0.9649068792671333, 0.9649068792671333, 0.9649068792671333, 0.9649068792671333, 0.9649068792671333, 0.9649068792671333, 0.963152493302221, 0.963152493302221]
gamma_range=list(np.arange(0.1,5,0.1)) plt.plot(gamma_range,acc_score) plt.xlabel('Value of gamma for SVC ') #plt.xticks(np.arange(0.0001,5,5)) plt.ylabel('Cross-Validated Accuracy') Text(0,0.5,'Cross-Validated Accuracy')
The highest cross-validated accuracy for rbf kernel remains constant in between gamma=0.5 and gamma=1
degree=[2,3,4,5,6] acc_score= for d in degree: svc = svm.SVC(kernel='poly', degree=d) scores = cross_val_score(svc, predictor_sc, target, cv=10, scoring='accuracy') acc_score.append(scores.mean()) print(acc_score)
[0.8350974418805635, 0.6450652493302222, 0.6274274047186933, 0.6274274047186933, 0.6274274047186933]
plt.plot(degree,acc_score) plt.xlabel('degrees for SVC ') plt.ylabel('Cross-Validated Accuracy') Text(0,0.5,'Cross-Validated Accuracy')
Score is high for second degree polynomial. There is drop in the accuracy score as degree of polynomial increases.Thus increase in polynomial degree results in high complexity of the model.
Support vector machines depend on supervised learning algorithms. The main goal of using SVM is to classify unseen data correctly. SVMs can be used to solve various real-world problems:
In this article, we looked at the machine learning algorithm, Support Vector Machine in detail. We have discussed the concept behind support vector machines, how it works, the process of implementation in Python. We also looked into how to tune its parameters and make efficient models. Lastly, we came across the advantages and disadvantages of SVM along with various real world applications of support vector machines.
We have covered most of the topics related to algorithms in our series of machine learning blogs,click here. If you are inspired by the opportunities provided by machine learning, enrol in our Data Science and Machine Learning Courses for more lucrative career options in this landscape.
Your email address will not be published. Required fields are marked *