Support Vector Machines in Machine Learning (SVM): 2022 Guide

Last updated on 21st Oct, 2022
Published 09th Oct, 2019

Machine Learning has been advancing over the years, and so have the questions that are asked in interviews. The focus has shifted from fundamental mathematics questions on Simple Linear Regression and Logistic Regression to more advanced Machine Learning algorithms. One such popular model, which often delivers strong results at a reasonable computational cost, is the Support Vector Machine. 

SVMs are used for classification as well as regression, so a fundamental knowledge of support vector classification and regression is valuable for all Data Science enthusiasts. In this article, we will go over the SVM algorithm to learn how the model works in depth so that you can add the Support Vector Machine algorithm to your Data Science toolkit. Check KnowledgeHut data science part-time Bootcamp for a better understanding of Support Vector Machines in Machine Learning. 

What are Support Vector Machines (SVM) in Machine Learning? 

The SVM model, or Support Vector Machine model, is a popular family of supervised learning models that are used for regression as well as classification analysis. It is based on the statistical learning framework and is known for being robust and effective in multiple use cases. At its core, a support vector machine is a non-probabilistic binary linear classifier; with the help of various kernels, it can also separate classes that are not linearly separable. 

One of the main reasons companies lean towards support vector machine models is that, on many small and medium-sized datasets, they can reach high accuracy with comparatively little computation. One quick point to note here: SVM applications are generally implemented in the field of classification. 

Which kernel to choose, and how to keep the computation minimal, is an important question, especially when we deal with larger datasets. Keeping the computation low is made possible by something called the “kernel trick”. We will deep-dive into this topic in a later section. Let us first get an intuition of support vector machines by looking at a few examples. 

Why are SVMs Used in Machine Learning? 

The two main reasons why support vector machines are used in machine learning are: 

  • Relatively High Accuracy: Compared to more fundamental algorithms, a support vector machine often achieves higher accuracy. This means that when deploying the model in the real world, we tend to see better results. 
  • Minimal Computation Time: Thanks to the “kernel trick”, an SVM can fit non-linear decision boundaries without explicitly computing higher-dimensional features, which reduces computation time. As data scientists, we can therefore get good results faster and with fewer resources, keeping hardware utilization costs down. 

Types of Support Vector Machines Algorithm

In this section, we will understand more about the types of SVM based on the kind of data that we use. This is more specific to classification as that is the primary use case for Support Vector Machines. 

1. Linear SVM 

The Linear Support Vector Machine algorithm is used when we have linearly separable data. In simple language, if we have a dataset that can be classified into two groups using a simple straight line, we call it linearly separable data, and the classifier used for this is known as Linear SVM Classifier. 

2. Non-Linear SVM 

The non-linear support vector machine algorithm is used when we have non-linearly separable data. In simple language, if we have a dataset that cannot be classified into two groups using a simple straight line, we call it non-linearly separable data, and the classifier used for this is known as a Non-Linear SVM classifier. 

Hyperplane and Support Vectors in SVM Algorithm

In this section, we will discuss the hyperplane and the support vectors in SVM in more detail: 

1. Hyperplane 

When given a set of points, there can be multiple ways to separate the classes in an n-dimensional space. When needed, SVM transforms the lower dimensional data into higher dimensional data and then separates out the points. The possible ways to separate the data are called Decision Boundaries. The main idea behind SVM classification is to find the best possible decision boundary, and the hyperplane is that optimal, best-generalizing boundary for the support vector machine classifier. 

For instance, in a two-dimensional space, the hyperplane will be a straight line. In contrast, if the data exists in a three-dimensional space, then the hyperplane will be a two-dimensional plane. In general, for an n-dimensional space, the hyperplane has n-1 dimensions. 

The aim is to create a hyperplane that has the highest possible margin, which yields a generalized model. This means that the distance between the hyperplane and the closest data points of each class is as large as possible. 

2. Support Vectors 

The term support vector refers to the data points that support, or define, the main hyperplane. Support vectors are the points of each class that lie closest to the hyperplane, and they determine its overall position. The larger the distance between the hyperplane and these support vectors, the better the fit is expected to generalize. 
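To make the two terms concrete, here is a minimal sketch using scikit-learn's SVC on a tiny, made-up dataset. The six points and the very large C value (used to approximate a hard margin) are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated classes in 2-D.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# The hyperplane is the set of points where w . x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]

# Signed distance of every training point to the hyperplane.
distances = (X @ w + b) / np.linalg.norm(w)

print("w =", w, "b =", b)
print("support vectors:")
print(clf.support_vectors_)
print("signed distances:", distances)
```

The points reported in support_vectors_ are the ones with the smallest absolute distance to the hyperplane; moving any other point slightly would leave the hyperplane unchanged.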

How Do We Find the Right Hyperplane? 

Now, we come to a great question: how do we find the right hyperplane? Let us try to visualize and understand the two ways in which we find the right hyperplane: 

1. Maximize Margin Between Support Vectors 

The recommended way to find the right hyperplane is by maximizing the margin, that is, the distance between the hyperplane and the support vectors on either side. This is easiest to picture in a two-dimensional space, where the hyperplane is a line. The same can be done in an n-dimensional space, but it is difficult for us to visualize. 

2. Transform Lower Dimensional Data into Higher Dimensional Data 

When we transform lower dimensional data into higher dimensional data with the help of newly created features, points that overlap in the original space can become separable in the higher dimension, and we can then fit a hyperplane that segregates the data more effectively.  

This is done with the help of the following steps: 

  1. Augment the data with some non-linear features that are computed using the existing features 
  2. Find the separating hyperplane in the higher dimensional space 
  3. Project the points back to the original space 
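The three steps above can be sketched as follows. This is an illustrative example rather than what SVM does internally: the ring-shaped dataset from make_circles and the choice of squared and product features are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Assumed example data: one class forms a ring around the other.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Step 1: augment the data with non-linear features built from x and y.
X_aug = np.column_stack([X, X[:, 0] ** 2, X[:, 1] ** 2, X[:, 0] * X[:, 1]])

# Step 2: find a separating hyperplane, once in the original 2-D space
# and once in the augmented 5-D space.
linear_2d = SVC(kernel="linear").fit(X, y)
linear_5d = SVC(kernel="linear").fit(X_aug, y)

# Step 3: the 5-D predictions belong to the original 2-D points, so the
# flat hyperplane in 5-D acts as a curved boundary in the original space.
acc_2d = linear_2d.score(X, y)
acc_5d = linear_5d.score(X_aug, y)
print("training accuracy in 2-D:", acc_2d)
print("training accuracy in 5-D:", acc_5d)
```

A straight line cannot separate a ring from its center, which is why the model in the augmented space is expected to do much better here.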

How Does SVM Work in Machine Learning? 

SVM works on the principle of maximizing the margin, that is, the distance between the separating hyperplane and the support vectors on either side. This gives the widest possible gap between the classes and, thus, a generalized model. You can learn more about SVM in Machine Learning through the data science boot camp. 

1. Linearly Separable Data 

We use kernels in support vector machines. SVM kernels are functions based on which we can transform the data so that it is easier to fit a hyperplane to segregate the points better. 

Linearly separable points consist of points that can be separated by a simple straight line. The line has to have the largest margin possible between the closest points to form a generalized SVM model. 

2. Non-linear Data 

Non-linear data is data that cannot be separated via a simple straight line. We can separate out the classes by mapping the data into a higher dimensional space such that we are able to classify the points. Here, we use higher dimensional features derived from the dataset itself. For instance, with a dataset that is present on the X and Y axis, we will use features such as X², Y², and XY to build a higher dimensional representation, project the data, find the hyperplane, and then map the result back to the original space. 

This is done using a clever trick we will discuss in the next section. In the end, the decision boundary separates the two classes in the same original space.  

3. The Kernel Trick 

The kernel trick is the “superpower” of Support Vector Machines. A Support Vector Machine uses kernels, which are functions that measure the similarity between pairs of points and on the basis of which the points can be segregated. Points that are not linearly separable are, in effect, projected to a higher dimensional space. 

Q. So, what is the “trick” here? 

A. SVM finds the hyperplane as if the non-linear data points had been transformed into a higher dimensional space. However, the points themselves remain the same and are never explicitly transformed: the kernel function computes the dot products that the transformed points would have, working directly on the original points. 

Because the transformation of the points from a lower to a higher dimension only seems to happen, this technique is known as the kernel trick. 
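A small numerical check can make the trick tangible. The sketch below uses a degree-2 polynomial kernel as an assumed example; the feature map phi and the two sample points are hypothetical names and values chosen for illustration.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D point (x, y)."""
    x, y = v
    return np.array([x * x, y * y, np.sqrt(2) * x * y])

def poly2_kernel(a, b):
    """Kernel that works on the original 2-D points only."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

# Route 1: transform both points to 3-D, then take the dot product.
explicit = np.dot(phi(a), phi(b))

# Route 2: never leave the original 2-D space.
implicit = poly2_kernel(a, b)

print(explicit, implicit)
```

Both routes give the same number, but the second one never builds the higher dimensional vectors. For kernels such as RBF, where the implied space is infinite-dimensional, the second route is the only practical one.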

SVM Kernel Functions

We have been talking about SVM kernels for a while now. Let us briefly go over some of the important kernel functions that help transform the data to pass hyperplanes to segregate the data. All the neat tricks we talk about are math; the transformations of the data are performed using linear algebra. We are going to go into a little bit of mathematics now, as this will help give you an intuition of the kernel. 

1. Linear Kernel Function 

The linear kernel is primarily leveraged for linearly separable data. It is used for points that have a linear relationship. 

2. Polynomial Kernel Function

The polynomial kernel raises the dot product of two points (plus an optional constant) to a chosen degree. This is equivalent to representing the data in a higher dimensional space whose features are polynomial combinations of the original features. 

3. RBF (Radial basis Function)

This is one of the most common and widely used kernel functions, and it behaves similarly to a weighted nearest neighbor model. It implicitly maps the given data into an infinite-dimensional space, where the training observations closest to a new data point have the biggest influence on its classification. The ‘Radial’ function in RBF is most commonly Gaussian, with Laplace as an alternative. The ‘Gamma’ hyperparameter controls how quickly the influence of a training point decays with distance. 

4. Sigmoid Function

The sigmoid kernel is based on the hyperbolic tangent function (tanh), which is also found in neural networks, where it is used as an activation function. In certain use cases, it can segregate the data better than the other kernels. 
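As a rough illustration of the four kernels, the sketch below evaluates each one with the helper functions in scikit-learn's sklearn.metrics.pairwise module and compares the result with the formula written out by hand. The two sample points and the values of gamma, coef0 and degree are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel,
    polynomial_kernel,
    rbf_kernel,
    sigmoid_kernel,
)

# Two arbitrary sample points and arbitrary kernel settings.
X = np.array([[1.0, 2.0], [3.0, 0.5]])
gamma, coef0, degree = 0.5, 1.0, 3

# Kernel matrices as computed by scikit-learn.
K_lin = linear_kernel(X)
K_poly = polynomial_kernel(X, degree=degree, gamma=gamma, coef0=coef0)
K_rbf = rbf_kernel(X, gamma=gamma)
K_sig = sigmoid_kernel(X, gamma=gamma, coef0=coef0)

# The same kernels written out by hand.
dots = X @ X.T                                   # pairwise dot products
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

manual_lin = dots                                # x . z
manual_poly = (gamma * dots + coef0) ** degree   # (gamma x . z + r)^d
manual_rbf = np.exp(-gamma * sq_dists)           # exp(-gamma ||x - z||^2)
manual_sig = np.tanh(gamma * dots + coef0)       # tanh(gamma x . z + r)

print(np.allclose(K_lin, manual_lin), np.allclose(K_poly, manual_poly),
      np.allclose(K_rbf, manual_rbf), np.allclose(K_sig, manual_sig))
```

Each entry of a kernel matrix is the similarity between one pair of points, which is all the SVM needs in order to fit its hyperplane.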

That was the support vector machine explained. We have now learned about the various kernels that are used in support vector machine functions. Next, we will go over the SVM classifier python code. 

Simple SVM Classifier [Step-by-Step] 

In this section, we will look over the SVM implementation in Python. We will quickly go over an example of Python code to see a Support Vector Machine in action: 

1. Import the Required Libraries 

The Support Vector Machine is available through the SVC class, which stands for Support Vector Classifier. It is a supervised learning algorithm that is used to perform classification and can be found in the svm module of scikit-learn. For this Python showcase, we will look at two use cases: a dataset with a linear distribution and a dataset with a non-linear distribution. 

from sklearn.svm import SVC  # "Support vector classifier"

2. Import the Required Dataset 

Generally, we would import the required dataset, perform the necessary pre-processing steps, and then analyze and visualize the data. Here, we will instead generate two blobs of points to highlight the power of support vector machines using the kernels that we have discussed in previous sections. 

Our dataset consists of two groups of points, which lets us showcase the linear separation of the data.
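One way to generate such a dataset is scikit-learn's make_blobs function. The sample count, the cluster spread and the random seed below are assumptions made for this sketch, not values taken from the original example.

```python
from sklearn.datasets import make_blobs

# Two clusters ("blobs") of 2-D points; the parameter values are assumed.
X, y = make_blobs(n_samples=50, centers=2, cluster_std=0.60, random_state=0)

print(X.shape)   # one row per point, one column per feature
print(y[:10])    # class label (0 or 1) of the first ten points
```

A scatter plot of X colored by y, for example with matplotlib, shows the two groups that the classifier has to separate.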

Now, if we were to use a linear discriminative classifier, we would attempt to find an optimal straight line between the two sets of data such that we are able to segregate the datasets. There are various lines that can be drawn to segregate the datasets.  

Confused about which one to choose? Remember what we discussed in the previous sections? Our aim is to maximize the margin. In the next section, we will discuss exactly this. 

3. Maximize the Margin 

Now, we need to fit the linear support vector machine so that we can plot the optimal hyperplane to get the best fit model. In this case, we will be using the linear kernel, as the points in the X and Y axis have a linear relationship.

We use the linear kernel within the support vector classifier (SVC) from the scikit-learn package to segregate the datasets appropriately. The aim of the dividing decision boundary is to maximize the margin between the two groups of points. Some of the points touch the margin and are usually indicated separately in plots. These points are critical and are known as support vectors, and they are stored in the support_vectors_ attribute of the fitted classifier. 
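A minimal sketch of this step is shown below. It regenerates an assumed two-blob dataset so that it can run on its own; the dataset parameters are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Assumed two-blob dataset, regenerated here so the sketch is self-contained.
X, y = make_blobs(n_samples=50, centers=2, cluster_std=0.60, random_state=0)

# Fit a support vector classifier with a linear kernel.
model = SVC(kernel="linear")
model.fit(X, y)

# The critical points that define the margin.
print(model.support_vectors_)

# Number of support vectors found for each class.
print(model.n_support_)
```

Only a handful of the 50 points typically end up as support vectors, which is why the fitted model is compact.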

4. Fit the Support Vector Machine Classifier 

Based on hyperparameter tuning, it is to be decided what the best possible model for the given dataset will be. We notice the support vectors here, and the position of the straight dividing line (called hyperplane for n-dimensional data) will change based on how the margins can be maximized. 

Based on the parameters and the number of rows in the train and test data, the position and accuracy of the SVM model will vary. 

5. Decide Kernel Type Based on Data Distribution Type 

Based on the data distribution, it is also possible to have a non-linearly distributed dataset that calls for other kernels. For instance, if we were to use a linear kernel on a non-linearly distributed dataset, for example one class forming a ring around the other, no straight line would separate the classes well. 

If we were to project and transform the two-dimensional data onto a three-dimensional space, the two classes could become separable by a flat plane.

Here, in this case, if we used the RBF kernel, the classes would be segregated successfully, with the hyperplane mapped back to the original points as a curved decision boundary.  
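The comparison can be sketched with a ring-shaped dataset from make_circles, which is an assumed stand-in for the non-linear data described here. The extra feature r is one hypothetical choice of third dimension, not necessarily the one used in the original plots.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Assumed non-linear dataset: an inner cluster surrounded by a ring.
X, y = make_circles(n_samples=100, factor=0.1, noise=0.1, random_state=0)

# A linear kernel struggles, while the RBF kernel separates the classes.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)
print("linear kernel training accuracy:", linear_acc)
print("RBF kernel training accuracy:", rbf_acc)

# A possible third dimension: a radial feature that is large near the
# origin and small far away, lifting the inner cluster above the ring.
r = np.exp(-(X ** 2).sum(axis=1))
print("mean r, inner class:", r[y == 1].mean())
print("mean r, outer class:", r[y == 0].mean())
```

With the RBF kernel there is no need to hand-craft r; the kernel handles the lift to a higher dimension implicitly.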

In this section, we successfully went over some simple Python code to generate relevant datasets and displayed how Support Vector Machines can be used to generate fairly accurate models with minimal computation using the kernel trick. In the next section, we will go over some of the applications of Support Vector Machines which can also be learned via the best online data science courses. 

Applying SVM with Default Hyperparameters

Let us return to the example and apply SVM after pre-processing data with default hyperparameters.  

1. Linear Kernel 

from sklearn import svm, metrics

# Fit an SVC with a linear kernel; all other hyperparameters keep their defaults
svm2 = svm.SVC(kernel='linear')
svm2
# Output:
# SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
#     decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
#     max_iter=-1, probability=False, random_state=None, shrinking=True,
#     tol=0.001, verbose=False)
model2 = svm2.fit(x_train_sc, y_train)  # x_train_sc: scaled training features
y_pred2 = svm2.predict(x_test_sc)       # predict on the scaled test features
print('Accuracy Score:')
print(metrics.accuracy_score(y_test, y_pred2))
# Output:
# Accuracy Score:0.9707602339181286

2. Gaussian Kernel 

# Same workflow with the Gaussian (RBF) kernel
svm3 = svm.SVC(kernel='rbf')
svm3
# Output:
# SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
#     decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
#     max_iter=-1, probability=False, random_state=None, shrinking=True,
#     tol=0.001, verbose=False)
model3 = svm3.fit(x_train_sc, y_train)
y_pred3 = svm3.predict(x_test_sc)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test, y_pred3))
# Output:
# Accuracy Score:0.935672514619883

3. Polynomial Kernel

# Same workflow with the polynomial kernel (degree 3 by default)
svm4 = svm.SVC(kernel='poly')
svm4
# Output:
# SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
#     decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly',
#     max_iter=-1, probability=False, random_state=None, shrinking=True,
#     tol=0.001, verbose=False)
model4 = svm4.fit(x_train_sc, y_train)
y_pred4 = svm4.predict(x_test_sc)
print('Accuracy Score:')
print(metrics.accuracy_score(y_test, y_pred4))
# Output:
# Accuracy Score:0.6198830409356725
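The snippets above rely on x_train_sc, x_test_sc, y_train and y_test, which are prepared outside this section. The following self-contained sketch shows one way the whole comparison could be run end to end. The dataset (scikit-learn's bundled breast cancer data), the 70/30 split and the StandardScaler step are all assumptions, so the accuracy values will not necessarily match the ones reported above.

```python
from sklearn import metrics, svm
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed dataset and split; the original data is not shown in the article.
X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Scale the features using statistics from the training set only.
scaler = StandardScaler().fit(x_train)
x_train_sc = scaler.transform(x_train)
x_test_sc = scaler.transform(x_test)

# Fit one SVC per kernel with default hyperparameters and compare accuracy.
scores = {}
for kernel in ("linear", "rbf", "poly"):
    clf = svm.SVC(kernel=kernel).fit(x_train_sc, y_train)
    scores[kernel] = metrics.accuracy_score(y_test, clf.predict(x_test_sc))
    print(kernel, "accuracy:", scores[kernel])
```

Note that the gamma='auto' shown in the printed estimators reflects older scikit-learn versions; since version 0.22, the default is gamma='scale'.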

How to Tune Parameters of SVM? 

1. Kernel

The kernel in the support vector machine is responsible for transforming the input data into the required form. Some of the kernels used in support vector machines are the linear, polynomial, and radial basis function (RBF) kernels. To create a non-linear decision boundary, we use the RBF or the polynomial kernel. For complex applications, where the classes are not linearly separable, these more flexible kernels can yield more accurate classifiers.  

2. Regularization

In scikit-learn, regularization is controlled through the C parameter. C is a penalty parameter on misclassified or margin-violating training points, so it sets how much error is bearable. In other words, it controls the trade-off between a wide margin and a low training error. With a smaller C value, misclassification is penalized less and you obtain a hyperplane with a larger margin; with a larger C value, misclassification is penalized more and you obtain a hyperplane with a smaller margin.  

3. Gamma

A lower value of gamma creates a looser fit to the training dataset. On the other hand, a higher value of gamma makes the model fit the training data more closely, which can lead to overfitting. With a low gamma, each training point influences a wide region, so even far-away points are considered when calculating the separation line. With a high gamma, only the nearby points are considered. 
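The three hyperparameters can be tuned together with a cross-validated grid search. The sketch below uses scikit-learn's GridSearchCV; the make_moons dataset and the candidate values in the grid are assumptions picked for a quick demonstration, not recommended settings.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Assumed toy dataset with a non-linear class boundary.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Candidate values for kernel, C and gamma (gamma is ignored by 'linear').
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10],
    "gamma": [0.01, 0.1, 1],
}

# Evaluate every combination with five-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```

After fitting, search.best_estimator_ holds a model refit on the full data with the best combination, ready for predict.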

Examples of SVM

Q. What is the main goal of a Classification Algorithm?

The main goal of a classification model in Machine Learning is to separate different classes of points effectively and to generalize. In a two-dimensional (2-D) plane, this means drawing a straight line that linearly separates two classes of points in such a manner that future points also have a high probability of being separated accurately. 

Using this support vector machine example, let us introduce some simple terminology: 

  1. Hyperplane: Similar to how a line can separate points in a two-dimensional space, a hyperplane is the plane that separates out points in an n-dimensional space 
  2. Positive Hyperplane: The margin boundary situated in the positive region, usually drawn as a dotted line. The positive hyperplane passes through the first point in the positive space. 
  3. Negative Hyperplane: The margin boundary situated in the negative region, usually drawn as a dotted line. The negative hyperplane passes through the first point in the negative space. 
  4. Hard Margin: A hard margin allows no misclassification of training points. The SVM model tries to fit the dataset perfectly, which can cause overfitting. It can be used only when the data is linearly separable. 
  5. Soft Margin: A soft margin indicates that the model is flexible in terms of fitting the dataset and is therefore less prone to overfitting. This is used in most cases, in particular when the data is not linearly separable. It allows some extent of misclassification on the training data so that the model fits better on the test dataset. 
  6. Maximum Margin Hyperplane: The decision boundary, usually drawn as a solid line midway between the positive and negative hyperplanes, based on which the points are bifurcated. 

The idea behind selecting the decision boundary is that the larger the margin (the distance between the positive hyperplane and the negative hyperplane), the lower the generalization error tends to be, whereas smaller margins around the decision boundary tend to lead to overfitting. 
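For a linear SVM, the margin width equals 2 / ||w||, where w is the normal vector of the hyperplane. The sketch below compares a softer and a harder margin on an assumed, slightly overlapping two-blob dataset; margin_width is a hypothetical helper and the two C values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Assumed dataset: two blobs that overlap slightly.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.2, random_state=0)

def margin_width(C):
    """Fit a linear SVC and return (margin width, number of support vectors)."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_[0])
    return width, len(clf.support_vectors_)

# Small C: soft margin that tolerates violations. Large C: closer to hard.
soft_width, soft_sv = margin_width(0.01)
hard_width, hard_sv = margin_width(100.0)

print("C=0.01 margin width:", soft_width, "support vectors:", soft_sv)
print("C=100  margin width:", hard_width, "support vectors:", hard_sv)
```

The softer setting is expected to produce the wider margin, at the price of more training points falling inside it.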

Besides this simple yet effective example, a support vector machine is used to perform more complex use cases such as the categorization of text, classification of images and even face detection. 

Applications of Support Vector Machine

In this section, we will go over some of the use cases of support vector machines: 

  • Email Classification: A support vector machine can be used for email classification, where we decide if an email is spam or ham 
  • Face Detection: Leveraging SVM, we can perform face recognition, where we train the model on a dataset of labeled face images and then predict the labels of the images in a test dataset. Comparing the predictions with the true labels shows which images have been incorrectly predicted, and we can also get metrics such as precision, recall and f1-score for the same. 
  • Text Categorization: SVMs are used to train text categorization models in both inductive and transductive settings. The score generated for a document is compared with a threshold value to assign its category. 
  • Handwriting Recognition: SVM can also be used for handwriting recognition, where we are able to convert hand-written text to machine-readable text. 
  • Bioinformatics: This includes cancer classification and protein classification, where we use SVM to identify the classification of patients and genes based on biological markers. 
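As a small, runnable stand-in for these applications, the sketch below trains an SVM on scikit-learn's bundled handwritten digits dataset and prints the precision, recall and f1-score per class. The dataset, the 75/25 split and the RBF kernel are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 8x8 images of handwritten digits, flattened to 64 features per image.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Train an RBF-kernel SVM and predict the digit for each test image.
clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Precision, recall and f1-score for each digit class.
report = classification_report(y_test, y_pred)
print(report)
```

The same pattern of fitting, predicting and reporting applies to face or text data once the inputs have been converted to numeric feature vectors.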

Advantages and Disadvantages of Support Vector Machine

In this section, we will go over the advantages of SVM and disadvantages of SVM: 

Advantages of SVM 

  • When there is a clear margin of separation between various classes, SVM works well. 
  • Memory efficiency is one of the key advantages of SVM, as it uses only a subset of the training points (the support vectors) in the decision function. 
  • SVM tends to be effective when the data exists in a high dimensional space. 
  • It works well when there is a higher number of columns than the number of rows. 
  • It is possible to use various kernel functions to make better models. 

Disadvantages of SVM 

  • SVM tends to achieve better results with smaller, higher-dimensional datasets, as training on large datasets may take a long time. 
  • When we have a large number of features, we need to avoid overfitting, so the choice of kernel and regularization term is critical. 
  • SVM does not directly provide probability estimates for its classifications. They can be computed, but this uses a computationally intensive five-fold cross-validation. 
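The last point can be illustrated with scikit-learn, where passing probability=True to SVC enables probability estimates that are calibrated with an internal five-fold cross-validation. The two-blob dataset below is an assumption made for this sketch.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Assumed two-class dataset.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)

# probability=True makes fitting slower because of the extra
# cross-validation used to calibrate the probability estimates.
clf = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)

# Raw, non-probabilistic scores: signed values relative to the hyperplane.
scores = clf.decision_function(X[:3])

# Calibrated class probabilities: one row per point, one column per class.
proba = clf.predict_proba(X[:3])

print("decision scores:", scores)
print("probabilities:")
print(proba)
```

If probabilities are not needed, leaving probability at its default of False keeps training faster.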

Conclusion

The Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression. The SVM algorithm’s objective is to find a hyperplane in an N-dimensional space that distinctly classifies data points. In this blog, we went over end-to-end questions that may be asked in an interview. We got to know about the kernel trick and understood the various terminology associated with Support Vector Machines. 

We also went through some simple coding examples and how the margin can be maximized with the help of Support Vectors. You can go for KnowledgeHut’s data science part-time boot camp if you are looking forward to getting an excellent job placement. 

Frequently Asked Questions (FAQs) 

1. What are Support Vector Machines with Examples? 

A support vector machine is a set of supervised learning models that can be used for classification as well as regression. It works well on high dimensional data and has fairly high accuracy and minimal computation time, especially with smaller training datasets. Unfortunately, it does not provide a probabilistic estimation of the points. It is used for linearly and non-linearly separable data and is used in cases such as email classification, text categorization, face detection, and handwriting recognition. 

2. Which Type of Classifier is SVM? 

SVMs are maximal-margin classifiers, as compared to some other algorithms, such as Naïve Bayes, which are probabilistic classifiers. With different kernel functions, the same maximal-margin principle can be applied to linear as well as non-linear data. 

3. Which is Better, SVM or Neural Network? 

The model needs to be chosen depending on the use case. Prediction with a neural network is often faster than with a kernel SVM, because the number of support vectors an SVM has to evaluate tends to grow with the size of the training data. In many use cases, neural networks perform better, but they can be computationally expensive to train. 

4. What are the Advantages of SVM? 

There are multiple advantages of SVM. It has good memory efficiency and works well with high-dimensional data (where the number of columns is more than the number of rows). It also works well when the size of the data is small. In addition, it has the advantage of the kernel trick, where various kernels can be used to segregate the data better. 

5. Can We Use SVM for Regression? 

Yes, SVM can be used for regression, and the Support Vector Regression (SVR) class in scikit-learn can be used for the same. It uses similar principles as support vector machines but for regression problems. 
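A minimal sketch of support vector regression with scikit-learn's SVR class is shown below. The noisy sine-wave data and the values of C and epsilon are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.svm import SVR

# Assumed toy regression problem: a noisy sine wave.
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# epsilon sets the width of the tube around the fitted curve inside which
# errors are ignored; C penalizes points that fall outside the tube.
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)

# Coefficient of determination (R^2) on the training data.
r2 = reg.score(X, y)
print("R^2 on the training data:", r2)

# Prediction for a new input value.
print("prediction at x=2.5:", reg.predict([[2.5]])[0])
```

As in classification, only the support vectors, here the points on or outside the epsilon tube, determine the fitted function.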

Profile

Anish Mahapatra

Author

A Lead Data Science consultant for multiple Fortune 500 clients, Anish Mahapatra has helped over 2,000 professionals enter the field of Data Science. He holds an MSc in Data Science, writes for top Data Science publications, and is always happy to help learners. You can follow him on LinkedIn and Instagram.