Search

Series List Filter

Machine Learning Algorithms: [With Essentials, Principles, Types & Examples covered]

The advancements in Science and Technology are making every step of our daily life more comfortable. Today, the use of Machine learning systems, which is an integral part of Artificial Intelligence, has spiked and is seen playing a remarkable role in every user’s life. For instance, the widely popular, Virtual Personal Assistant being used for playing a music track or setting an alarm, face detection or voice recognition applications are the awesome examples of the machine learning systems that we see everyday. Machine learning, a subset of artificial intelligence, is the ability of a system to learn or predict the user’s needs and perform an expected task without human intervention. The inputs for the desired predictions are taken from user’s previously performed tasks or from relative examples.Why should you choose Machine Learning?Wonder why one should choose Machine Learning? Simply put, machine learning makes complex tasks much easier.  It makes the impossible possible!The following scenarios explain why we should opt for machine learning:During facial recognition and speech processing, it would be tedious to write the codes manually to execute the process, that's where machine learning comes handy.For market analysis, figuring customer preferences or fraud detection, machine learning has become essential.For the dynamic changes that happen in real-time tasks, it would be a challenging ordeal to solve through human intervention alone.Essentials of Machine Learning AlgorithmsTo state simply, machine learning is all about predictions – a machine learning, thinking and predicting what’s next. Here comes the question – what will a machine learn, how will a machine analyze, what will it predict.You have to understand two terms clearly before trying to get answers to these questions:DataAlgorithmDataData is what that is fed to the machine. For example, if you are trying to design a machine that can predict the weather over the next few days, then you should input the past ‘data’ that comprise maximum and minimum air temperatures, the speed of the wind, amount of rainfall, etc. All these come under ‘data’ that your machine will learn, and then analyse later.If we observe carefully, there will always be some pattern or the other in the input data we have. For example, the maximum and minimum ranges of temperatures may fall in the same bracket; or speeds of the wind may be slightly similar for a given season, etc. But, machine learning helps analyse such patterns very deeply. And then it predicts the outcomes of the problem we have designed it for.AlgorithmWhile data is the ‘food’ to the machine, an algorithm is like its digestive system. An algorithm works on the data. It crushes it; analyses it; permutates it; finds the gaps and fills in the blanks.Algorithms are the methods used by machines to work on the data input to them.What to consider before finalizing a Machine Learning algorithm?Depending on the functionality expected from the machine, algorithms range from very basic to highly complex. You should be wise in making a selection of an algorithm that suits your ML needs. Careful consideration and testing are needed before finalizing an algorithm for a purpose.For example, linear regression works well for simple ML functions such as speech analysis. In case, accuracy is your first choice, then slightly higher level functionalities such as Neural networks will do.This concept is called ‘The Explainability- Accuracy Tradeoff’. The following diagram explains this better:Image SourceBesides, with regards to machine learning algorithms, you need to remember the following aspects very clearly:No algorithm is an all-in-one solution to any type of problem; an algorithm that fits a scenario is not destined to fit in another one.Comparison of algorithms mostly does not make sense as each one of it has its own features and functionality. Many factors such as the size of data, data patterns, accuracy needed, the structure of the dataset, etc. play a major role in comparing two algorithms.The Principle behind Machine Learning AlgorithmsAs we learnt, an algorithm churns the given data and finds a pattern among them. Thus, all machine learning algorithms, especially the ones used for supervised learning, follow one similar principle:If the input variables or the data is X and you expect the machine to give a prediction or output Y, the machine will work on as per learning a target function ‘f’, whose pattern is not known to us.Thus, Y= f(X) fits well for every supervised machine learning algorithm. This is otherwise also called Predictive Modeling or Predictive Analysis, which ultimately provides us with the best ever prediction possible with utmost accuracy.Types of Machine Learning AlgorithmsDiving further into machine learning, we will first discuss the types of algorithms it has. Machine learning algorithms can be classified as:Supervised, andUnsupervisedSemi-supervised algorithmsReinforcement algorithmsA brief description of the types of  algorithms is given below:1. Supervised machine learning algorithmsIn this method, to get the output for a new set of user’s input, a model is trained to predict the results by using an old set of inputs and its relative known set of outputs. In other words, the system uses the examples used in the past.A data scientist trains the system on identifying the features and variables it should analyze. After training, these models compare the new results to the old ones and update their data accordingly to improve the prediction pattern.An example: If there is a basket full of fruits, based on the earlier specifications like color, shape and size given to the system, the model will be able to classify the fruits.There are 2 techniques in supervised machine learning and a technique to develop a model is chosen based on the type of data it has to work on.A) Techniques used in Supervised learningSupervised algorithms use either of the following techniques to develop a model based on the type of data.RegressionClassification1. Regression Technique In a given dataset, this technique is used to predict a numeric value or continuous values (a range of numeric values) based on the relation between variables obtained from the dataset.An example would be guessing the price of a house based after a year, based on the current price, total area, locality and number of bedrooms.Another example is predicting the room temperature in the coming hours, based on the volume of the room and current temperature.2. Classification Technique This is used if the input data can be categorized based on patterns or labels.For example, an email classification like recognizing a spam mail or face detection which uses patterns to predict the output.In summary, the regression technique is to be used when predictable data is in quantity and Classification technique is to be used when predictable data is about predicting a label.B) Algorithms that use Supervised LearningSome of the machine learning algorithms which use supervised learning method are:Linear RegressionLogistic RegressionRandom ForestGradient Boosted TreesSupport Vector Machines (SVM)Neural NetworksDecision TreesNaive BayesWe shall discuss some of these algorithms in detail as we move ahead in this post.2. Unsupervised machine learning algorithmsThis method does not involve training the model based on old data, I.e. there is no “teacher” or “supervisor” to provide the model with previous examples.The system is not trained by providing any set of inputs and relative outputs.  Instead, the model itself will learn and predict the output based on its own observations.For example, consider a basket of fruits which are not labeled/given any specifications this time. The model will only learn and organize them by comparing Color, Size and shape.A. Techniques used in unsupervised learningWe are discussing these techniques used in unsupervised learning as under:ClusteringDimensionality ReductionAnomaly detectionNeural networks1. ClusteringIt is the method of dividing or grouping the data in the given data set based on similarities.Data is explored to make groups or subsets based on meaningful separations.Clustering is used to determine the intrinsic grouping among the unlabeled data present.An example where clustering principle is being used is in digital image processing where this technique plays its role in dividing the image into distinct regions and identifying image border and the object.2. Dimensionality reductionIn a given dataset, there can be multiple conditions based on which data has to be segmented or classified.These conditions are the features that the individual data element has and may not be unique.If a dataset has too many numbers of such features, it makes it a complex process to segregate the data.To solve such type of complex scenarios, dimensional reduction technique can be used, which is a process that aims to reduce the number of variables or features in the given dataset without loss of important data.This is done by the process of feature selection or feature extraction.Email Classification can be considered as the best example where this technique was used.3. Anomaly DetectionAnomaly detection is also known as Outlier detection.It is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.Examples of the usage are identifying a structural defect, errors in text and medical problems.4. Neural NetworksA Neural network is a framework for many different machine learning algorithms to work together and process complex data inputs.It can be thought of as a “complex function” which gives some output when an input is given.The Neural Network consists of 3 parts which are needed in the construction of the model.Units or NeuronsConnections or Parameters.Biases.Neural networks are into a wide range of applications such as coastal engineering, hydrology and medicine where they are being used in identifying certain types of cancers.B. Algorithms that use unsupervised learningSome of the most common algorithms in unsupervised learning are:hierarchical clustering,k-meansmixture modelsDBSCANOPTICS algorithmAutoencodersDeep Belief NetsHebbian LearningGenerative Adversarial NetworksSelf-organizing mapWe shall discuss some of these algorithms in detail as we move ahead in this post.3.Semi Supervised AlgorithmsIn case of semi-supervised algorithms, as the name goes, it is a mix of both supervised and unsupervised algorithms. Here both labelled and unlabelled examples exist, and in many scenarios of semi-supervised learning, the count of unlabelled examples is more than that of labelled ones.Classification and regression form typical examples for semi-supervised algorithms.The algorithms under semi-supervised learning are mostly extensions of other methods, and the machines that are trained in the semi-supervised method make assumptions when dealing with unlabelled data.Examples of Semi Supervised Learning:Google Photos are the best example of this model of learning. You must have observed that at first, you define the user name in the picture and teach the features of the user by choosing a few photos. Then the algorithm sorts the rest of the pictures accordingly and asks you in case it gets any doubts during classification.Comparing with the previous supervised and unsupervised types of learning models, we can make the following inferences for semi-supervised learning:Labels are entirely present in case of supervised learning, while for unsupervised learning they are totally absent. Semi-supervised is thus a hybrid mix of both these two.The semi-supervised model fits well in cases where cost constraints are present for machine learning modelling. One can label the data as per cost requirements and leave the rest of the data to the machine to take up.Another advantage of semi-supervised learning methods is that they have the potential to exploit the unlabelled data of a group in cases where data carries important unexploited information.4. Reinforcement LearningIn this type of learning, the machine learns from the feedback it has received. It constantly learns and upgrades its existing skills by taking the feedback from the environment it is in.Markov’s Decision process is the best example of reinforcement learning.In this mode of learning, the machine learns iteratively the correct output. Based on the reward obtained from each iteration,the machine knows what is right and what is wrong. This iteration keeps going till the full range of probable outputs are covered.Process of Reinforcement LearningThe steps involved in reinforcement learning are as shown below:Input state is taken by the agentA predefined function indicates the action to be performedBased on the action, the reward is obtained by the machineThe resulting pair of feedback and action is stored for future purposesExamples of Reinforcement Learning AlgorithmsComputer based games such as chessArtificial hands that are based on roboticsDriverless cars/ self-driven carsMost Used Machine Learning Algorithms - ExplainedIn this section, let us discuss the following most widely used machine learning algorithms in detail:Decision TreesNaive Bayes ClassificationThe AutoencoderSelf-organizing mapHierarchical clusteringOPTICS algorithm1. Decision TreesThis algorithm is an example of supervised learning.A Decision tree is a pictorial representation or a graphical representation which depicts every possible outcome of a decision.The various elements involved here are node, branch and leaf where ‘node’ represents an ‘attribute’, ‘branch’ representing a ‘decision’ and ‘leaf’ representing an ‘outcome’ of the feature after applying that particular decision.A decision tree is just an analogy of how a human thinks to take a decision with yes/no questions.The below decision tree explains a school admission procedure rule, where Age is primarily checked, and if age is < 5, admission is not given to them. And for the kids who are eligible for admission, a check is performed on Annual income of parents where if it is < 3 L p.a. the students are further eligible to get a concession on the fees.2. Naive Bayes ClassificationThis supervised machine learning algorithm is a powerful and fast classifying algorithm, using the Bayes rule in determining the conditional probability and to predict the results.Its popular uses are, face recognition, filtering spam emails, predicting the user inputs in chat by checking communicated text and to label news articles as sports, politics etc.Bayes Rule: The Bayes theorem defines a rule in determining the probability of occurrence of an “Event” when information about “Tests” is provided.“Event” can be considered as the patient having a Heart disease while “tests” are the positive conditions that match with the event3. The AutoencoderIt comes under the category of unsupervised learning using neural networking techniques.An autoencoder is intended to learn or encode a representation for a given data set.This also involves the process of dimensional reduction which trains the network to remove the "noise" signal.In hand, with the reduction, it also works in reconstruction where the model tries to rebuild or generate a representation from the reduced encoding which is equivalent to the original input.I.e. without the loss of important and needed information from the given input, an Autoencoder removes or ignores the unnecessary noise and also works on rebuilding the output.Pic sourceThe most common use of Autoencoder is an application that converts black and white image to color. Based on the content and object in the image (like grass, water, sky, face, dress) coloring is processed.4. Self-organizing mapThis comes under the unsupervised learning method.Self-Organizing Map uses the data visualization technique by operating on a given high dimensional data.The Self-Organizing Map is a two-dimensional array of neurons: M = {m1,m2,......mn}It reduces the dimensions of the data to a map, representing the clustering concept by grouping similar data together.SOM reduces data dimensions and displays similarities among data.SOM uses clustering technique on data without knowing the class memberships of the input data where several units compete for the current object.In short, SOM converts complex, nonlinear statistical relationships between high-dimensional data into simple geometric relationships on a low-dimensional display.5. Hierarchical clusteringHierarchical clustering uses one of the below clustering techniques to determine a hierarchy of clusters.Thus produced hierarchy resembles a tree structure which is called a “Dendrogram”.The techniques used in hierarchical clustering are:K-Means,DBSCAN,Gaussian Mixture Models.The 2 methods in finding hierarchical clusters are:Agglomerative clusteringDivisive clusteringAgglomerative clusteringThis is a bottom-up approach, where each data point starts in its own cluster.These clusters are then joined greedily, by taking the two most similar clusters together and merging them.Divisive clusteringInverse to Agglomerative, this uses a top-down approach, wherein all data points start in the same cluster after which a parametric clustering algorithm like K-Means is used to divide the cluster into two clusters.Each cluster is further divided into two clusters until a desired number of clusters are hit.6. OPTICS algorithmOPTICS is an abbreviation for ordering points to identify the clustering structure.OPTICS works in principle like an extended DB Scan algorithm for an infinite number for a distance parameter which is smaller than a generating distance.From a wide range of parameter settings, OPTICS outputs a linear list of all objects under analysis in clusters based on their density.How to Choose Machine Learning Algorithms in Real TimeWhen implementing algorithms in real time, you need to keep in mind three main aspects: Space, Time, and Output.Besides, you should clearly understand the aim of your algorithm:Do you want to make predictions for the future?Are you just categorizing the given data?Is your targeted task simple or comprises of multiple sub-tasks?The following table will show you certain real-time scenarios and help you to understand which algorithm is best suited to each scenario:Real time scenarioBest suited algorithmWhy this algorithm is the best fit?Simple straightforward data set with no complex computationsLinear RegressionIt takes into account all factors involved and predicts the result with simple error rate explanation.For simple computations, you need not spend much computational power; and linear regression runs with minimal computational power.Classifying already labeled data into sub-labelsLogistic RegressionThis algorithm looks at every data point into two subcategories, hence best for sub-labeling.Logistic regression model works best when you have multiple targets.Sorting unlabelled data into groupsK-Means clustering algorithmThis algorithm groups and clusters data by measuring the spatial distance between each point.You can choose from its sub-types - Mean-Shift algorithm and Density-Based Spatial Clustering of Applications with NoiseSupervised text classification (analyzing reviews, comments, etc.)Naive BayesSimplest model that can perform powerful pre-processing and cleaning of textRemoves filler stop words effectivelyComputationally in-expensiveLogistic regressionSorts words one by one and assigns a probabilityRanks next to Naïve Bayes in simplicityLinear Support Vector Machine algorithmCan be chosen when performance mattersBag-of-words modelSuits best when vocabulary and the measure of known words is known.Image classificationConvolutional neural networkBest suited for complex computations such as analyzing visual cortexesConsumes more computational power and gives the best resultsStock market predictionsRecurrent neural networkBest suited for time-series analysis with well-defined and supervised data.Works efficiently in taking into account the relation between data and its time distribution.How to Run Machine Learning Algorithms?Till now you have learned in detail about various algorithms of machine learning, their features, selection and application in real time.When implementing the algorithm in real time, you can do it in any programming language that works well for machine learning.All that you need to do is use the standard libraries of the programming language that you have chosen and work on them, or program everything from scratch.Need more help? You can check these links for more clarity on coding machine learning algorithms in various programming languages.How To Get Started With Machine Learning Algorithms in RHow to Run Your First Classifier in WekaMachine Learning Algorithm Recipes in scikit-learnWhere do we stand in Machine Learning?Machine learning is slowly making strides into as many fields in our daily life as possible. Some businesses are making it strict to have transparent algorithms that do not affect their business privacy or data security. They are even framing regulations and performing audit trails to check if there is any discrepancy in the above-said data policies.The point to note here is that a machine working on machine learning principles and algorithms give output after processing the data through many nonlinear computations. If one needs to understand how a machine predicts, perhaps it can be possible only through another machine learning algorithm!Applications of Machine LearningCurrently, the role of Machine learning and Artificial Intelligence in human life is intertwined. With the advent of evolving technologies, AI and ML have marked their existence in all possible aspects.Machine learning finds a plethora of applications in several domains of our day to day life. An exhaustive list of fields where machine learning is currently in use now is shown in the diagram here. An explanation for the same follows further below:Financial Services: Banks and financial services are increasingly relying on machine learning to identify financial fraud, portfolio management, identify and suggest good options for investment for customers.Police Department: Apps based on facial recognition and other techniques of machine learning are being used by the police to identify and get hold of criminals.Online Marketing and Sales: Machine learning is helping companies a great deal in studying the shopping and spending patterns of customers and in making personalized product recommendations to them. Machine learning also eases customer support, product recommendations and advertising ideas for e-commerce.Healthcare: Doctors are using machine learning to predict and analyze the health status and disease progress of patients. Machine learning has proven its accuracy in detecting health condition, heartbeat, blood pressure and in identifying certain types of cancer. Advanced techniques of machine learning are being implemented in robotic surgery too.Household Applications: Household appliances that use face detection and voice recognition are gaining popularity as security devices and personal virtual assistants at homes.Oil and Gas: In analyzing underground minerals and carrying out the exploration and mining, geologists and scientists are using machine learning for improved accuracy and reduced investments.Transport: Machine learning can be used to identify the vehicles that are moving in prohibited zones for traffic control and safety monitoring purposes.Social Media: In social media, spam is a big nuisance. Companies are using machine learning to filter spam. Machine learning also aptly solves the purpose of sentiment analysis in social media.Trading and Commerce: Machine learning techniques are being implemented in online trading to automate the process of trading. Machines learn from the past performances of trading and use this knowledge to make decisions about future trading options.Future of Machine LearningMachine learning is already making a difference in the way businesses are offering their services to us, the customers. Voice-based search and preferences based ads are just basic functionalities of how machine learning is changing the face of businesses.ML has already made an inseparable mark in our lives. With more advancement in various fields, ML will be an integral part of all AI systems. ML algorithms are going to be made continuously learning with the day-to-day updating information.With the rapid rate at which ongoing research is happening in this field, there will be more powerful machine learning algorithms to make the way we live even more sophisticated!From 2013- 2017, the patents in the field of machine learning has recorded a growth of 34%, according to IFI Claims Patent Services (Patent Analytics). Also, 60% of the companies in the world are using machine learning for various purposes.A peek into the future trends and growth of machine learning through the reports of Predictive Analytics and Machine Learning (PAML) market shows a 21% CAGR by 2021.ConclusionUltimately, machine learning should be designed as an aid that would support mankind. The notion that automation and machine learning are threats to jobs and human workforce is pretty prevalent. It should always be remembered that machine learning is just a technology that has evolved to ease the life of humans by reducing the needed manpower and to offer increased efficiency at lower costs that too in a shorter time span. The onus of using machine learning in a responsible manner lies in the hands of those who work on/with it.However, stay tuned to an era of artificial intelligence and machine learning that makes the impossible possible and makes you witness the unseen!AI is likely to be the best thing or the worst thing to happen to humanity. – Stephen Hawking
Rated 4.5/5 based on 16 customer reviews

Machine Learning Algorithms: [With Essentials, Principles, Types & Examples covered]

9750
  • by Animikh Aich
  • 03rd May, 2019
  • Last updated on 23rd Sep, 2019
  • 20 mins read
Machine Learning Algorithms: [With Essentials, Principles, Types & Examples covered]

The advancements in Science and Technology are making every step of our daily life more comfortable. Today, the use of Machine learning systems, which is an integral part of Artificial Intelligence, has spiked and is seen playing a remarkable role in every user’s life. 

For instance, the widely popular, Virtual Personal Assistant being used for playing a music track or setting an alarm, face detection or voice recognition applications are the awesome examples of the machine learning systems that we see everyday. 

Machine learning, a subset of artificial intelligence, is the ability of a system to learn or predict the user’s needs and perform an expected task without human intervention. The inputs for the desired predictions are taken from user’s previously performed tasks or from relative examples.

Why should you choose Machine Learning?

Wonder why one should choose Machine Learning? Simply put, machine learning makes complex tasks much easier.  It makes the impossible possible!

The following scenarios explain why we should opt for machine learning:

Why should you choose Machine Learning?

  1. During facial recognition and speech processing, it would be tedious to write the codes manually to execute the process, that's where machine learning comes handy.
  2. For market analysis, figuring customer preferences or fraud detection, machine learning has become essential.
  3. For the dynamic changes that happen in real-time tasks, it would be a challenging ordeal to solve through human intervention alone.

Essentials of Machine Learning Algorithms

To state simply, machine learning is all about predictions – a machine learning, thinking and predicting what’s next. Here comes the question – what will a machine learn, how will a machine analyze, what will it predict.

You have to understand two terms clearly before trying to get answers to these questions:

  • Data
  • Algorithm

Essentials of Machine Learning Algorithms.

Data

Data is what that is fed to the machine. For example, if you are trying to design a machine that can predict the weather over the next few days, then you should input the past ‘data’ that comprise maximum and minimum air temperatures, the speed of the wind, amount of rainfall, etc. All these come under ‘data’ that your machine will learn, and then analyse later.

If we observe carefully, there will always be some pattern or the other in the input data we have. For example, the maximum and minimum ranges of temperatures may fall in the same bracket; or speeds of the wind may be slightly similar for a given season, etc. But, machine learning helps analyse such patterns very deeply. And then it predicts the outcomes of the problem we have designed it for.

Algorithm

 A graphical representation of an Algorithm

While data is the ‘food’ to the machine, an algorithm is like its digestive system. An algorithm works on the data. It crushes it; analyses it; permutates it; finds the gaps and fills in the blanks.

Algorithms are the methods used by machines to work on the data input to them.

What to consider before finalizing a Machine Learning algorithm?

Depending on the functionality expected from the machine, algorithms range from very basic to highly complex. You should be wise in making a selection of an algorithm that suits your ML needs. Careful consideration and testing are needed before finalizing an algorithm for a purpose.

For example, linear regression works well for simple ML functions such as speech analysis. In case, accuracy is your first choice, then slightly higher level functionalities such as Neural networks will do.

This concept is called ‘The Explainability- Accuracy Tradeoff’. The following diagram explains this better:

Explainability-accuracy tradeoff of Machine LearningImage Source

Besides, with regards to machine learning algorithms, you need to remember the following aspects very clearly:

  • No algorithm is an all-in-one solution to any type of problem; an algorithm that fits a scenario is not destined to fit in another one.
  • Comparison of algorithms mostly does not make sense as each one of it has its own features and functionality. Many factors such as the size of data, data patterns, accuracy needed, the structure of the dataset, etc. play a major role in comparing two algorithms.

The Principle behind Machine Learning Algorithms

As we learnt, an algorithm churns the given data and finds a pattern among them. Thus, all machine learning algorithms, especially the ones used for supervised learning, follow one similar principle:

If the input variables or the data is X and you expect the machine to give a prediction or output Y, the machine will work on as per learning a target function ‘f’, whose pattern is not known to us.

Thus, Y= f(X) fits well for every supervised machine learning algorithm. This is otherwise also called Predictive Modeling or Predictive Analysis, which ultimately provides us with the best ever prediction possible with utmost accuracy.

Types of Machine Learning Algorithms

Diving further into machine learning, we will first discuss the types of algorithms it has. Machine learning algorithms can be classified as:

  • Supervised, and
  • Unsupervised
  • Semi-supervised algorithms
  • Reinforcement algorithms

A brief description of the types of  algorithms is given below:

1. Supervised machine learning algorithms

In this method, to get the output for a new set of user’s input, a model is trained to predict the results by using an old set of inputs and its relative known set of outputs. In other words, the system uses the examples used in the past.

A data scientist trains the system on identifying the features and variables it should analyze. After training, these models compare the new results to the old ones and update their data accordingly to improve the prediction pattern.

An example: If there is a basket full of fruits, based on the earlier specifications like color, shape and size given to the system, the model will be able to classify the fruits.

There are 2 techniques in supervised machine learning and a technique to develop a model is chosen based on the type of data it has to work on.

A) Techniques used in Supervised learning

Supervised algorithms use either of the following techniques to develop a model based on the type of data.

  1. Regression
  2. Classification

1. Regression Technique 

  • In a given dataset, this technique is used to predict a numeric value or continuous values (a range of numeric values) based on the relation between variables obtained from the dataset.
  • An example would be guessing the price of a house based after a year, based on the current price, total area, locality and number of bedrooms.
  • Another example is predicting the room temperature in the coming hours, based on the volume of the room and current temperature.

2. Classification Technique 

  • This is used if the input data can be categorized based on patterns or labels.
  • For example, an email classification like recognizing a spam mail or face detection which uses patterns to predict the output.

In summary, the regression technique is to be used when predictable data is in quantity and Classification technique is to be used when predictable data is about predicting a label.

B) Algorithms that use Supervised Learning

Some of the machine learning algorithms which use supervised learning method are:

  • Linear Regression
  • Logistic Regression
  • Random Forest
  • Gradient Boosted Trees
  • Support Vector Machines (SVM)
  • Neural Networks
  • Decision Trees
  • Naive Bayes

We shall discuss some of these algorithms in detail as we move ahead in this post.

2. Unsupervised machine learning algorithms

This method does not involve training the model based on old data, I.e. there is no “teacher” or “supervisor” to provide the model with previous examples.

The system is not trained by providing any set of inputs and relative outputs.  Instead, the model itself will learn and predict the output based on its own observations.

For example, consider a basket of fruits which are not labeled/given any specifications this time. The model will only learn and organize them by comparing Color, Size and shape.

A. Techniques used in unsupervised learning

We are discussing these techniques used in unsupervised learning as under:

  • Clustering
  • Dimensionality Reduction
  • Anomaly detection
  • Neural networks

1. Clustering

  • It is the method of dividing or grouping the data in the given data set based on similarities.
  • Data is explored to make groups or subsets based on meaningful separations.
  • Clustering is used to determine the intrinsic grouping among the unlabeled data present.
  • An example where clustering principle is being used is in digital image processing where this technique plays its role in dividing the image into distinct regions and identifying image border and the object.

2. Dimensionality reduction

  • In a given dataset, there can be multiple conditions based on which data has to be segmented or classified.
  • These conditions are the features that the individual data element has and may not be unique.
  • If a dataset has too many numbers of such features, it makes it a complex process to segregate the data.
  • To solve such type of complex scenarios, dimensional reduction technique can be used, which is a process that aims to reduce the number of variables or features in the given dataset without loss of important data.
  • This is done by the process of feature selection or feature extraction.
  • Email Classification can be considered as the best example where this technique was used.

3. Anomaly Detection

  • Anomaly detection is also known as Outlier detection.
  • It is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.
  • Examples of the usage are identifying a structural defect, errors in text and medical problems.

4. Neural NetworksNeural Networks in Machine learning

  • A Neural network is a framework for many different machine learning algorithms to work together and process complex data inputs.
  • It can be thought of as a “complex function” which gives some output when an input is given.
  • The Neural Network consists of 3 parts which are needed in the construction of the model.
    • Units or Neurons
    • Connections or Parameters.
    • Biases.

Neural networks are into a wide range of applications such as coastal engineering, hydrology and medicine where they are being used in identifying certain types of cancers.

B. Algorithms that use unsupervised learning

Some of the most common algorithms in unsupervised learning are:

  1. hierarchical clustering,
  2. k-means
  3. mixture models
  4. DBSCAN
  5. OPTICS algorithm
  6. Autoencoders
  7. Deep Belief Nets
  8. Hebbian Learning
  9. Generative Adversarial Networks
  10. Self-organizing map

We shall discuss some of these algorithms in detail as we move ahead in this post.

3.Semi Supervised Algorithms

In case of semi-supervised algorithms, as the name goes, it is a mix of both supervised and unsupervised algorithms. Here both labelled and unlabelled examples exist, and in many scenarios of semi-supervised learning, the count of unlabelled examples is more than that of labelled ones.

Classification and regression form typical examples for semi-supervised algorithms.

The algorithms under semi-supervised learning are mostly extensions of other methods, and the machines that are trained in the semi-supervised method make assumptions when dealing with unlabelled data.

Examples of Semi Supervised Learning:

Google Photos are the best example of this model of learning. You must have observed that at first, you define the user name in the picture and teach the features of the user by choosing a few photos. Then the algorithm sorts the rest of the pictures accordingly and asks you in case it gets any doubts during classification.

Comparing with the previous supervised and unsupervised types of learning models, we can make the following inferences for semi-supervised learning:

  • Labels are entirely present in case of supervised learning, while for unsupervised learning they are totally absent. Semi-supervised is thus a hybrid mix of both these two.
  • The semi-supervised model fits well in cases where cost constraints are present for machine learning modelling. One can label the data as per cost requirements and leave the rest of the data to the machine to take up.
  • Another advantage of semi-supervised learning methods is that they have the potential to exploit the unlabelled data of a group in cases where data carries important unexploited information.

4. Reinforcement Learning

In this type of learning, the machine learns from the feedback it has received. It constantly learns and upgrades its existing skills by taking the feedback from the environment it is in.

Markov’s Decision process is the best example of reinforcement learning.

In this mode of learning, the machine learns iteratively the correct output. Based on the reward obtained from each iteration,the machine knows what is right and what is wrong. This iteration keeps going till the full range of probable outputs are covered.

Process of Reinforcement Learning

The steps involved in reinforcement learning are as shown below:

  1. Input state is taken by the agent
  2. A predefined function indicates the action to be performed
  3. Based on the action, the reward is obtained by the machine
  4. The resulting pair of feedback and action is stored for future purposes

Examples of Reinforcement Learning Algorithms

  • Computer based games such as chess
  • Artificial hands that are based on robotics
  • Driverless cars/ self-driven cars

Most Used Machine Learning Algorithms - Explained

In this section, let us discuss the following most widely used machine learning algorithms in detail:

  1. Decision Trees
  2. Naive Bayes Classification
  3. The Autoencoder
  4. Self-organizing map
  5. Hierarchical clustering
  6. OPTICS algorithm

1. Decision Trees

  • This algorithm is an example of supervised learning.
  • A Decision tree is a pictorial representation or a graphical representation which depicts every possible outcome of a decision.
  • The various elements involved here are node, branch and leaf where ‘node’ represents an ‘attribute’, ‘branch’ representing a ‘decision’ and ‘leaf’ representing an ‘outcome’ of the feature after applying that particular decision.
  • A decision tree is just an analogy of how a human thinks to take a decision with yes/no questions.
  • The below decision tree explains a school admission procedure rule, where Age is primarily checked, and if age is < 5, admission is not given to them. And for the kids who are eligible for admission, a check is performed on Annual income of parents where if it is < 3 L p.a. the students are further eligible to get a concession on the fees.

Decision Trees in Machine Learning Algorithm

2. Naive Bayes Classification

  • This supervised machine learning algorithm is a powerful and fast classifying algorithm, using the Bayes rule in determining the conditional probability and to predict the results.
  • Its popular uses are, face recognition, filtering spam emails, predicting the user inputs in chat by checking communicated text and to label news articles as sports, politics etc.
  • Bayes Rule: The Bayes theorem defines a rule in determining the probability of occurrence of an “Event” when information about “Tests” is provided.

Bayes Rule

  • “Event” can be considered as the patient having a Heart disease while “tests” are the positive conditions that match with the event

3. The Autoencoder

  • It comes under the category of unsupervised learning using neural networking techniques.
  • An autoencoder is intended to learn or encode a representation for a given data set.
  • This also involves the process of dimensional reduction which trains the network to remove the "noise" signal.
  • In hand, with the reduction, it also works in reconstruction where the model tries to rebuild or generate a representation from the reduced encoding which is equivalent to the original input.
  • I.e. without the loss of important and needed information from the given input, an Autoencoder removes or ignores the unnecessary noise and also works on rebuilding the output.

 The Autoencoder

Pic source

  • The most common use of Autoencoder is an application that converts black and white image to color. Based on the content and object in the image (like grass, water, sky, face, dress) coloring is processed.

4. Self-organizing map

  • This comes under the unsupervised learning method.
  • Self-Organizing Map uses the data visualization technique by operating on a given high dimensional data.
  • The Self-Organizing Map is a two-dimensional array of neurons: M = {m1,m2,......mn}
  • It reduces the dimensions of the data to a map, representing the clustering concept by grouping similar data together.
  • SOM reduces data dimensions and displays similarities among data.
  • SOM uses clustering technique on data without knowing the class memberships of the input data where several units compete for the current object.
  • In short, SOM converts complex, nonlinear statistical relationships between high-dimensional data into simple geometric relationships on a low-dimensional display.

5. Hierarchical clustering

  • Hierarchical clustering uses one of the below clustering techniques to determine a hierarchy of clusters.
  • Thus produced hierarchy resembles a tree structure which is called a “Dendrogram”.
  • The techniques used in hierarchical clustering are:
    • K-Means,
    • DBSCAN,
    • Gaussian Mixture Models.
  • The 2 methods in finding hierarchical clusters are:
  1. Agglomerative clustering
  2. Divisive clustering
  • Agglomerative clustering

    • This is a bottom-up approach, where each data point starts in its own cluster.
    • These clusters are then joined greedily, by taking the two most similar clusters together and merging them.
  • Divisive clustering

    • Inverse to Agglomerative, this uses a top-down approach, wherein all data points start in the same cluster after which a parametric clustering algorithm like K-Means is used to divide the cluster into two clusters.
    • Each cluster is further divided into two clusters until a desired number of clusters are hit.

6. OPTICS algorithm

  • OPTICS is an abbreviation for ordering points to identify the clustering structure.
  • OPTICS works in principle like an extended DB Scan algorithm for an infinite number for a distance parameter which is smaller than a generating distance.
  • From a wide range of parameter settings, OPTICS outputs a linear list of all objects under analysis in clusters based on their density.

How to Choose Machine Learning Algorithms in Real Time

When implementing algorithms in real time, you need to keep in mind three main aspects: Space, Time, and Output.

Besides, you should clearly understand the aim of your algorithm:

  • Do you want to make predictions for the future?
  • Are you just categorizing the given data?
  • Is your targeted task simple or comprises of multiple sub-tasks?

The following table will show you certain real-time scenarios and help you to understand which algorithm is best suited to each scenario:

Real time scenarioBest suited algorithmWhy this algorithm is the best fit?
Simple straightforward data set with no complex computationsLinear Regression
  • It takes into account all factors involved and predicts the result with simple error rate explanation.
  • For simple computations, you need not spend much computational power; and linear regression runs with minimal computational power.
Classifying already labeled data into sub-labelsLogistic Regression
  • This algorithm looks at every data point into two subcategories, hence best for sub-labeling.
  • Logistic regression model works best when you have multiple targets.
Sorting unlabelled data into groupsK-Means clustering algorithm
  • This algorithm groups and clusters data by measuring the spatial distance between each point.
  • You can choose from its sub-types - Mean-Shift algorithm and Density-Based Spatial Clustering of Applications with Noise
Supervised text classification (analyzing reviews, comments, etc.)Naive Bayes
  • Simplest model that can perform powerful pre-processing and cleaning of text
  • Removes filler stop words effectively
  • Computationally in-expensive
Logistic regression
  • Sorts words one by one and assigns a probability
  • Ranks next to Naïve Bayes in simplicity
Linear Support Vector Machine algorithm
  • Can be chosen when performance matters
Bag-of-words model
  • Suits best when vocabulary and the measure of known words is known.
Image classificationConvolutional neural network
  • Best suited for complex computations such as analyzing visual cortexes
  • Consumes more computational power and gives the best results
Stock market predictionsRecurrent neural network
  • Best suited for time-series analysis with well-defined and supervised data.
  • Works efficiently in taking into account the relation between data and its time distribution.

How to Run Machine Learning Algorithms?

Till now you have learned in detail about various algorithms of machine learning, their features, selection and application in real time.

When implementing the algorithm in real time, you can do it in any programming language that works well for machine learning.

All that you need to do is use the standard libraries of the programming language that you have chosen and work on them, or program everything from scratch.

Need more help? You can check these links for more clarity on coding machine learning algorithms in various programming languages.

How To Get Started With Machine Learning Algorithms in R

How to Run Your First Classifier in Weka

Machine Learning Algorithm Recipes in scikit-learn

Where do we stand in Machine Learning?

Machine learning is slowly making strides into as many fields in our daily life as possible. Some businesses are making it strict to have transparent algorithms that do not affect their business privacy or data security. They are even framing regulations and performing audit trails to check if there is any discrepancy in the above-said data policies.

The point to note here is that a machine working on machine learning principles and algorithms give output after processing the data through many nonlinear computations. If one needs to understand how a machine predicts, perhaps it can be possible only through another machine learning algorithm!

Applications of Machine Learning

Applications of Machine Learning

Currently, the role of Machine learning and Artificial Intelligence in human life is intertwined. With the advent of evolving technologies, AI and ML have marked their existence in all possible aspects.

Machine learning finds a plethora of applications in several domains of our day to day life. An exhaustive list of fields where machine learning is currently in use now is shown in the diagram here. An explanation for the same follows further below:

  1. Financial Services: Banks and financial services are increasingly relying on machine learning to identify financial fraud, portfolio management, identify and suggest good options for investment for customers.
  2. Police DepartmentApps based on facial recognition and other techniques of machine learning are being used by the police to identify and get hold of criminals.
  3. Online Marketing and Sales: Machine learning is helping companies a great deal in studying the shopping and spending patterns of customers and in making personalized product recommendations to them. Machine learning also eases customer support, product recommendations and advertising ideas for e-commerce.
  4. Healthcare: Doctors are using machine learning to predict and analyze the health status and disease progress of patients. Machine learning has proven its accuracy in detecting health condition, heartbeat, blood pressure and in identifying certain types of cancer. Advanced techniques of machine learning are being implemented in robotic surgery too.
  5. Household Applications: Household appliances that use face detection and voice recognition are gaining popularity as security devices and personal virtual assistants at homes.
  6. Oil and Gas: In analyzing underground minerals and carrying out the exploration and mining, geologists and scientists are using machine learning for improved accuracy and reduced investments.
  7. TransportMachine learning can be used to identify the vehicles that are moving in prohibited zones for traffic control and safety monitoring purposes.
  8. Social Media: In social media, spam is a big nuisance. Companies are using machine learning to filter spam. Machine learning also aptly solves the purpose of sentiment analysis in social media.
  9. Trading and Commerce: Machine learning techniques are being implemented in online trading to automate the process of trading. Machines learn from the past performances of trading and use this knowledge to make decisions about future trading options.

Future of Machine Learning

Machine learning is already making a difference in the way businesses are offering their services to us, the customers. Voice-based search and preferences based ads are just basic functionalities of how machine learning is changing the face of businesses.

ML has already made an inseparable mark in our lives. With more advancement in various fields, ML will be an integral part of all AI systems. ML algorithms are going to be made continuously learning with the day-to-day updating information.

With the rapid rate at which ongoing research is happening in this field, there will be more powerful machine learning algorithms to make the way we live even more sophisticated!

From 2013- 2017, the patents in the field of machine learning has recorded a growth of 34%, according to IFI Claims Patent Services (Patent Analytics). Also, 60% of the companies in the world are using machine learning for various purposes.

A peek into the future trends and growth of machine learning through the reports of Predictive Analytics and Machine Learning (PAML) market shows a 21% CAGR by 2021.

Conclusion

Machine Learning should be designed as an aid that would support mankind.

Ultimately, machine learning should be designed as an aid that would support mankind. The notion that automation and machine learning are threats to jobs and human workforce is pretty prevalent. It should always be remembered that machine learning is just a technology that has evolved to ease the life of humans by reducing the needed manpower and to offer increased efficiency at lower costs that too in a shorter time span. The onus of using machine learning in a responsible manner lies in the hands of those who work on/with it.

However, stay tuned to an era of artificial intelligence and machine learning that makes the impossible possible and makes you witness the unseen!

AI is likely to be the best thing or the worst thing to happen to humanity. – Stephen Hawking

Animikh

Animikh Aich

Computer Vision Engineer

Animikh Aich is a Deep Learning enthusiast, currently working as a Computer Vision Engineer. His work includes three International Conference publications and several projects based on Computer Vision and Machine Learning.

Join the Discussion

Your email address will not be published. Required fields are marked *

Suggested Blogs

Boosting and AdaBoost in Machine Learning

Ensemble learning is a strategy in which a group of models are used to find a solution to a challenging problem, by using a strategy and combining diverse machine learning models into one single predictive model.In general, ensemble methods are mainly used for improving the overall performance accuracy of a model and combine several different models, also known as the base learners, to predict the results, instead of using a single model.In one of the articles related to ensemble learning, we have already discussed about the popular ensemble method, Bootstrap Aggregation. Bagging tries to implement similar learners on small sample populations and then takes a mean of all the predictions. It combines Bootstrapping and Aggregation to form one ensemble model. It basically reduces the variance error and helps to avoid overfitting. In this article we will look into the limitations of bagging and how a boosting algorithm can be used to overcome those limitations. We will also learn about various types of boosting algorithms and implement one of them in Python. Let’s get started.What are the limitations of Bagging?Let us recall the concept of bagging and consider a binary classification problem. We are either classifying an observation as 0 or as 1.In bagging, T bootstrap samples are selected, a classifier is fitted on each of these samples, and the models are trained in parallel. In a Random Forest, decision trees are trained in parallel. Then the results of all classifiers are averaged into a bagging classifier:Formula for a Bagging ClassifierLet us consider 3 classifiers and the result for the classification can either be right or wrong. If we plot the results of the 3 classifiers, there are regions in which the classifiers will be wrong. These regions are represented in red in the figure below.Example case in which Bagging works wellThe above example works pretty well as when one classifier is wrong, the two others are correct. By voting classifier, you can achieve a better accuracy. However, there are cases where Bagging does not work properly, when all classifiers are mistaken to be in the same region.Due to this reason, the intuition behind the discovery of Boosting was the following :instead of training parallel models, one should train models sequentiallyeach model should focus on where the performance of the previous classifier was poorWith this intuition, Boosting algorithm was introduced. Let us understand what Boosting is all about.What is Boosting?Boosting is an ensemble modeling technique which attempts to build a strong classifier from the number of weak classifiers. It is done by building a model using weak models in series. First, a model is built from the training data. Then the second model is built which tries to correct the errors present in the first model. This procedure is continued and models are added until either the complete training data set is predicted correctly or the maximum number of models are added.Boosting being a sequential process, each subsequent model attempts to correct the errors of the previous model. It is focused on reducing the bias unlike bagging. It makes the boosting algorithms prone to overfitting. To avoid overfitting, parameter tuning plays an important role in boosting algorithms, which will be discussed in the later part of this article. Some examples of boosting are XGBoost, GBM, ADABOOST etc..How can boosting identify weak learners?To find weak learners, we apply base learning (ML) algorithms with a different distribution. As each time base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process. After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.How do we choose a different distribution for each round?Step 1: The base learner takes all the distributions and assigns equal weight or attention to each observation.Step 2: If there is any prediction error caused by first base learning algorithm, then we pay higher attention to observations having prediction error. Then, we apply the next base learning algorithm.Step 3: Iterate Step 2 till the limit of base learning algorithm is reached or higher accuracy is achieved.Finally, it combines the outputs from weak learner and creates a strong learner which eventually improves the prediction power of the model. Boosting gives higher focus to examples which are mis-classified or have higher errors by preceding weak rules.How would you classify an email as SPAM or not?Our initial approach would be to identify ‘SPAM’ and ‘NOT SPAM’ emails using the following criteria. If: Email has only one image file (promotional image), It’s a SPAM.Email has only link(s), It’s a SPAM.Email body consists of sentences like “You won a prize money of $ xxxxxx”, It’s a SPAM.Email from our official domain “www.knowledgehut.com” , Not a SPAM.Email from known source, Not a SPAM.Individually, these rules are not powerful enough to classify an email into ‘SPAM’ or ‘NOT SPAM’. Therefore, these rules are called as weak learner.To convert weak learner to strong learner, we’ll combine the prediction of each weak learner using methods like:Using average/ weighted averageConsidering prediction has higher voteExample: Above, we have defined 5 weak learners. Out of these 5, 3 are voted as ‘SPAM’ and 2 are voted as ‘Not a SPAM’. In this case, by default, we’ll consider an email as SPAM because we have higher(3) vote for ‘SPAM’Boosting helps in training a series of low performing algorithms, called weak learners, simply by adjusting the error metric over time. Weak learners are considered to be those algorithms whose error rate is slightly under 50% as illustrated below:Weighted errorsLet us consider data points on a 2D plot. Some of the data points will be well classified, others won’t. The weight attributed to each error when computing the error rate is 1/n where n is the number of data points to classify.Now if we apply some weight to the errors :You might now notice that we give more weight to the data points that are not well classified. An illustration of the weighting process is mentioned below:Example of weighting processIn the end, we want to build a strong classifier that may look like the figure mentioned below:Strong ClassifierTree stumpsThere might be a question in your mind about how many classifiers should one implement in order to ensure it works well. And how is each classifier chosen at each step?Well, Tree stumps defines a 1-level decision tree. At each step, we need to find the best stump, i.e the best data split, which will minimize the overall error. You can see a stump as a test, in which the assumption is that everything that lies on one side belongs to class 1, and everything that lies on the other side belongs to class 0.Many such combinations are possible for a tree stump. Let us look into an example to understand how many combinations we face.3 data points to splitWell there are 12 possible combinations. Let us check how.12 StumpsThere are 12 possible “tests” we could make. The “2” on the side of each separating line simply represents the fact that all points on one side could be points that belong to class 0, or to class 1. Therefore, there are 2 tests embedded in it.At each iteration t, we will choose ht the weak classifier that splits best the data, by reducing the overall error rate the most. Recall that the error rate is a modified error rate version that takes into account what has been introduced before.Finding the best splitThe best split is found by identifying at each iteration t, the best weak classifier ht, generally a decision tree with 1 node and 2 leaves (a stump). Let us consider an example of credit defaulter, i.e whether a person who borrowed money will return or not.Identifying the best splitIn this case, the best split at time t is to stump on the Payment history, since the weighted error resulting from this split is minimum.Simply note that decision tree classifiers like these ones can in practice be deeper than a simple stump. This will be considered as a hyper-parameter.Combining classifiersIn the next step we combine the classifiers into a Sign classifier, and depending on which side of the frontier a point will stand, it is classified as 0 or 1. It can be achieved by:Combining classifiersYou can improve the classifier by adding weights on each classifier, to avoid giving the same importance to the different classifiers.AdaBoostPseudo-codePseudo-codeThe key elements to keep in mind are:Z is a constant whose role is to normalize the weights so that they add up to 1αt is a weight that we apply to each classifierThis algorithm is called AdaBoost or Adaptive Boosting. This is one of the most important algorithms among all boosting methods.ComputationBoosting algorithms are generally fast to train, although we consider every stump possible and compute exponentials recursively.Well, if we choose αt and Z properly, the weights that are supposed to change at each step simplify to:Weights after choice of α and ZTypes of Boosting AlgorithmsUnderlying engine used for boosting algorithms can be anything.  It can be decision stamp, margin-maximizing classification algorithm etc. There are many boosting algorithms which use other types of engines such as: AdaBoost (Adaptive Boosting)Gradient Tree BoostingXGBoostIn this article, we will focus on AdaBoost and Gradient Boosting followed by their respective Python codes and a little bit about XGBoost.Where are Boosted algorithms required?Boosted algorithms are mainly used when there is plenty of data to make a prediction and high predictive power is expected. It is used to reduce bias and variance in supervised learning. It combines multiple weak predictors to build strong predictor.The underlying engine used for boosting algorithms can be anything. For instance, AdaBoost is a boosting done on Decision stump. There are many other boosting algorithms which use other types of engine such as:GentleBoostGradient BoostingLPBoostBrownBoostAdaptive BoostingAdaptive Boosting, or most commonly known AdaBoost, is a Boosting algorithm. This algorithm uses the method to correct its predecessor. It pays more attention to under fitted training instances by the previous model. Thus, at every new predictor the focus is more on the complicated cases more than the others.It fits a sequence of weak learners on different weighted training data. It starts by predicting the original data set and gives equal weight to each observation. If prediction is incorrect using the first learner, then it gives higher weight to observation which have been predicted incorrectly. Being an iterative process, it continues to add learner(s) until a limit is reached in the number of models or accuracy.Mostly, AdaBoost uses decision stamps. But, we can use any machine learning algorithm as base learner if it accepts weight on training data set. We can use AdaBoost algorithms for both classification and regression problems.Let us consider the example of the image mentioned above. In order to build an AdaBoost classifier, consider that as a first base classifier a Decision Tree algorithm is trained to make predictions on our training data. Applying the following methodology of AdaBoost, the weight of the misclassified training instances is increased. Then the second classifier is trained and the updated weights are acknowledged. It repeats the procedure over and over again.At the end of every model prediction we end up boosting the weights of the misclassified instances so that the next model does a better job on them, and so on.This sequential learning technique might sound similar to Gradient Descent, except that instead of tweaking a single predictor’s parameter to minimize the cost function, AdaBoost adds predictors to the ensemble, gradually making it better.One disadvantage of this algorithm is that the model cannot be parallelized since each predictor can only be trained after the previous one has been trained and evaluated.Below are the steps for performing the AdaBoost algorithm:Initially, all observations are given equal weights.A model is built on a subset of data.Using this model, predictions are made on the whole dataset.Errors are calculated by comparing the predictions and actual values.While creating the next model, higher weights are given to the data points which were predicted incorrectly.Weights can be determined using the error value. For instance,the higher the error the more is the weight assigned to the observation.This process is repeated until the error function does not change, or the maximum limit of the number of estimators is reached.Hyperparametersbase_estimators: specify the base type estimator, i.e. the algorithm to be used as base learner.n_estimators: It defines the number of base estimators, where the default is 10 but you can increase it in order to obtain a better performance.learning_rate: same impact as in gradient descent algorithmmax_depth: Maximum depth of the individual estimatorn_jobs: indicates to the system how many processors it is allowed to use. Value of ‘-1’ means there is no limit;random_state: makes the model’s output replicable. It will always produce the same results when you give it a fixed value as well as the same parameters and training data.Now, let us take a quick look at how to use AdaBoost in Python using a simple example on handwritten digit recognition.import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn.ensemble import AdaBoostClassifier from sklearn.tree import DecisionTreeClassifier from sklearn.metrics import accuracy_score from sklearn.model_selection import cross_val_score from sklearn.model_selection import cross_val_predict from sklearn.model_selection import train_test_split from sklearn.model_selection import learning_curve from sklearn.datasets import load_digitsLet us load the data :dataset = load_digits() X = dataset['data'] y = dataset['target']X contains arrays of length 64 which are simply flattened 8x8 images. The aim of this dataset is to recognize handwritten digits. Let’s take a look at a given handwritten digit:plt.imshow(X[4].reshape(8,8))If we stick to a Decision Tree Classifier of depth 1 (a stump), here’s how to implement AdaBoost classifier:reg_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1)) scores_ada = cross_val_score(reg_ada, X, y, cv=6) scores_ada.mean()0.2636257855582272And it should head a result of around 26%, which can largely be improved. One of the key parameters is the depth of the sequential decision tree classifiers. How does accuracy improve with depth of the decision trees?score = [] for depth in [1,2,10] : reg_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth)) scores_ada = cross_val_score(reg_ada, X, y, cv=6) score.append(scores_ada.mean()) score[0.2636257855582272, 0.5902852679072207, 0.9527524912410157]And the maximal score is reached for a depth of 10 in this simple example, with an accuracy of 95.3%.Gradient BoostingThis is another very popular Boosting algorithm which works pretty similar to what we’ve seen for AdaBoost. Gradient Boosting works by sequentially adding the previous predictors underfitted predictions to the ensemble, ensuring the errors made previously are corrected.The difference lies in what it does with the underfitted values of its predecessor. Contrary to AdaBoost, which tweaks the instance weights at every interaction, this method tries to fit the new predictor to the residual errors made by the previous predictor.So that you can understand Gradient Boosting it is important to understand Gradient Descent first.Below are the steps for performing the Gradient Boosting algorithm:A model is built on a subset of data.Using this model, predictions are made on the whole dataset.Errors are calculated by comparing the predictions and actual values.A new model is created using the errors calculated as target variable. Our objective is to find the best split to minimize the error.The predictions made by this new model are combined with the predictions of the previous.New errors are calculated using this predicted value and actual value.This process is repeated until the error function does not change, or the maximum limit of the number of estimators is reached.Hyperparametersn_estimators: It controls the number of weak learners.Learning_rate: Controls the contribution of weak learners in the final combination. There is a trade-off between learning_rate and n_estimators.min_samples_split: Minimum number of observation which is required in a node to be considered for splitting. It is used to control overfitting.min_samples_leaf: Minimum samples required in a terminal or leaf node. Lower values should be chosen for imbalanced class problems since the regions in which the minority class will be in the majority will be very small.min_weight_fraction_leaf: similar to the previous but defines a fraction of the total number of observations instead of an integer.max_depth : maximum depth of a tree. Used to control overfitting.max_lead_nodes : maximum number of terminal leaves in a tree. If this is defined max_depth is ignored.max_features : number of features it should consider while searching for the best split.You can tune loss function for better performance.Implementation in PythonYou can find Gradient Boosting function in Scikit-Learn’s library.# for regression from sklearn.ensemble import GradientBoostingRegressor model = GradientBoostingRegressor(n_estimators=3,learning_rate=1) model.fit(X,Y) # for classification from sklearn.ensemble import GradientBoostingClassifier model = GradientBoostingClassifier() model.fit(X,Y)XGBoostXG Boost or Extreme Gradient Boosting is an advanced implementation of the Gradient Boosting. This algorithm has high predictive power and is ten times faster than any other gradient boosting techniques. Moreover, it includes a variety of regularization which reduces overfitting and improves overall performance.AdvantagesIt implements regularization which helps in reducing overfit (Gradient Boosting does not have);It implements parallel processing which is much faster than Gradient Boosting;Allows users to define custom optimization objectives and evaluation criteria adding a whole new dimension to the model;XGBoost has an in-built routine to handle missing values;XGBoost makes splits up to the max_depth specified and then starts pruning the tree backwards and removes splits beyond which there is no positive gain;XGBoost allows a user to run a cross-validation at each iteration of the boosting process and thus it is easy to get the exact optimum number of boosting iterations in a single run.Boosting algorithms represent a different machine learning perspective which is turning a weak model to a stronger one to fix its weaknesses. I hope this article helped you understand how boosting works.We have covered most of the topics related to algorithms in our series of machine learning blogs, click here. If you are inspired by the opportunities provided by machine learning, enroll in our  Data Science and Machine Learning Courses for more lucrative career options in this landscape.
Rated 4.5/5 based on 12 customer reviews
7921
Boosting and AdaBoost in Machine Learning

Ensemble learning is a strategy in which a group o... Read More

Bagging and Random Forest in Machine Learning

In today’s world, innovations happen on a daily basis, rendering all the previous versions of that product, service or skill-set outdated and obsolete. In such a dynamic and chaotic space, how can we make an informed decision without getting carried away by plain hype? To make the right decisions, we must follow a set of processes; investigate the current scenario, chart down your expectations, collect reviews from others, explore your options, select the best solution after weighing the pros and cons, make a decision and take the requisite action. For example, if you are looking to purchase a computer, will you simply walk up to the store and pick any laptop or notebook? It’s highly unlikely that you would do so. You would probably search on Amazon, browse a few web portals where people have posted their reviews and compare different models, checking for their features, specifications and prices. You will also probably ask your friends and colleagues for their opinion. In short, you would not directly jump to a conclusion, but will instead make a decision considering the opinions and reviews of other people as well. Ensemble models in machine learning also operate on a similar manner. They combine the decisions from multiple models to improve the overall performance. The objective of this article is to introduce the concept of ensemble learning and understand algorithms like bagging and random forest which use a similar technique. What is Ensemble Learning? Ensemble methods aim at improving the predictive performance of a given statistical learning or model fitting technique. The general principle of ensemble methods is to construct a linear combination of some model fitting method, instead of using a single fit of the method. An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. Ensemble methods combine several decision trees classifiers to produce better predictive performance than a single decision tree classifier. The main principle behind the ensemble model is that a group of weak learners come together to form a strong learner, thus increasing the accuracy of the model.When we try to predict the target variable using any machine learning technique, the main causes of difference in actual and predicted values are noise, variance, and bias. Ensemble helps to reduce these factors (except noise, which is irreducible error). The noise-related error is mainly due to noise in the training data and can't be removed. However, the errors due to bias and variance can be reduced.The total error can be expressed as follows: Total Error = Bias + Variance + Irreducible Error A measure such as mean square error (MSE) captures all of these errors for a continuous target variable and can be represented as follows: Where, E stands for the expected mean, Y represents the actual target values and fˆ(x) is the predicted values for the target variable. It can be broken down into its components such as bias, variance and noise as shown in the following formula: Using techniques like Bagging and Boosting helps to decrease the variance and increase the robustness of the model. Combinations of multiple classifiers decrease variance, especially in the case of unstable classifiers, and may produce a more reliable classification than a single classifier. Ensemble Algorithm The goal of ensemble algorithms is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator. There are two families of ensemble methods which are usually distinguished: Averaging methods. The driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced.|Examples: Bagging methods, Forests of randomized trees. Boosting methods. Base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.Examples: AdaBoost, Gradient Tree Boosting.Advantages of Ensemble Algorithm Ensemble is a proven method for improving the accuracy of the model and works in most of the cases. Ensemble makes the model more robust and stable thus ensuring decent performance on the test cases in most scenarios. You can use ensemble to capture linear and simple as well nonlinear complex relationships in the data. This can be done by using two different models and forming an ensemble of two. Disadvantages of Ensemble Algorithm Ensemble reduces the model interpret-ability and makes it very difficult to draw any crucial business insights at the end It is time-consuming and thus might not be the best idea for real-time applications The selection of models for creating an ensemble is an art which is really hard to master Basic Ensemble Techniques Max Voting: Max-voting is one of the simplest ways of combining predictions from multiple machine learning algorithms. Each base model makes a prediction and votes for each sample. The sample class with the highest votes is considered in the final predictive class. It is mainly used for classification problems.  Averaging: Averaging can be used while estimating the probabilities in classification tasks. But it is usually used for regression problems. Predictions are extracted from multiple models and an average of the predictions are used to make the final prediction. Weighted Average: Like averaging, weighted averaging is also used for regression tasks. Alternatively, it can be used while estimating probabilities in classification problems. Base learners are assigned different weights, which represent the importance of each model in the prediction. Ensemble Methods Ensemble methods became popular as a relatively simple device to improve the predictive performance of a base procedure. There are different reasons for this: the bagging procedure turns out to be a variance reduction scheme, at least for some base procedures. On the other hand, boosting methods are primarily reducing the (model) bias of the base procedure. This already indicates that bagging and boosting are very different ensemble methods. From the perspective of prediction, random forests is about as good as boosting, and often better than bagging.  Bootstrap Aggregation or Bagging tries to implement similar learners on small sample populations and then takes a mean of all the predictions. It combines Bootstrapping and Aggregation to form one ensemble model Reduces the variance error and helps to avoid overfitting Bagging algorithms include: Bagging meta-estimator Random forest Boosting refers to a family of algorithms which converts weak learner to strong learners. Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. Boosting is focused on reducing the bias. It makes the boosting algorithms prone to overfitting. To avoid overfitting, parameter tuning plays an important role in boosting algorithms. Some examples of boosting are mentioned below: AdaBoost GBM XGBM Light GBM CatBoost Why use ensemble models? Ensemble models help in improving algorithm accuracy as well as the robustness of a model. Both Bagging and Boosting should be known by data scientists and machine learning engineers and especially people who are planning to attend data science/machine learning interviews. Ensemble learning uses hundreds to thousands of models of the same algorithm and then work hand in hand to find the correct classification. You may also consider the fable of the blind men and the elephant to understand ensemble learning, where each blind man found a feature of the elephant and they all thought it was something different. However, if they would work together and discussed among themselves, they might have figured out what it is. Using techniques like bagging and boosting leads to increased robustness of statistical models and decreased variance. Now the question becomes, between these different “B” words. Which is better? Which is better, Bagging or Boosting? There is no perfectly correct answer to that. It depends on the data, the simulation and the circumstances. Bagging and Boosting decrease the variance of your single estimate as they combine several estimates from different models. So the result may be a model with higher stability. If the problem is that the single model gets a very low performance, Bagging will rarely get a better bias. However, Boosting could generate a combined model with lower errors as it optimizes the advantages and reduces pitfalls of the single model. By contrast, if the difficulty of the single model is overfitting, then Bagging is the best option. Boosting for its part doesn’t help to avoid over-fitting; in fact, this technique is faced with this problem itself. For this reason, Bagging is effective more often than Boosting. In this article we will discuss about Bagging, we will cover Boosting in the next post. But first, let us look into the very important concept of bootstrapping. Bootstrap Sampling Sampling is the process of selecting a subset of observations from the population with the purpose of estimating some parameters about the whole population. Resampling methods, on the other hand, are used to improve the estimates of the population parameters. In machine learning, the bootstrap method refers to random sampling with replacement. This sample is referred to as a resample. This allows the model or algorithm to get a better understanding of the various biases, variances and features that exist in the resample. Taking a sample of the data allows the resample to contain different characteristics then it might have contained as a whole. This is demonstrated in figure 1 where each sample population has different pieces, and none are identical. This would then affect the overall mean, standard deviation and other descriptive metrics of a data set. In turn, it can develop more robust models. Bootstrapping is also great for small size data sets that can have a tendency to overfit. In fact, we recommended this to one company who was concerned because their data sets were far from “Big Data”. Bootstrapping can be a solution in this case because algorithms that utilize bootstrapping can be more robust and handle new data sets depending on the methodology chosen(boosting or bagging). The reason behind using the bootstrap method is because it can test the stability of a solution. By using multiple sample data sets and then testing multiple models, it can increase robustness. Perhaps one sample data set has a larger mean than another, or a different standard deviation. This might break a model that was overfit, and not tested using data sets with different variations. One of the many reasons bootstrapping has become very common is because of the increase in computing power. This allows for many times more permutations to be done with different resamples than previously. Bootstrapping is used in both Bagging and Boosting Let us assume we have a sample of ‘n’ values (x) and we’d like to get an estimate of the mean of the sample. mean(x) = 1/n * sum(x) Consider a sample of 100 values (x) and we’d like to get an estimate of the mean of the sample. We can calculate the mean directly from the sample as: We know that our sample is small and that the mean has an error in it. We can improve the estimate of our mean using the bootstrap procedure: Create many (e.g. 1000) random sub-samples of the data set with replacement (meaning we can select the same value multiple times). Calculate the mean of each sub-sample Calculate the average of all of our collected means and use that as our estimated mean for the data Example: Suppose we used 3 re-samples and got the mean values 2.3, 4.5 and 3.3. Taking the average of these we could take the estimated mean of the data to be 3.367. This process can be used to estimate other quantities like the standard deviation and even quantities used in machine learning algorithms, like learned coefficients. While using Python, we do not have to implement the bootstrap method manually. The scikit-learn library provides an implementation that creates a single bootstrap sample of a dataset. The resample() scikit-learn function can be used for sampling. It takes as arguments the data array, whether or not to sample with replacement, the size of the sample, and the seed for the pseudorandom number generator used prior to the sampling. For example, let us create a bootstrap that creates a sample with replacement with 4 observations and uses a value of 1 for the pseudorandom number generator. boot = resample(data, replace=True, n_samples=4, random_state=1)As the bootstrap API does not allow to easily gather the out-of-bag observations that could be used as a test set to evaluate a fit model, in the univariate case we can gather the out-of-bag observations using a simple Python list comprehension. # out of bag observations  oob = [x for x in data if x not in boot]Let us look at a small example and execute it.# scikit-learn bootstrap  from sklearn.utils import resample  # data sample  data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # prepare bootstrap sample  boot = resample(data, replace=True, n_samples=4, random_state=1)  print('Bootstrap Sample: %s' % boot)  # out of bag observations  oob = [x for x in data if x not in boot]  print('OOB Sample: %s' % oob) The output will include the observations in the bootstrap sample and those observations in the out-of-bag sample.Bootstrap Sample: [0.6, 0.4, 0.5, 0.1]  OOB Sample: [0.2, 0.3]Bagging Bootstrap Aggregation, also known as Bagging, is a powerful ensemble method that was proposed by Leo Breiman in 1994 to prevent overfitting. The concept behind bagging is to combine the predictions of several base learners to create a more accurate output. Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees. Suppose there are N observations and M features. A sample from observation is selected randomly with replacement (Bootstrapping). A subset of features are selected to create a model with sample of observations and subset of features. Feature from the subset is selected which gives the best split on the training data. This is repeated to create many models and every model is trained in parallel Prediction is given based on the aggregation of predictions from all the models. This approach can be used with machine learning algorithms that have a high variance, such as decision trees. A separate model is trained on each bootstrap sample of data and the average output of those models used to make predictions. This technique is called bootstrap aggregation or bagging for short. Variance means that an algorithm’s performance is sensitive to the training data, with high variance suggesting that the more the training data is changed, the more the performance of the algorithm will vary. The performance of high variance machine learning algorithms like unpruned decision trees can be improved by training many trees and taking the average of their predictions. Results are often better than a single decision tree. What Bagging does is help reduce variance from models that are might be very accurate, but only on the data they were trained on. This is also known as overfitting. Overfitting is when a function fits the data too well. Typically this is because the actual equation is much too complicated to take into account each data point and outlier. Bagging gets around this by creating its own variance amongst the data by sampling and replacing data while it tests multiple hypothesis(models). In turn, this reduces the noise by utilizing multiple samples that would most likely be made up of data with various attributes(median, average, etc). Once each model has developed a hypothesis. The models use voting for classification or averaging for regression. This is where the “Aggregating” in “Bootstrap Aggregating” comes into play. Each hypothesis has the same weight as all the others. When we later discuss boosting, this is one of the places the two methodologies differ. Essentially, all these models run at the same time, and vote on which hypothesis is the most accurate. This helps to decrease variance i.e. reduce the overfit. Advantages Bagging takes advantage of ensemble learning wherein multiple weak learners outperform a single strong learner.  It helps reduce variance and thus helps us avoid overfitting. Disadvantages There is loss of interpretability of the model. There can possibly be a problem of high bias if not modeled properly. While bagging gives us more accuracy, it is computationally expensive and may not be desirable depending on the use case. There are many bagging algorithms of which perhaps the most prominent would be Random Forest.  Decision Trees Decision trees are simple but intuitive models. Using a top-down approach, a root node creates binary splits unless a particular criteria is fulfilled. This binary splitting of nodes results in a predicted value on the basis of the interior nodes which lead to the terminal or the final nodes. For a classification problem, a decision tree will output a predicted target class for each terminal node produced. We have covered decision tree algorithm  in detail for both classification and regression in another article. Limitations to Decision Trees Decision trees tend to have high variance when they utilize different training and test sets of the same data, since they tend to overfit on training data. This leads to poor performance when new and unseen data is added. This limits the usage of decision trees in predictive modeling. However, using ensemble methods, models that utilize decision trees can be created as a foundation for producing powerful results. Bootstrap Aggregating Trees We have already discussed about bootstrap aggregating (or bagging), we can create an ensemble (forest) of trees where multiple training sets are generated with replacement, meaning data instances. Once the training sets are created, a CART model can be trained on each subsample. Features of Bagged Trees Reduces variance by averaging the ensemble's results. The resulting model uses the entire feature space when considering node splits. Bagging trees allow the trees to grow without pruning, reducing the tree-depth sizes and resulting in high variance but lower bias, which can help improve predictive power. Limitations to Bagging Trees The main limitation of bagging trees is that it uses the entire feature space when creating splits in the trees. Suppose some variables within the feature space are indicating certain predictions, there is a risk of having a forest of correlated trees, which actually  increases bias and reduces variance. Why a Forest is better than One Tree?The main objective of a machine learning model is to generalize properly to new and unseen data. When we have a flexible model, overfitting takes place. A flexible model is said to have high variance because the learned parameters (such as the structure of the decision tree) will vary with the training data. On the other hand, an inflexible model is said to have high bias as it makes assumptions about the training data. An inflexible model may not have the capacity to fit even the training data and in both cases — high variance and high bias — the model is not able to generalize new and unseen data properly. You can through the article on one of the foundational concepts in machine learning, bias-variance tradeoff which will help you understand that the balance between creating a model that is so flexible memorizes the training data and an inflexible model cannot learn the training data.  The main reason why decision tree is prone to overfitting when we do not limit the maximum depth is because it has unlimited flexibility, which means it keeps growing unless there is one leaf node for every single observation. Instead of limiting the depth of the tree which results in reduced variance and increase in bias, we can combine many decision trees into a single ensemble model known as the random forest. What is Random Forest algorithm? Random forest is like bootstrapping algorithm with Decision tree (CART) model. Suppose we have 1000 observations in the complete population with 10 variables. Random forest will try to build multiple CART along with different samples and different initial variables. It will take a random sample of 100 observations and then chose 5 initial variables randomly to build a CART model. It will go on repeating the process say about 10 times and then make a final prediction on each of the observations. Final prediction is a function of each prediction. This final prediction can simply be the mean of each prediction. The random forest is a model made up of many decision trees. Rather than just simply averaging the prediction of trees (which we could call a “forest”), this model uses two key concepts that gives it the name random: Random sampling of training data points when building trees Random subsets of features considered when splitting nodes How the Random Forest Algorithm Works The basic steps involved in performing the random forest algorithm are mentioned below: Pick N random records from the dataset. Build a decision tree based on these N records. Choose the number of trees you want in your algorithm and repeat steps 1 and 2. In case of a regression problem, for a new record, each tree in the forest predicts a value for Y (output). The final value can be calculated by taking the average of all the values predicted by all the trees in the forest. Or, in the case of a classification problem, each tree in the forest predicts the category to which the new record belongs. Finally, the new record is assigned to the category that wins the majority vote. Using Random Forest for Regression Here we have a problem where we have to predict the gas consumption (in millions of gallons) in 48 US states based on petrol tax (in cents), per capita income (dollars), paved highways (in miles) and the proportion of population with the driving license. We will use the random forest algorithm via the Scikit-Learn Python library to solve this regression problem. First we import the necessary libraries and our dataset. import pandas as pd  import numpy as np  dataset = pd.read_csv('/content/petrol_consumption.csv')  dataset.head() Petrol_taxAverage_incomepaved_HighwaysPopulation_Driver_licence(%)Petrol_Consumption09.0357119760.52554119.0409212500.57252429.0386515860.58056137.5487023510.52941448.043994310.544410You will notice that the values in our dataset are not very well scaled. Let us scale them down before training the algorithm. Preparing Data For Training We will perform two tasks in order to prepare the data. Firstly we will divide the data into ‘attributes’ and ‘label’ sets. The resultant will then be divided into training and test sets. X = dataset.iloc[:, 0:4].values  y = dataset.iloc[:, 4].valuesNow let us divide the data into training and testing sets:from sklearn.model_selection import train_test_split  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)Feature Scaling The dataset is not yet a scaled value as you will see that the Average_Income field has values in the range of thousands while Petrol_tax has values in the range of tens. It will be better if we scale our data. We will use Scikit-Learn's StandardScaler class to do the same. # Feature Scaling  from sklearn.preprocessing import StandardScaler  sc = StandardScaler()  X_train = sc.fit_transform(X_train)  X_test = sc.transform(X_test)Training the Algorithm Now that we have scaled our dataset, let us train the random forest algorithm to solve this regression problem. from sklearn.ensemble import Random Forest Regressor  regressor = Random Forest Regressor(n_estimators=20,random_state=0)  regressor.fit(X_train, y_train)  y_pred = regressor.predict(X_test)The RandomForestRegressor is used to solve regression problems via random forest. The most important parameter of the RandomForestRegressor class is the n_estimators parameter. This parameter defines the number of trees in the random forest. Here we started with n_estimator=20 and check the performance of the algorithm. You can find details for all of the parameters of RandomForestRegressor here. Evaluating the Algorithm Let us evaluate the performance of the algorithm. For regression problems the metrics used to evaluate an algorithm are mean absolute error, mean squared error, and root mean squared error.  from sklearn import metrics  print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred))) Mean Absolute Error: 51.76500000000001 Mean Squared Error: 4216.166749999999 Root Mean Squared Error: 64.93201637097064 With 20 trees, the root mean squared error is 64.93 which is greater than 10 percent of the average petrol consumption i.e. 576.77. This may indicate, among other things, that we have not used enough estimators (trees). Let us now change the number of estimators to 200, the results are as follows: Mean Absolute Error: 48.33899999999999 Mean Squared Error: 3494.2330150000003  Root Mean Squared Error: 59.112037818028234 The graph below shows the decrease in the value of the root mean squared error (RMSE) with respect to number of estimators.  You will notice that the error values decrease with the increase in the number of estimators. You may consider 200 a good number for n_estimators as the rate of decrease in error diminishes. You may try playing around with other parameters to figure out a better result. Using Random Forest for ClassificationNow let us consider a classification problem to predict whether a bank currency note is authentic or not based on four attributes i.e. variance of the image wavelet transformed image, skewness, entropy, andkurtosis of the image. We will use Random Forest Classifier to solve this binary classification problem. Let’s get started. import pandas as pd  import numpy as np  dataset = pd.read_csv('/content/bill_authentication.csv')  dataset.head()VarianceSkewnessKurtosisEntropyClass03.621608.6661-2.8073-0.44699014.545908.1674-2.4586-1.46210023.86600-2.63831.92420.10645033.456609.5228-4.0112-3.59440040.32924-4.45524.5718-0.988800Similar to the data we used previously for the regression problem, this data is not scaled. Let us prepare the data for training. Preparing Data For Training The following code divides data into attributes and labels: X = dataset.iloc[:, 0:4].values  y = dataset.iloc[:, 4].values The following code divides data into training and testing sets:from sklearn.model_selection import train_test_split  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0) Feature Scaling We will do the same thing as we did for the previous problem. # Feature Scaling  from sklearn.preprocessing import StandardScaler  sc = StandardScaler()  X_train = sc.fit_transform(X_train)  X_test = sc.transform(X_test)Training the Algorithm Now that we have scaled our dataset, let us train the random forest algorithm to solve this classification problem. from sklearn.ensemble import Random Forest Classifier  classifier = RandomForestClassifier(n_estimators=20, random_state=0)  classifier.fit(X_train, y_train)  y_pred = classifier.predict(X_test)For classification, we have used RandomForestClassifier class of the sklearn.ensemble library. It takes n_estimators as a parameter. This parameter defines the number of trees in out random forest. Similar to the regression problem, we have started with 20 trees here. You can find details for all of the parameters of Random Forest Classifier here. Evaluating the Algorithm For evaluating classification problems,  the metrics used are accuracy, confusion matrix, precision recall, and F1 valuesfrom sklearn.metrics import classification_report, confusion_matrix, accuracy_score  print(confusion_matrix(y_test,y_pred))  print(classification_report(y_test,y_pred))  print(accuracy_score(y_test, y_pred)) The output will look something like this: Output:[ [ 155   2] [     1  117] ]Precisionrecallf1-scoresupport00.990.990.9915710.980.990.99118accuracy0.99275macro avg0.990.990.992750.98909090909090910.990.990.99275The accuracy achieved by our random forest classifier with 20 trees is 98.90%. Let us change the number of trees to 200.from sklearn.ensemble import Random Forest Classifier  classifier = Random Forest Classifier(n_estimators=200, random_state=0)  classifier.fit(X_train, y_train)  y_pred = classifier.predict(X_test) Output:[ [ 155   2] [     1  117] ]Precisionrecallf1-scoresupport00.990.990.9915710.980.990.99118accuracy0.99275macro avg0.990.990.992750.98909090909090910.990.990.99275Unlike the regression problem, changing the number of estimators for this problem did not make any difference in the results.An accuracy of 98.9% is pretty good. In this case, we have seen that there is not much improvement if the number of trees are increased. You may try playing around with other parameters of the RandomForestClassifier class and see if you can improve on our results. Advantages and Disadvantages of using Random Forest As with any algorithm, there are advantages and disadvantages to using it. Let us look into the pros and cons of using Random Forest for classification and regression. Advantages Random forest algorithm is unbiased as there are multiple trees and each tree is trained on a subset of data.  Random Forest algorithm is very stable. Introducing a new data in the dataset does not affect much as the new data impacts one tree and is pretty hard to impact all the trees. The random forest algorithm works well when you have both categorical and numerical features. With missing values in the dataset, the random forest algorithm performs very well. Disadvantages A major disadvantage of random forests lies in their complexity. More computational resources are required and also results in the large number of decision trees joined together. Due to their complexity, training time is more compared to other algorithms. Summary In this article we have covered what is ensemble learning and discussed about basic ensemble techniques. We also looked into bootstrap sampling involves iteratively resampling of a dataset with replacement which allows the model or algorithm to get a better understanding various features. Then we moved on to bagging followed by random forest. We also implemented random forest in Python for both regression and classification and came to a conclusion that increasing number of trees or estimators does not always make a difference in a classification problem. However, in regression there is an impact.  We have covered most of the topics related to algorithms in our series of machine learning blogs,click here. If you are inspired by the opportunities provided by machine learning, enrol in our  Data Science and Machine Learning Courses for more lucrative career options in this landscape. 0.99
Rated 4.5/5 based on 12 customer reviews
16622
Bagging and Random Forest in Machine Learning

In today’s world, innovations happen on a daily ... Read More

Support Vector Machines in Machine Learning

While many classifiers exist that can classify linearly separable data such as logistic regression, Support Vector Machines can handle highly non-linear problems using a kernel trick which implicitly maps the input vectors to higher-dimensional feature spaces. The transformation rearranges the dataset in such a way that it is then linearly solvable. In this article we are going to look at how SVM works, learn about kernel functions, hyperparameters and pros and cons of SVM along with some of the real life applications of SVM. Support Vector Machines (SVMs), also known as support vector networks, are a family of extremely powerful models which use method based learning and can be used in classification and regression problems. They aim at finding decision boundaries that separate observations with differing class memberships. In other words, SVM is a discriminative classifier formally defined by a separating hyperplane.Method Based Learning There are several learning models namely:Association rules basedEnsemble method basedDeep Learning basedClustering method basedRegression Analysis basedBayesian method basedDimensionality reduction based Instance basedKernel method basedLet us understand what Kernel method based learning is all about.In simple terms, a kernel is a similarity function which is fed into a machine learning algorithm. It accepts two inputs and suggests the similarity. For example, suppose we want to classify images, the input data is a key-value pair (image, label). The image data is taken into consideration, features are computed, and a vector of features are fed into the Machine learning algorithm. But in the case of similarity functions, a kernel function can be defined which internally computes the similarity between images, and then feeds into the learning algorithm along with the images and label data. The outcome of this is a classifier. Perceptron frameworks or Support vector machines work with kernels and use vectors only. Here, the machine learning algorithms are expressed as dot products so that kernel functions can be used.Feature vectors generally prefer kernels. Its ease of computing makes it one of the key reasons, also, feature vectors need more storage space in comparison to dot products. You can writeMachine learning algorithms to use dot products and later map them to use kernels. This completely avoids the usage of feature vectors. This allows us to work with highly complex, efficient-to-compute, and yet high performing kernels effortlessly, without really developing multi-dimensional vectors.Kernel functionsLet us understand what kernel functions are: The figure shown below represents a 1D function using a simple 1-Dimensional example. Assume that given points are as follows, it will depict a vertical line and no other vertical lines will separate the dataset.Now, if we consider a 2-Dimensional representation, as shown in the figure below, there is a hyperplane (an arbitrary line in 2-Dimensions) which separates red and blue points, which can be separated using Support Vector Machines.As we keep increasing dimensional space, the need to be able to separate data will eventually decrease. This mapping, x -> (x, x2), is called the kernel function. In case of growing dimensional space, the computations become more complex and kernel trick needs to be applied to address these computations cheaply. What is Support Vector Machine? Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification or regression challenges. However,  it is mostly used in classification problems. In this algorithm, each data is plotted in n-dimensional space (where n is the number of features you have) with the value of each feature being the value of a particular coordinate. After that, we perform classification by locating the hyperplane which differentiates both the classes.Let us create a dataset to understand support vector classification:# importing scikit learn with make_blobs from sklearn.datasets.samples_generator import make_blobs# creating datasets X containing n_samples # Y containing two classes X, Y = make_blobs(n_samples=500, centers=2,        random_state=0, cluster_std=0.40)# plotting scatters plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring'); plt.show()Support vector machine is based on the concept of decision planes that define decision boundaries. A decision plane is one that separates between a set of objects with different class memberships. For example, in the figure mentioned below, there are objects which belong to either class Green or Red. The separating line defines a boundary on the right side of which all objects are Green and to the left of which all objects are Red. Any new object (white circle) falling to the right is labeled, i.e., classified, as Green (or classified as Red should it fall to the left of the separating line).Support vector machines not only draw a line between two classes, but consider a region about the line of some given width. Here’s an example of what it can look like:# creating line space between -1 to 3.5 xfit = np.linspace(-1, 3.5) # plotting scatter plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring') # plot a line between the different sets of data for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:     yfit = m * xfit + b     plt.plot(xfit, yfit, '-k')     plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',     color='#AAAAAA', alpha=0.4)plt.xlim(-1, 3.5); plt.show()Another scenario, where it is clear that a full separation of the Green and Red objects would require a curve (which is more complex than a line). Classification tasks based on drawing separating lines to distinguish between objects of different class memberships are known as hyperplane classifiers. Support Vector Machines are particularly suited to handle such tasks.The figure below shows the basic idea behind Support Vector Machines. Here you will see that the original objects (left side of the schematic) mapped, are rearranged using a set of mathematical functions called kernels. This process of rearranging objects is known as mapping or transformation. You will notice that the right side of the schematic is linearly separable. All we can do is find an optimal line that will separate red and green objects.What is a hyperplane?The goal of Support Vector Machine is to find the hyperplane which separates these two objects or classes. Let us consider another figure which shows some of the possible hyperplanes which can help in separating or dividing the dataset. It is the choice of the best hyperplane which is also the goal. The best hyperplane is defined by the extent to which a maximum margin is left for both classes. The margin is the distance between the hyperplane and the closest point in the classification.Let us consider two hyperplanes among all and then check the margins represented by M1 and M2. You will notice that margin M1 > M2, so the choice of the hyperplane which separates the best one is the new plane between the green and blue planes.How do we find the right hyperplane?Now, let us represent the new plane by a linear equation as: f(x) = ax + bLet us consider that this equation delivers all values ≥ 1 from the green triangle class and ≤ -1 for the gold star class. The distance of this plane from the closest points in both the classes is at least one; the modulus is one. f(x) ≥ 1 for triangles and f(x) ≤ 1 or |f(x)| = 1 for starThe distance between the hyperplane and the point can be computed using the following equation. M1 = |f(x)| / ||a|| = 1 / ||a||The total margin is 1 / ||a|| + 1 / ||a|| = 2 / ||a|. In order to maximize the separability, we will have to maximize the ||a|| value. This particular value is known as a weight vector. We can minimize the weight value which is a non-linear optimization task. One of the methods is to use the Karush-Kuhn-Tucker (KKT) condition, using the Lagrange multiplier λi.What is a support vector in SVM?Let's take an example of two points between the two attributes X and Y. We need to find a point between these two points that has a maximum distance between these points. This requirement is represented in the graph depicted next. The optimal point is depicted using the red circle.The maximum margin weight vector is parallel to the line from (1, 1) to (2, 3). The weight vector is at (1,2), and this becomes a decision boundary that is halfway between and in perpendicular, that passes through (1.5, 2). So, y = x1 +2x2 − 5.5 and the geometric margin is computed as √5. Following are the steps to compute SVMs: With w = (a, 2a) for the functions of the points (1,1) and (2,3) can be represented as shown here: a + 2a + ω0 = -1 for the point (1,1) 2a + 6a + ω0 = 1 for the point (2,3) The weights can be computed as follows:These are the support vectors:Lastly, the final equation is as follows:Large Margin IntuitionIn logistic regression, the output of linear function is taken and the value is squashed within the range of [0,1] using the sigmoid function. If the value is greater than a threshold value, say 0.5, label 1 is assigned else label 0.  In case of support vector machines, the linear function is taken and if the output is greater than 1 and we identify it with one class and if the output is -1, it is identified with another class. Since the threshold values are changed to 1 and -1 in SVM, we obtain this reinforcement range of values([-1,1]) which acts as margin. Cost Function and Gradient UpdatesIn the SVM algorithm, we maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is called the hinge loss.Hinge loss function (function on the left can be represented as a function on the right)   If the predicted value and the actual value are of the same sign, the cost is 0 . If not, we calculate the loss value. We also add a regularization parameter the cost function. The objective of the regularization parameter is to balance the margin maximization and loss. After adding the regularization parameter, the cost functions looks as below.Loss function for SVM  Now that we have the loss function, we take partial derivatives with respect to the weights to find the gradients. Using gradients, we can update our weights.Gradients  When there is no misclassification, i.e our model correctly predicts the class of our data point, we only have to update the gradient from the regularization parameter.Gradient Update — No misclassification  When there is a misclassification, i.e our model makes a mistake on the prediction of the class of our data point, we include the loss along with the regularization parameter to perform gradient update.Gradient Update — Misclassification  Let us start with a code and import the necessary libraries:import pandas as pd  import numpy as np  from sklearn.model_selection import train_test_split  from sklearn.model_selection import cross_val_score, GridSearchCV  from sklearn import metrics  from sklearn.preprocessing import MinMaxScaler  pd.set_option('display.max_columns', None)Read the Wisconsin Breast Cancer dataset using pandas.read_csv function into an object 'data' from the current directorydata = pd.read_csv('wisconsin.csv')After reading the data, we have prepared the data as per requirement. Feature scaling is a method used to standardize the range of independent variables or features of data. The min-max scaling (or min-max normalization) shrinks the range of feature such that the range is in between 0 and 1 (or -1 to 1 if there are negative values).sclr = MinMaxScaler() predictor_sc = sclr.fit_transform(predictor)predictor_sc.shapeSplit the scaled data into train-test split:x_train_sc,x_test_sc, y_train, y_test = train_test_split(predictor_sc, target, test_size = 0.30, random_state=101) print("Scaled train and test split") print("x_train ",x_train_sc.shape) print("x_test ",x_test_sc.shape) print("y_train ",y_train.shape) print("y_test ",y_test.shape)Scaled train and test split x_train  (398, 30) x_test  (171, 30) y_train  (398,) y_test  (171,)But what happens when there is no clear hyperplane? Support Vector Machines can probably help you to find a separating hyperplane but only if it exists. There are certain cases when it is not possible to define a hyperplane, this happens due to noise in the data. Another possible reason can be a non-linear boundary. The first graph below depicts noise and the next one shows a non-linear boundary.There might be cases where there is no possibility to define a hyperplane, which can happen due to noise in the data. In fact, another reason can be a non-linear boundary as well. The following first graph depicts noise and the second one shows a non-linear boundary.For such problems which arise due to noise in the data, the best way is to reduce the margin itself and introduce slack.The non-linear boundary problem can be solved if we introduce a kernel. Some of the kernel functions that can be introduced are mentioned below:A radial basis function is a real-valued function whose value is dependent on the distance between the input and some fixed point. In machine learning, the radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms.The RBF kernel on two samples x and x', represented as feature vectors in some input space, is defined as:Applying SVM with default hyperparametersLet us get back to the example and apply SVM after data pre-processsing with default hyperparameters. Linear Kernelfrom sklearn import svm svm2 = svm.SVC(kernel='linear') svm2 SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False) model2 = svm2.fit(x_train_sc, y_train) y_pred2 = svm2.predict(x_test_sc) print('Accuracy Score’) print(metrics.accuracy_score(y_test,y_pred2))Accuracy Score:0.9707602339181286Gaussian Kernelsvm3 = svm.SVC(kernel='rbf') svm3 SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False) model3 = svm3.fit(x_train_sc, y_train) y_pred3 = svm3.predict(x_test_sc) print('Accuracy Score’) print(metrics.accuracy_score(y_test, y_pred3))Accuracy Score:0.935672514619883Polynomial Kernelsvm4 = svm.SVC(kernel='poly') svm4SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)model4 = svm4.fit(x_train_sc, y_train) y_pred4 = svm4.predict(x_test_sc) print('Accuracy Score’) print(metrics.accuracy_score(y_test,y_pred4)) Accuracy Score:0.6198830409356725How to tune Parameters of SVM? Kernel: Kernel in support vector machine is responsible for the transformation of the input data into the required format. Some of the kernels used in support vector machines are linear, polynomial and radial basis function (RBF). In order to create a non-linear hyperplane, we use RBF and Polynomial function, and for complex applications, you should use more advanced kernels to separate classes that are nonlinear in nature. With this transformation, you can obtain accurate classifiers. Regularization: Using the Scikit-learn’s C parameters and adjusting we can maintain regularization. C denotes a penalty parameter representing an error or any form of misclassification. This misclassification allows you to understand how much of the error is actually bearable. This helps you nullify the compensation between the misclassified term and the decision boundary. With a smaller C value, you obtain hyperplane of small margin and with a larger C value, hyperplane of larger value is obtained. Gamma: Lower value of Gamma creates a loose fit of the training dataset. On the other hand, a high value of gamma allows the model to get fit more appropriately. A low value of gamma will only provide consideration to the nearby points for the calculation of a separate plane. However, the high value of gamma will consider all the data-points to calculate the final separation line. Do we need to tune parameters always?? You do not need to tune parameter in all cases. There are inbuilt functions in sklearn tool kit which can be used. Tuning HyperparametersThe 'C' and 'gamma' hyperparameterC is the parameter for the soft margin cost function, which controls the influence of each individual support vector. This process involves trading error penalty for stability. Small C tends to emphasize the margin while ignoring the outliers in the training data(Soft Margin), while large C may tend to overfit the training data(Hard Margin). Thus for a very large values we can cause overfitting of the model and for a very small value of C we can cause underfitting.Thus the value of C must be chosen in such a manner that it generalises the unseen data well. The gamma parameter is the inverse of the standard deviation of the RBF kernel (Gaussian function), which is used as a similarity measure between two points. A small gamma value define a Gaussian function with a large variance. In this case, two points can be considered similar even if are far from each other. On the other hand, a large gamma value define a Gaussian function with a small variance and in this case, two points are considered similar just if they are close to each other. Taking kernel as linear and tuning C hyperparameterC_range=list(range(1,26)) acc_score=[] for c in C_range: svc = svm.SVC(kernel='linear', C=c) scores = cross_val_score(svc, predictor_sc, target, cv=10, scoring='accuracy') acc_score.append(scores.mean()) print(acc_score) [0.9772210699161695, 0.9772210699161695, 0.9806995938121164, 0.9824539797770286, 0.9789754558810818, 0.9789452078472042, 0.9806995938121164, 0.9789452078472041, 0.9789452078472041, 0.9789452078472041, 0.9806995938121164, 0.9789452078472041, 0.9789452078472041, 0.9772210699161695, 0.9772210699161695, 0.9772210699161695, 0.9772210699161695, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574]Let us visualize the above points:import matplotlib.pyplot as plt %matplotlib inline C_Val_list = list(range(1,26)) plt.plot(C_Val_list,acc_score) plt.xticks(np.arange(0,27,2)) plt.xlabel('Value of C for SVC') plt.ylabel('Cross-Validated Accuracy')From the plot we can see that accuracy has been close to 98% somewhere in between C=4 and C=5 and then it drops.#Taking a close look at the cross-validation accuracy in the range C(4,5) C_range=list(np.arange(4,5,0.2)) acc_score=[] for c in C_range: svc = svm.SVC(kernel='linear', C=c) scores = cross_val_score(svc, predictor_sc, target, cv=10, scoring='accuracy') acc_score.append(scores.mean()) print(acc_score) [0.9824539797770286, 0.9806995938121164, 0.9789754558810818, 0.9789754558810818, 0.9789754558810818] Accuracy score is highest for C=4Taking kernel as gaussian and tuning gamma hyperparametergamma_range=[0.0001,0.001,0.01,0.1,1,10,100] acc_score=[] for g in gamma_range: svc = svm.SVC(kernel='rbf', gamma=g) scores = cross_val_score(svc, predictor_sc, target, cv=10, scoring='accuracy') acc_score.append(scores.mean()) print(acc_score) [0.6274274047186933, 0.6274274047186933, 0.9195035001296346, 0.9561651974764496, 0.9806995938121164, 0.9420026359000951, 0.6274274047186933] Let us visualize the above points: gamma_range=[0.0001,0.001,0.01,0.1,1,10,100]# plotting the value of gamma for SVM versus the cross-validated accuracy plt.plot(gamma_range,acc_score) plt.xlabel('Value of gamma for SVC ') plt.xticks(np.arange(0.0001,100,5)) plt.ylabel('Cross-Validated Accuracy')Text(0,0.5,'Cross-Validated Accuracy')For gamma between 5 and 100 the kernel performs very poorly.Let us take a closer look at the cross-validated accuracy for gamma value in between 0 and 5.gamma_range=list(np.arange(0.1,5,0.1))  acc_score=[] for g in gamma_range:  svc = svm.SVC(kernel='rbf', gamma=g)  scores = cross_val_score(svc, predictor_sc, target, cv=10, scoring='accuracy') acc_score.append(scores.mean())  print(acc_score)[0.9561651974764496, 0.9718952553798289, 0.9754051075965776, 0.9737122979863452, 0.9806995938121164, 0.9806995938121164, 0.9806995938121164, 0.9806995938121164, 0.9806995938121164, 0.9806995938121164, 0.9789754558810818, 0.9754969319851352, 0.9754969319851352, 0.9754969319851352, 0.9754969319851352, 0.9737727940541007, 0.9737727940541007, 0.9737727940541007, 0.9737727940541007, 0.9720184080891883, 0.9720184080891883, 0.9720184080891883, 0.9720184080891883, 0.9720184080891883, 0.9720184080891883, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9666925935528475, 0.9666925935528475, 0.9684167314838821, 0.9684167314838821, 0.9684167314838821, 0.9701711174487941, 0.9701711174487941, 0.96838540316308, 0.9649068792671333, 0.9649068792671333, 0.9649068792671333, 0.9649068792671333, 0.9649068792671333, 0.9649068792671333, 0.963152493302221, 0.963152493302221] gamma_range=list(np.arange(0.1,5,0.1)) plt.plot(gamma_range,acc_score) plt.xlabel('Value of gamma for SVC ') #plt.xticks(np.arange(0.0001,5,5)) plt.ylabel('Cross-Validated Accuracy') Text(0,0.5,'Cross-Validated Accuracy')The highest cross-validated accuracy for rbf kernel remains constant in between gamma=0.5 and gamma=1Taking polynomial kernel and tuning degree hyperparameterdegree=[2,3,4,5,6] acc_score=[] for d in degree: svc = svm.SVC(kernel='poly', degree=d) scores = cross_val_score(svc, predictor_sc, target, cv=10, scoring='accuracy') acc_score.append(scores.mean()) print(acc_score) [0.8350974418805635, 0.6450652493302222, 0.6274274047186933, 0.6274274047186933, 0.6274274047186933] plt.plot(degree,acc_score) plt.xlabel('degrees for SVC ') plt.ylabel('Cross-Validated Accuracy') Text(0,0.5,'Cross-Validated Accuracy')Score is high for second degree polynomial. There is drop in the accuracy score as degree of polynomial increases.Thus increase in polynomial degree results in high complexity of the model. Advantages and Disadvantages of Support Vector MachineAdvantages of SVMSVM Classifiers offer good accuracy and perform faster prediction compared to Naïve Bayes algorithm. SVM guarantees optimality due to the nature of Convex Optimization, the solution will always be global minimum not a local minimum. SVMcan be access it conveniently, be it from Python or Matlab. SVM can be used for both linearly separable as well as non-linearly separable data. Linearly separable data is the hard margin however, non-linearly separable data poses a soft margin. SVM provides compliance to the semi-supervised learning models as well. It can be implemented in both labelled and unlabelled data. The only thing it requires is a condition to the minimization problem which is known as the Transductive SVM. Feature Mapping used to be complex with respect to computation of the overall training performance of the model. With the help of Kernel Trick, SVM can carry out the feature mapping using simple dot product. SVM works well with a clear margin of separation and with high dimensional space.  Disadvantages of SVM SVM is not at all capable of handling text structures. It leads to bad performance as it results in the loss of sequential information. SVM is not suitable for large datasets because of its high training time and it also takes more time in training compared to Naïve Bayes. SVM works poorly with overlapping classes and is also sensitive to the type of kernel used. In cases where the number of features for each data point exceeds the number of training data samples , the SVM under performs. Applications of SVM in Real WorldSupport vector machines depend on supervised learning algorithms. The main goal of using SVM is to classify unseen data correctly. SVMs can be used to solve various real-world problems: Face detection – SVM can be used to classify parts of the image as a face and non-face and create a square boundary around the face. Text and hypertext categorization – SVM allows text and hypertext categorization for both inductive and transductive models. It uses training data for classification of documents into different categories. It categorizes based on the score generated and then compares with the threshold value. Classification of images – SVMs enhances search accuracy for image classification. In comparison to the traditional query-based searching techniques, SVM provides better accuracy. Bioinformatics – It includes classification of proteins and classification of cancer. SVM is used for identifying the classification of genes, patients on the basis of genes and other biological problems. Protein fold and remote homology detection – SVM algorithms are applied for protein remote homology detection. Handwriting recognition –  SVMs are used widely to recognize handwritten characters.  Generalized predictive control(GPC) – You can use SVM based GPC in order to control chaotic dynamics with useful parameters. Summary In this article, we looked at the machine learning algorithm, Support Vector Machine in detail. We have discussed the concept behind support vector machines, how it works, the process of implementation in Python.  We also looked into how to tune its parameters and make efficient models. Lastly, we came across the advantages and disadvantages of SVM along with various real world applications of support vector machines.We have covered most of the topics related to algorithms in our series of machine learning blogs,click here. If you are inspired by the opportunities provided by machine learning, enrol in our  Data Science and Machine Learning Courses for more lucrative career options in this landscape.
Rated 4.0/5 based on 67 customer reviews
27711
Support Vector Machines in Machine Learning

While many classifiers exist that can classify lin... Read More

20% Discount