# Types of Classification in Machine Learning

• by Amit Diwan
• 05th Sep, 2020
• Last updated on 29th Sep, 2020
• In this post, we understand the concept of classification, regression, classification predictive modelling, and the different types of classification and regression
• We understand why and how classification is important.
• We also see a few classification algorithms and their implementations in Python.
• We understand logistic regression, decision trees, random forests, support vector machines, k nearest neighbour and neural networks.
• We understand their inner workings and their prominence.

## Introduction

Classification refers to the process of assigning the items of a given data set to different classes or groups. It is a predictive modelling problem in which every class is given a label to indicate that it is different from the other classes, and the model predicts the label for a given input. Some examples include classifying an email as spam or not spam, and recognizing a handwritten character as one specific character rather than any other.

Classification algorithms are trained on data consisting of many inputs and their respective outputs, from which the model learns. It is important that the training data covers all the kinds of data (options) that could be encountered in the test data set or in the real world.

## Classification

The four prominent types of classification are:

• Binary classification
• Multi-class classification
• Multi-label classification
• Imbalanced classification

### Binary classification

As the name suggests, it deals with the tasks in classification that only have two class labels. Some examples include: email classification as spam or not, whether the price of a stock will go up or go down (ignoring the fact that it could also remain as is), and so on. The value obtained after classifying the data would be either 0 or 1, yes or no, normal or abnormal.

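A binary classification model typically predicts a Bernoulli probability distribution for each data point, i.e. the probability that the point belongs to class 1, which is then used to classify the data as 0 or 1. The Bernoulli distribution is a discrete distribution that covers a binary outcome: a 0 or a 1.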

Algorithms that are used to perform binary classification include the following:

• Logistic regression
• Decision trees
• Support vector machine
• Naïve Bayes
• k-nearest neighbours (KNN)

Code to demonstrate a binary classification task:

from numpy import where
from collections import Counter
from sklearn.datasets import make_blobs
from matplotlib import pyplot

# Generate 560 two-dimensional points grouped around 2 centres
X, y = make_blobs(n_samples=560, centers=2, random_state=1)
print("Data has been generated ")
print("The number of rows and columns are ")
print(X.shape, y.shape)
# Count how many points carry each class label
my_counter = Counter(y)
print(my_counter)
# Print the first 10 points along with their labels
for i in range(10):
    print(X[i], y[i])
# Plot every class as its own scatter series
for my_label, _ in my_counter.items():
    row_ix = where(y == my_label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(my_label))
pyplot.legend()
pyplot.show()

Output:

Data has been generated
The number of rows and columns are
(560, 2) (560,)
Counter({1: 280, 0: 280})
[-9.64384208 -4.14030356] 1
[-0.8821407  4.2877187] 0
… 

Code explanation

• The required packages are imported using ‘import’ statements.
• The dataset is generated using the ‘make_blobs’ function by specifying the number of rows (‘n_samples’) and the number of clusters (‘centers’). The number of columns defaults to 2.
• The number of centres is also the number of classes into which the data points are labelled. Here, it is 2.
• The number of rows and columns is displayed along with a count of the data points in each class.
• A ‘for’ loop is used to print the first 10 data points along with their labels.
• The entire dataset is then plotted as a scatterplot, one series per class, using the ‘scatter’ function of ‘pyplot’, and displayed on the screen.
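To connect the generated data with the Bernoulli view described earlier, here is a minimal sketch that fits one of the listed algorithms (logistic regression) on the same synthetic data. The 75/25 train/test split and the variable names are assumptions made for this illustration; they are not part of the original example.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same synthetic two-class data as in the example above
X, y = make_blobs(n_samples=560, centers=2, random_state=1)

# Assumption: hold out a quarter of the rows to check the model on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is P(label = 1),
# which can be read as the Bernoulli parameter for that row
prob_class_1 = model.predict_proba(X_test)[:, 1]
predicted = model.predict(X_test)

print("Test accuracy:", model.score(X_test, y_test))
print("First five P(label = 1):", prob_class_1[:5])
print("First five predicted labels:", predicted[:5])
```

The predicted label is simply the class with the higher probability, so every row ends up as either 0 or 1.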

### Multi-class classification

It is a type of classification wherein the input data set is classified/labelled into more than 2 classes. Some examples of multi-class classification include:

• Animal species classification
• Facial recognition/classification
• Text translation (special type of multi-class classification task)

This is different from binary classification in that there aren’t just two classes such as 0 or 1, but more, and the labels need not be 0 or 1. They could be names or any other set of discrete values. Every data point is classified into exactly one of the given classes.

The number of class labels may be very large, for example when trying to match a given photo to a specific person. Text translation deals with a similar issue, wherein the word placement may vary widely and there may be thousands of combinations of the same number of words. Multi-class problems are commonly modelled with the Multinoulli (categorical) probability distribution, a discrete probability distribution in which the outcome is one of a fixed set of more than two classes. Many of the algorithms used for binary classification can also be used for multi-class classification, either directly or by combining several binary models (for example, one model per class).

Code to demonstrate the multi-class classification:

from numpy import where
from collections import Counter
from sklearn.datasets import make_blobs
from matplotlib import pyplot

# Generate 670 two-dimensional points grouped around 5 centres
X, y = make_blobs(n_samples=670, centers=5, random_state=1)
print("The dataset has been generated")
print("The rows and columns are ")
print(X.shape, y.shape)
# Count how many points carry each class label
my_counter = Counter(y)
print(my_counter)
# Print the first 10 points along with their labels
for i in range(10):
    print(X[i], y[i])
# Plot every class as its own scatter series
for my_label, _ in my_counter.items():
    row_ix = where(y == my_label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(my_label))
pyplot.legend()
pyplot.show()

Output:

The dataset has been generated
The rows and columns are
(670, 2) (670,)
Counter({3: 134, 0: 134, 2: 134, 4: 134, 1: 134})
[-6.45785776 -3.30981436] 3
[-6.44623696 -2.90184841] 3
[-5.60217602 -0.65990849] 3 

Code explanation:

• The required packages are imported using ‘import’ statements.
• The dataset is generated using the ‘make_blobs’ function by specifying the number of rows (‘n_samples’) and the number of clusters (‘centers’). The number of columns defaults to 2.
• The number of centres is also the number of classes into which the data points are labelled. Here, it is 5.
• The number of rows and columns is displayed along with a count of the data points in each class.
• A ‘for’ loop is used to print the first 10 data points along with their labels.
• The entire dataset is then plotted as a scatterplot, one series per class, using the ‘scatter’ function of ‘pyplot’, and displayed on the screen.
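As a rough illustration of the Multinoulli idea, the sketch below fits a logistic regression model on the same five-class data and prints the predicted class probabilities for a few rows. The choice of model and the ‘max_iter’ value are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Same synthetic five-class data as in the example above
X, y = make_blobs(n_samples=670, centers=5, random_state=1)

# Assumption: max_iter is raised only to give the solver room to converge
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# One probability per class for each row: a categorical distribution
class_probs = model.predict_proba(X[:3])
print(np.round(class_probs, 3))
print("Row sums:", class_probs.sum(axis=1))
print("Predicted labels:", model.predict(X[:3]))
```

Every row of probabilities sums to 1, and the predicted label is the class with the largest probability.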

### Multi-label classification

Multi-label classification refers to those classification problems in which more than one class can be assigned to a single data point, i.e. a data point may carry several labels at once. A simple example would be a photo that contains multiple people, not just one. Such a photo would be labelled with every person who appears in it. This is different from binary and multi-class classification, where every data point is assigned exactly one label.

Some multi-label classification algorithms include:

• Multi-label random forests

Code to demonstrate multi-label classification:

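from sklearn.datasets import make_multilabel_classification

# Generate 800 points with 2 features; each point can carry several of the 5 labels
X, y = make_multilabel_classification(n_samples=800, n_features=2, n_classes=5, n_labels=3, random_state=1)
print("The number of rows and columns are ")
print(X.shape, y.shape)
# Print the first 8 points along with their label vectors
for i in range(8):
    print(X[i], y[i])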

Output:

The number of rows and columns are
(800, 2) (800, 5)
[22. 24.] [1 0 0 1 1]
[12. 35.] [0 1 0 1 0]
[27. 30.] [1 1 0 0 1]
..  

Code explanation

• The required packages are imported using ‘import’ statements.
• The dataset is generated using the ‘make_multilabel_classification’ function present in the scikit-learn package.
• It is done by specifying the number of rows (‘n_samples’), the number of columns (‘n_features’), the number of classes (‘n_classes’) and the average number of labels per data point (‘n_labels’).
• The number of rows and columns of the inputs and of the label matrix is displayed.
• A ‘for’ loop is used to print the first 8 data points. Every label vector has one 0/1 entry per class, and more than one entry can be 1.
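The multi-label random forest mentioned above can be tried on the same data. This is a minimal sketch, assuming scikit-learn's ‘RandomForestClassifier’, which accepts a 0/1 label matrix with one column per class; the number of trees is an arbitrary choice for the example.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier

# Same synthetic multi-label data as in the example above
X, y = make_multilabel_classification(
    n_samples=800, n_features=2, n_classes=5, n_labels=3, random_state=1)

# Assumption: 50 trees, chosen only to keep the example quick
forest = RandomForestClassifier(n_estimators=50, random_state=1)
forest.fit(X, y)  # y has one 0/1 column per label

# Each prediction is a whole label vector, not a single class
predicted = forest.predict(X[:3])
print(predicted)
```

Each predicted row has five entries, and any number of them can be 1 at the same time.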

### Imbalanced classification

This is a type of classification wherein the data points are not distributed equally across the classes. Imbalanced classification is typically a binary classification problem in which one class (the majority) contains an extremely large number of data points, and the other class (the minority) contains a very small number.

Examples of imbalanced classification problem include:

• Fraud detection in credit cards
• Anomaly detection in the given dataset

Standard algorithms tend to favour the majority class on such data, so specialized, cost-sensitive versions that penalize mistakes on the minority class more heavily are used. Some of these algorithms have been listed below:

• Cost sensitive decision trees
• Cost sensitive logistic regression
• Cost sensitive support vector machines

Code to demonstrate imbalanced binary classification

# An example of imbalanced binary classification task
from numpy import where
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
# The dataset is defined: about 99% of the points go to class 0
X, y = make_classification(n_samples=800, n_features=2, n_informative=2, n_redundant=0, n_classes=2, n_clusters_per_class=1, weights=[0.99,0.01], random_state=1)
# The shape of the dataset is summarized
print("The number of rows and columns ")
print(X.shape, y.shape)
# The labelled data is summarized
my_counter = Counter(y)
print(my_counter)
# A few data points are summarized
for i in range(10):
    print(X[i], y[i])
# The dataset is plotted on a graph and displayed
for my_label, _ in my_counter.items():
    row_ix = where(y == my_label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(my_label))
pyplot.legend()
pyplot.show()

Output:

The number of rows and columns
(800, 2) (800,)
Counter({0: 785, 1: 15})
[0.28622882 0.38305399] 0
[1.17971415 0.48003249] 0
[1.32658794 0.71712275] 0 

Code explanation

• The required packages are imported using ‘import’ statements.
• The dataset is generated using the ‘make_classification’ function present in the scikit-learn package.
• It is done by specifying the number of rows (‘n_samples’) and columns (‘n_features’) that need to be generated. The ‘weights’ argument assigns roughly 99% of the points to class 0 and 1% to class 1.
• The number of rows and columns is displayed along with a count of the data points in each class.
• A ‘for’ loop is used to print the first 10 data points along with their labels.
• The entire dataset is then plotted as a scatterplot, one series per class, using the ‘scatter’ function of ‘pyplot’, and displayed on the screen.
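Cost-sensitive logistic regression, one of the algorithms listed above, can be sketched with the ‘class_weight’ option of scikit-learn's ‘LogisticRegression’. Using ‘class_weight="balanced"’ is one possible way to make errors on the minority class more costly; it is an assumption of this example rather than the only approach. For brevity, the models are scored on the training data itself.

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Same imbalanced data as in the example above
X, y = make_classification(n_samples=800, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, n_clusters_per_class=1,
                           weights=[0.99, 0.01], random_state=1)

# A plain model and a cost-sensitive one that weights classes
# inversely to their frequency
plain = LogisticRegression().fit(X, y)
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

recalls = {}
for name, model in [("plain", plain), ("cost-sensitive", weighted)]:
    predicted = model.predict(X)
    # Recall on class 1: the share of minority points that were found
    recalls[name] = recall_score(y, predicted)
    print(name, "predicted counts:", Counter(predicted.tolist()),
          "minority recall:", recalls[name])
```

Comparing the two lines of output shows how the weighting changes the number of points predicted as the minority class.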

## Logistic regression

Unlike linear regression, which predicts continuous values, logistic regression is concerned with discrete values. It is a classification technique that estimates the probability that a data point belongs to a class and assigns the point to one of the labelled classes accordingly. Usually, we are looking at a Boolean output, wherein the result is either 0 or 1, yes or no, and so on. Some examples include:

• Classifying an email as spam or not
• Finding whether it would rain today or not
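The step from a continuous score to a 0/1 output can be sketched in a few lines. The logistic (sigmoid) function squeezes any real number into the range 0 to 1, and a threshold turns that probability into a label. The weights, bias and threshold below are hypothetical, hand-picked values, not coefficients learnt from data.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted label) for one data point."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)

# Hypothetical, hand-picked coefficients (not learnt from data)
weights = [1.5, -2.0]
bias = 0.25

for point in ([2.0, 0.5], [0.0, 1.0]):
    probability, label = predict_label(point, weights, bias)
    print(point, round(probability, 3), label)
```

In a real model, the weights and the bias would be learnt from the training data instead of being chosen by hand.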

## Naïve Bayes classification

Bayes’ theorem is a way of calculating the probability of a hypothesis (a situation which might not have occurred in reality) based on our previous experience and the knowledge we have gained from it.

Bayes theorem is stated as follows:

P(hypo | data) = (P(data | hypo) * P(hypo)) / P(data)

In the above equation,

P(hypo | data) is the probability of a hypothesis ‘hypo’ when data ‘data’ is given, which is also known as posterior probability.

P(data | hypo) is the probability of data ‘data’ when the specific hypothesis ‘hypo’ is known to be true.

P(hypo) is the probability of a hypothesis ‘hypo’ being true (irrespective of the data in hand), which is also known as prior probability of ‘hypo’.

P(data) is the probability of the data (irrespective of the hypothesis).

The idea here is to get the value of the posterior probability, given other data. The posterior probability for a variety of different hypotheses is found out, and the probability that has the highest value is selected. This is known as the maximum probable hypothesis, and is also known as maximum a posteriori (MAP) hypothesis.

MAP(hypo) = max(P(hypo | data))

If the value of P(hypo | data) is replaced with the value we saw before, the equation would become:

MAP(hypo) = max((P(data | hypo) * P(hypo)) / P(data))

P(data) is a normalizing term that makes the posterior probabilities sum to 1. It is the same for every hypothesis, so it can be ignored when only the most probable hypothesis is needed.

Naïve Bayes classifier is an algorithm that can be used with binary or multi-class classification problems. It is called naïve because it assumes that the input values are independent of each other given the class. Once a Naïve Bayes classifier has learnt from the data, it stores a list of probabilities: the ‘class probability’ of every class, and the ‘conditional probability’ of every input value given each class. Training such a model is quick, since only these probabilities need to be determined, and this doesn’t involve any optimization process or fitting of coefficients.
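The MAP calculation described above can be worked through with a small example. All the probabilities below are hypothetical numbers chosen only for illustration: two hypotheses (‘spam’ and ‘not spam’) and a single piece of data (a particular word appears in the email).

```python
# Hypothetical probabilities, chosen only for illustration
prior = {"spam": 0.4, "not spam": 0.6}        # P(hypo)
likelihood = {"spam": 0.7, "not spam": 0.1}   # P(data | hypo)

# Numerator of Bayes' theorem: P(data | hypo) * P(hypo)
unnormalized = {h: likelihood[h] * prior[h] for h in prior}

# P(data) is the same for every hypothesis; dividing by it only
# rescales the values so that they sum to 1
evidence = sum(unnormalized.values())
posterior = {h: value / evidence for h, value in unnormalized.items()}

# The MAP hypothesis is the one with the highest posterior probability
map_hypothesis = max(posterior, key=posterior.get)

print("Posterior:", posterior)
print("MAP hypothesis:", map_hypothesis)
```

Picking the largest unnormalized value gives the same answer, which is why P(data) can be left out when only the MAP hypothesis is needed.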

## K-nearest neighbour (KNN)

The simplest way to understand k-nearest neighbour is that the model is the training data in its entirety. KNN has no model other than the stored dataset, which means no explicit training step takes place. Instead, KNN makes its predictions directly from the training dataset.

When a new data point is encountered, KNN searches the entire training dataset for the ‘k’ most similar instances (the neighbours). Once the ‘k’ neighbours have been identified, their output values are summarized. In case of regression, the mean of these outputs is the result, and in case of classification, the mode (the most common class) is the result.

### How to determine the ‘k’ neighbours?

To find ‘k’ number of instances from the training dataset that are very similar to the new data point, we use a distance factor, and the most popular metric is the Euclidean distance.

The Euclidean distance between the new point ‘a’ and an existing point ‘b’ in the data set is the square root of the sum of the squared differences between their values, where the sum runs over all the input attributes ‘i’.

Euclidean Distance:

distance(a, b) = square root(sum((a_i – b_i) ^ 2))

Other distances that can be used include:

• Hamming distance
• Manhattan distance
• Minkowski Distance

As the number of data points in the training set increases, the cost of making a prediction with KNN also increases, since the new point has to be compared with every stored point.
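The whole procedure fits in a few lines of plain Python. The following is a minimal sketch for classification using the Euclidean distance; the tiny dataset, the class names and the choice of k = 3 are made up for the example.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training_points, training_labels, new_point, k=3):
    """Return the most common label among the k nearest training points."""
    # Every stored point is compared with the new point
    distances = sorted(
        (euclidean_distance(point, new_point), label)
        for point, label in zip(training_points, training_labels))
    nearest_labels = [label for _, label in distances[:k]]
    # Mode of the neighbours' labels
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up training data: two small groups of points
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_classify(points, labels, (2.0, 2.0)))
print(knn_classify(points, labels, (6.0, 6.5)))
```

For regression, the last line of ‘knn_classify’ would return the mean of the neighbours' values instead of the mode.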

## Support vector machines (SVM)

The hyperplane in a linear SVM is learnt by transforming the problem using linear algebra. The key insight is that a linear SVM can be expressed entirely in terms of the inner product of pairs of input data points, rather than the data points themselves. The inner product of two vectors is the sum of the products of their corresponding values.

To find inner product of two input vectors:

[a,b] and [c,d], we do [a*c + b*d]

In order to predict the output for a new input, the dot product between the new input and every support vector is used, as in the equation below:

f(x) = coeff-1 + sum(coeff-2 * (a,b))

Here, ‘a’ is the new input vector (x), ‘b’ is a support vector from the training data, (a,b) is their inner product, and the sum runs over all the support vectors. coeff-1 (a single value) and coeff-2 (one value per support vector) are coefficients that are determined from the training dataset by the learning algorithm. Stochastic gradient descent or the sequential minimal optimization technique can be used for this. Sequential minimal optimization, for instance, breaks the main problem down into sub-problems, each of which is then solved separately.
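The prediction equation can be checked against a fitted model. This sketch assumes scikit-learn's ‘SVC’ with a linear kernel, where ‘intercept_’ plays the role of coeff-1 and ‘dual_coef_’ holds one coeff-2 value per support vector; the dataset is a small synthetic one chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Small synthetic two-class dataset (an assumption for this example)
X, y = make_blobs(n_samples=100, centers=2, random_state=1)
model = SVC(kernel="linear").fit(X, y)

new_points = X[:5]

# Inner product of every new point with every support vector
inner_products = new_points @ model.support_vectors_.T

# f(x) = coeff-1 + sum(coeff-2 * inner product)
manual = model.intercept_[0] + inner_products @ model.dual_coef_[0]
library = model.decision_function(new_points)

print("Computed by hand:", manual)
print("decision_function:", library)
print("Match:", np.allclose(manual, library))
```

The sign of f(x) decides on which side of the hyperplane the new point falls, and therefore which class it is given.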

## Decision trees

Decision trees are a predictive modelling technique in machine learning and are considered one of the most powerful algorithms. They are also known as CART, i.e. classification and regression trees, since they can be used for classification as well as regression tasks. A decision tree can be visualized as a binary tree that has a root, many branches and leaves. It is the same as the tree data structure. The root and every other internal node test a single input value, and the branches lead to leaves that hold the value to be predicted for the given input.

The tree structure can be stored in the form of a graph structure or a set of rules. Once the tree is available, it is simple to make predictions with it: starting from the root, the branch that matches the input is followed at every node until a leaf node is reached.

Data is filtered down from the root of the tree and ends up in the branch and leaf that are relevant to it.

Little data preparation or pre-processing is required while working with CART or decision trees.
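To see a tree stored as a set of rules, the sketch below fits a shallow tree and prints it with scikit-learn's ‘export_text’. The depth limit and the feature names ‘x0’ and ‘x1’ are assumptions made to keep the printed rules short.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic two-class data, generated as in the binary classification example
X, y = make_blobs(n_samples=560, centers=2, random_state=1)

# Assumption: the depth is capped at 2 so that the rule set stays readable
tree = DecisionTreeClassifier(max_depth=2, random_state=1)
tree.fit(X, y)

# The fitted tree written out as nested if/else rules
rules = export_text(tree, feature_names=["x0", "x1"])
print(rules)

# A prediction follows the rules from the root down to a leaf
print("Prediction for the first row:", tree.predict(X[:1])[0])
```

Each indented line in the printed rules is one branch, and the lines that name a class are the leaves.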

Gradient boosting is a method to build predictive models in machine learning, and it commonly uses decision trees as its building blocks. The idea behind boosting is to understand whether a weak learning algorithm can be made to learn better. This involves three elements:

1. A weak learning algorithm that makes predictions: The decision tree is used as the weak learner in gradient boosting. A weak learner is one that performs only slightly better than random guessing. The trees are built by choosing the best splits, i.e. those that minimize the loss.
2. A loss function that needs to be optimized: The choice depends on the problem in hand. Many different loss functions can be used, such as squared error, mean squared error, logarithmic loss and so on. A new boosting algorithm won’t have to be derived for every loss function.
3. An additive model that adds weak learners to minimize the loss function: The trees are added to the model one at a time, and the trees already in the model are not changed. Every new tree is chosen so that the loss is reduced when it is added. Usually, the gradient descent optimization technique is used to minimize the loss.
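The three elements above can be put together in a short sketch. It is a simplified illustration, not a full gradient boosting implementation: it assumes the squared-error loss on a made-up regression problem, uses one-split trees as weak learners, and picks the learning rate and the number of trees arbitrarily.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up regression data: a noisy sine curve
rng = np.random.default_rng(1)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1                       # assumption: arbitrary small step
prediction = np.full_like(y, y.mean())    # start from a constant model
losses = [np.mean((y - prediction) ** 2)]
trees = []

for _ in range(50):
    # For squared error, the negative gradient is proportional to the residual
    residuals = y - prediction
    stump = DecisionTreeRegressor(max_depth=1)   # weak learner: a single split
    stump.fit(X, residuals)
    trees.append(stump)                   # earlier trees are left unchanged
    prediction = prediction + learning_rate * stump.predict(X)
    losses.append(np.mean((y - prediction) ** 2))

print("Loss before boosting:", losses[0])
print("Loss after 50 trees:", losses[-1])
```

Each tree on its own is a poor model, but the loss on the training data shrinks as the trees are added one at a time.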

## Random forest

Random forest is an ensemble machine learning algorithm that uses bootstrap aggregation, or bagging. The bootstrap is a statistical method that helps in estimating a quantity from a given data sample. Bagging applies it to reduce the variance of those algorithms that have a high variance, such as decision trees (CART). Decision trees are extremely sensitive to the data on which they are trained. If the training data changes, the resultant tree could be completely different. A small change in the input can make a huge difference to the overall training and output.

An ensemble method is one that combines the predictions of many different machine learning models, with the aim of making predictions that are more accurate than those of any single model.

Random forest makes sure that every sub-tree that trains on the data and makes predictions is less correlated with the other sub-trees. At every split, the learning algorithm is limited to a random sample of the input variables (columns), so that it doesn’t have the opportunity to look through all the variables and select the optimal split point (which is what CART does). For classification trees, a good value for the number of randomly selected columns is the square root of p, where p refers to the number of input variables. For regression trees, a good value is p/3.
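A single tree and a forest can be compared side by side. This sketch assumes scikit-learn's ‘RandomForestClassifier’ with ‘max_features="sqrt"’, which corresponds to the square root rule above; the dataset with p = 16 input variables is synthetic and chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with p = 16 input variables (an assumption for this example)
X, y = make_classification(n_samples=600, n_features=16, n_informative=6,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# One tree that may look at all 16 variables at every split
single_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# 100 trees, each trained on a bootstrap sample and limited to
# sqrt(p) randomly chosen variables at every split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=1).fit(X_train, y_train)

tree_score = single_tree.score(X_test, y_test)
forest_score = forest.score(X_test, y_test)
print("Single tree accuracy:", tree_score)
print("Random forest accuracy:", forest_score)
print("Variables considered per split:", forest.estimators_[0].max_features_)
```

The exact accuracies depend on the data, so the comparison is best repeated on the dataset at hand.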

## Neural networks

Neural networks, or artificial neural networks (ANN), form the basis of deep learning. The word ‘neural’ or ‘neuro’ refers to the neurons of the human brain. The idea behind an artificial neural network is that it takes decisions in a way that is loosely similar to how the neurons in the brain function while performing a task or taking a decision.

It is called deep learning since these networks have several layers, and every layer can have a large number of nodes. Every layer processes the data it receives and passes on the computed data to the next layer, so the input data to one layer is the output data of the previous layer. Usually, the input layer has one node per input value, and for a binary classification task the output layer can have just one node that gives the final result.
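The way data flows from one layer to the next can be sketched with NumPy. The network below is untrained: its layer sizes are hypothetical and its weights are random, so it only illustrates the forward pass, not learning. The sigmoid activation is also an assumption of this example.

```python
import numpy as np

def sigmoid(z):
    """Squash every value into the range 0 to 1."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Hypothetical layer sizes: 4 inputs, hidden layers of 8 and 5 nodes, 1 output
layer_sizes = [4, 8, 5, 1]
weights = [rng.normal(size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(batch):
    """Pass a batch of rows through every layer in turn."""
    activation = batch
    for w, b in zip(weights, biases):
        # The output of one layer is the input of the next
        activation = sigmoid(activation @ w + b)
    return activation

batch = rng.normal(size=(3, 4))   # 3 made-up data points with 4 values each
output = forward(batch)
print(output.shape)
print(output.ravel())
```

Training would adjust the weights and biases so that the single output node gives useful probabilities.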

## Conclusion

In this post, we understood how classification works, the different types of classification, and the inner workings of several classification algorithms. We also generated simple datasets and worked through them using Python and other relevant machine learning packages.

### Amit Diwan

Author

Amit Diwan is an E-Learning Entrepreneur, who has taught more than a million professionals with Text & Video Courses on the following technologies: Data Science, AI, ML, C#, Java, Python, Android, WordPress, Drupal, Magento, Bootstrap 4, etc.
