What is Logistic Regression in Machine Learning


Every machine learning algorithm performs best under a given set of conditions. To ensure good performance, we must know which algorithm to use depending on the problem at hand. You cannot just use one particular algorithm for all problems. For example: Linear regression algorithm cannot be applied on a categorical dependent variable. This is where Logistic Regression comes in.

Logistic Regression is a popular statistical model used for binary classification, that is, for predictions of the type this or that, yes or no, A or B. It can also be extended to multiclass classification, but here we will focus on its simplest application. It is one of the most frequently used machine learning algorithms for binary classification, mapping each input to one of two classes labeled 0 or 1. For example:

  • 0: negative class
  • 1: positive class

Some examples of classification are mentioned below:

  • Email: spam / not spam
  • Online transactions: fraudulent / not fraudulent
  • Tumor: malignant / not malignant

Let us look at the issues we encounter in Linear Regression.

Issue 1 of Linear Regression

As you can see in the graph below, a single additional data point on the extreme right makes the fitted line less steep. With the same cut-off on the predicted value, some malignant tumors then fall on the wrong side of it, so the prediction leaves them out.

[Figure: Linear Regression issue in Machine Learning]

Issue 2 of Linear Regression

  • The hypothesis can be larger than 1 or smaller than 0, which cannot be read as a probability
  • Hence, we have to use logistic regression
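To make the two issues concrete, here is a minimal sketch in plain Python. The tumor sizes, the labels and the extra point at size 20 are made-up values for illustration, and `fit_line` and `predict` are hypothetical helper names, not part of any library. The sketch fits a least-squares line to 0/1 labels, then checks how one far-right point flattens the line (Issue 1) and how the fitted line can return values outside the 0 to 1 range (Issue 2).

```python
# Hypothetical toy data: tumor size and label (0 = not malignant, 1 = malignant).
sizes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [0, 0, 0, 1, 1, 1]


def fit_line(xs, ys):
    """Closed-form simple least squares; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(line, x):
    """Evaluate the fitted line at x."""
    slope, intercept = line
    return intercept + slope * x


base = fit_line(sizes, labels)
# Same data plus one malignant tumor far to the right (assumed outlier).
with_outlier = fit_line(sizes + [20.0], labels + [1])

# Issue 1: the outlier flattens the line, so a malignant tumor of size 4
# moves from above 0.5 to below 0.5.
print("slope without / with outlier:", base[0], with_outlier[0])
print("h(4) without / with outlier:", predict(base, 4.0), predict(with_outlier, 4.0))

# Issue 2: the line is unbounded, so predictions leave the [0, 1] range.
print("h(0.5):", predict(base, 0.5))
print("h(9.0):", predict(base, 9.0))
```

With this toy data the size-4 tumor scores above 0.5 before the outlier is added and below 0.5 afterwards, while sizes 0.5 and 9 produce values below 0 and above 1 respectively.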

What is Logistic Regression?

Logistic Regression is the appropriate regression analysis to conduct when the dependent variable is binary. Like all other types of regression, it is a predictive analysis. It is used to evaluate the relationship between one binary dependent variable and one or more independent variables. Its output is a probability between 0 and 1, which is then mapped to one of two discrete classes.

A simple example of Logistic Regression is: Do calorie intake, weather, and age have any influence on the risk of having a heart attack? The question has a discrete answer, either “yes” or “no”.

Logistic Regression Hypothesis

The logistic regression classifier can be derived by analogy to the linear regression hypothesis which is:

$$h_\theta(x) = \theta^T x$$
Linear regression hypothesis

However, the logistic regression hypothesis generalizes from the linear regression hypothesis in that it uses the logistic function:

$$g(z) = \frac{1}{1 + e^{-z}}$$

The result is the logistic regression hypothesis:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
Logistic regression hypothesis

The function g(z) is the logistic function, also known as the sigmoid function.

The logistic function has asymptotes at 0 and 1, and it crosses the y-axis at 0.5.

[Figure: Logistic function]
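The properties of the logistic function described above can be checked with a few lines of Python. This is only a sketch: the name `sigmoid` and the sample inputs are our own choices, and the two-branch form is one common way to avoid overflow for large negative inputs.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)  # z < 0 here, so exp(z) is at most 1 and cannot overflow
    return exp_z / (1.0 + exp_z)


# The output approaches 0 on the far left, 1 on the far right,
# and is exactly 0.5 where the curve crosses the y-axis (z = 0).
for z in (-10, -2, 0, 2, 10):
    print(f"g({z}) = {sigmoid(z):.5f}")
```

Whatever real number goes in, the result stays between 0 and 1, which is what lets us read it as a probability.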

How does Logistic Regression work?

Logistic Regression does not use a linear function as its hypothesis. Instead, it passes the linear combination of the inputs through the ‘sigmoid function’, also known as the ‘logistic function’, and it uses a more complex cost function than Linear Regression.

The hypothesis of logistic regression limits the output to values between 0 and 1. A linear function cannot represent this, because it can take a value greater than 1 or less than 0, which is not possible under the hypothesis of logistic regression.

The sigmoid function maps any real value to a value between 0 and 1. In machine learning, we use the sigmoid to map predictions to probabilities.

Formula:

$$f(x) = \frac{1}{1 + e^{-x}}$$

Where,

f(x) = output between 0 and 1 (probability estimate)
x = input to the function
e = base of natural log

Decision Boundary

The prediction function returns a probability score between 0 and 1. To map this score to a discrete class (true/false, yes/no), you select a threshold value: scores at or above the threshold are assigned to class 1, and scores below it to class 0.

p ≥ 0.5 → class = 1
p < 0.5 → class = 0

For example, suppose the threshold value is 0.5. If your prediction function returns 0.7, the input is classified as positive. If it returns 0.2, which is less than the threshold value, the input is classified as negative. For logistic regression with multiple classes, we could select the class with the highest predicted probability.

[Figure: Decision boundary]
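The thresholding rule can be sketched in a few lines of Python. The function names `classify` and `classify_multiclass` and the multiclass probabilities are hypothetical, chosen only to illustrate the rule described in this section.

```python
def classify(probability, threshold=0.5):
    """Map a probability score to class 1 (at or above threshold) or class 0."""
    return 1 if probability >= threshold else 0


def classify_multiclass(probabilities):
    """Pick the label with the highest predicted probability."""
    return max(probabilities, key=probabilities.get)


print(classify(0.7))                  # score above the 0.5 threshold
print(classify(0.2))                  # score below the 0.5 threshold
print(classify(0.7, threshold=0.8))   # same score, stricter threshold

# Made-up probabilities for three unordered classes.
print(classify_multiclass({"disease A": 0.2, "disease B": 0.5, "disease C": 0.3}))
```

Note that the same score of 0.7 changes class when the threshold is raised to 0.8, so the threshold is a choice we make, not something the model learns.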

Our aim is to choose the model parameters that maximize the likelihood of the observed class labels; this approach is called Maximum Likelihood Estimation. Maximum Likelihood Estimation is a general approach to estimating parameters in statistical models. The likelihood can be maximized using an optimization algorithm. Newton’s Method is one such algorithm, and it can be used to find the maximum (or minimum) of many different functions, including the likelihood function. Besides Newton’s Method, you can also use Gradient Descent.
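As a rough illustration of maximizing the likelihood with Newton’s Method, the sketch below fits an intercept and one slope in plain Python. The data is made up and deliberately not perfectly separable, and `log_likelihood`, `gradient_and_hessian` and `fit_newton` are hypothetical helper names. It is a minimal two-parameter version under those assumptions, not how any particular library implements the fit.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy data: one feature, binary label, not perfectly separable.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]


def log_likelihood(b0, b1):
    """Log-likelihood of the labels under p = sigmoid(b0 + b1 * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def gradient_and_hessian(b0, b1):
    """First and second derivatives of the log-likelihood."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        g0 += y - p
        g1 += (y - p) * x
        w = p * (1 - p)
        h00 -= w
        h01 -= w * x
        h11 -= w * x * x
    return (g0, g1), (h00, h01, h11)


def fit_newton(iterations=10):
    """Run a fixed number of Newton steps starting from zero."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        (g0, g1), (h00, h01, h11) = gradient_and_hessian(b0, b1)
        det = h00 * h11 - h01 * h01
        # Newton update: theta <- theta - H^{-1} g, with the 2x2 inverse written out.
        b0 -= (h11 * g0 - h01 * g1) / det
        b1 -= (h00 * g1 - h01 * g0) / det
    return b0, b1


b0, b1 = fit_newton()
print("b0, b1:", b0, b1)
print("log-likelihood at start:", log_likelihood(0.0, 0.0))
print("log-likelihood at fit:  ", log_likelihood(b0, b1))
```

At the fitted parameters the gradient of the log-likelihood is essentially zero, which is the condition for a maximum of this concave function.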

Cost Function

We have covered the Cost Function earlier in the blog on Linear Regression. In brief, a cost function is created for optimization purposes: we minimize it to obtain a model with minimum error.

The cost function for Logistic Regression is:

  • Cost(hθ(x),y) = −log(hθ(x))   if y = 1
  • Cost(hθ(x),y) = −log(1−hθ(x))   if y = 0

The above functions can be written together as:

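$$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$

Averaging this over all m training examples gives the overall cost J(θ):

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$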
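A short Python sketch can confirm that the two-case cost and the single combined expression −y·log(h) − (1 − y)·log(1 − h) agree. The function names and the sample predictions are our own, and averaging over the m examples is the usual convention that we assume here.

```python
import math


def cost_piecewise(h, y):
    """Cost for one example, written as two separate cases."""
    return -math.log(h) if y == 1 else -math.log(1 - h)


def cost_combined(h, y):
    """The same cost written as one expression; y selects the active term."""
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))


def total_cost(predictions, targets):
    """Average cost over all examples (assumed convention: divide by m)."""
    m = len(targets)
    return sum(cost_combined(h, y) for h, y in zip(predictions, targets)) / m


for h, y in [(0.9, 1), (0.9, 0), (0.1, 1), (0.1, 0)]:
    print(f"h={h}, y={y}: piecewise={cost_piecewise(h, y):.4f}, "
          f"combined={cost_combined(h, y):.4f}")

print("average cost:", total_cost([0.9, 0.2, 0.7], [1, 0, 1]))
```

The cost is small when a confident prediction is right (h = 0.9, y = 1) and large when it is wrong (h = 0.9, y = 0), which is exactly the penalty we want.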

Gradient Descent

After finding the cost function for Logistic Regression, our job is to minimize it, i.e. min J(θ). The cost function can be minimized using Gradient Descent.

The general form of gradient descent, where α is the learning rate:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

The derivative part can be solved using calculus so the equation comes to:

Gradient Descent in Machine Learning
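A minimal sketch of batch gradient descent for a model with one feature and an intercept is shown below. The toy data (hours studied versus pass/fail), the learning rate and the iteration count are all hypothetical values chosen for illustration, and the update uses the averaged gradient of the logistic cost, (prediction - label) * feature.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, iterations=5000):
    """Fit h(x) = sigmoid(theta0 + theta1 * x) by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Prediction error for every training example.
        errors = [sigmoid(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        # Averaged gradient of the cost with respect to each parameter.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1

# Hypothetical toy data: the two classes overlap, so a finite optimum exists.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

theta0, theta1 = fit_logistic(hours, passed)
p_low = sigmoid(theta0 + theta1 * 1.0)    # predicted probability for 1 hour
p_high = sigmoid(theta0 + theta1 * 4.0)   # predicted probability for 4 hours
print(theta0, theta1, p_low, p_high)
```

The fitted slope is positive, so the predicted probability of passing rises with the number of hours.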

When to use Logistic Regression?

Logistic Regression is used when the input needs to be separated into “two regions” by a linear boundary. The data points are separated using a straight line as shown:

When to use Logistic Regression in Machine Learning

Based on the number of categories, Logistic regression can be classified as:

  1. binomial: target variable can have only 2 possible types: “0” or “1” which may represent “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc.
  2. multinomial: target variable can have 3 or more possible types which are not ordered (i.e. types have no quantitative significance) like “disease A” vs “disease B” vs “disease C”.
  3. ordinal: it deals with target variables with ordered categories. For example, a test score can be categorized as:“very poor”, “poor”, “good”, “very good”. Here, each category can be given a score like 0, 1, 2, 3.

Let us explore the simplest form of Logistic Regression, i.e. Binomial Logistic Regression. It can be used to solve a classification problem in which the y-variable takes on only two values. Such a variable is said to be a “binary” or “dichotomous” variable. “Dichotomous” basically means two categories such as yes/no, defective/non-defective, success/failure, and so on. “Binary” refers to the 0’s and 1’s.

Linear vs Logistic Regression


| Aspect | Linear Regression | Logistic Regression |
| --- | --- | --- |
| Outcome | The outcome (dependent variable) is continuous. It can have any one of an infinite number of possible values. | The outcome (dependent variable) has only a limited number of possible values. |
| The dependent variable | Used when the response variable is continuous, for instance weight, height, number of hours, etc. | Used when the response variable is categorical in nature, for instance yes/no, true/false, red/green/blue, 1st/2nd/3rd/4th, etc. |
| The independent variable | The independent variables may be correlated with each other, although strong multicollinearity makes the coefficient estimates unstable. | The independent variables should not be highly correlated with each other (no multicollinearity). |
| Equation | Gives an equation of the form Y = mX + C, i.e. an equation of degree 1. | Gives an equation of the form Y = 1 / (1 + e^-(mX + C)), i.e. the sigmoid of a linear expression. |
| Coefficient interpretation | The interpretation is quite straightforward: holding all other variables constant, a unit increase in this variable is expected to increase or decrease the dependent variable by the value of the coefficient. | A coefficient gives the change in the log-odds of the outcome for a unit increase in the variable, holding all other variables constant. More generally, the interpretation depends on the family (binomial, Poisson, etc.) and link (log, logit, inverse-log, etc.) you use. |
| Error minimization technique | Uses the ordinary least squares method to minimise the errors and arrive at the best possible fit. | Uses the maximum likelihood method to arrive at the solution. With the logistic loss function, the penalty for large errors grows only about linearly (its gradient approaches a constant) rather than quadratically. |


Graphical Representation between Linear and Logistic Regression

How is OLS different from MLE?

Linear regression is estimated using Ordinary Least Squares (OLS) while logistic regression is estimated using Maximum Likelihood Estimation (MLE) approach.

Ordinary Least Squares (OLS), also called linear least squares, is a method for estimating the unknown parameters of a linear regression model. The estimates are obtained by minimizing the total squared vertical distance between the observed responses in the dataset and the responses predicted by the linear approximation (represented by the line of best fit or regression line). The resulting estimator can be expressed using a simple formula.

For example, suppose you have a system of several equations with unknown parameters, and more equations than unknowns. The ordinary least squares method is the standard approach for finding an approximate solution to such overdetermined systems: it chooses the parameters that minimize the sum of the squared errors. The fit that is best in the ordinary least squares sense minimizes the sum of squared residuals, where a residual is the difference between an observed value and the value predicted by the model.
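For a single predictor, the “simple formula” is the familiar closed form for the slope and intercept. The sketch below is illustrative only: the function name `ols_fit` and the five data points are hypothetical.

```python
def ols_fit(xs, ys):
    """Return the slope and intercept that minimize the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = ols_fit(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(slope, intercept)
print(sum(r ** 2 for r in residuals))  # the minimized sum of squared residuals
```

With an intercept in the model, the residuals of the fitted line sum to zero (up to rounding).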

Maximum likelihood estimation, or MLE, is a method for estimating the parameters of a statistical model and for fitting a statistical model to data. Suppose you want to know the heights of all basketball players in a specific location but cannot afford to measure every player. Using maximum likelihood estimation on a sample, you can estimate the mean and variance of the heights of the players. MLE treats the mean and variance as the parameters of the assumed model and chooses the values that make the observed sample most probable.
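If the heights are assumed to follow a normal distribution (an assumption made for this sketch), the maximum likelihood estimates have a closed form: the sample mean, and the average squared deviation from it. The sample values and the name `normal_mle` below are hypothetical.

```python
def normal_mle(sample):
    """Maximum likelihood estimates of the mean and variance of a normal model."""
    n = len(sample)
    mean = sum(sample) / n
    # The MLE of the variance divides by n, not by n - 1.
    variance = sum((x - mean) ** 2 for x in sample) / n
    return mean, variance

# Hypothetical sample of player heights in centimetres.
heights_cm = [198, 201, 193, 206, 202]

mean_hat, variance_hat = normal_mle(heights_cm)
print(mean_hat, variance_hat)
```

Because the MLE divides by n rather than n - 1, it is slightly smaller than the usual unbiased sample variance.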

To sum up, maximum likelihood estimation selects the parameter values under which the observed, fixed set of data is most probable given the assumed probability model, for example the mean and variance of a normal distribution. It gives us a unified approach to estimation. In some cases, however, we cannot use maximum likelihood estimation, for example when the likelihood has no finite maximum.

Building Logistic Regression Model 

To build a logistic regression model we can use statsmodels and the built-in logistic regression implementation in the sklearn library.

# Importing Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Reading German Credit Data
raw_data = pd.read_csv("/content/German_Credit_data.csv")
raw_data.head()

Building Logistic Regression Base Model after data preparation:

import statsmodels.api as sm
#Build Logit Model
logit = sm.Logit(y_train,x_train)

# fit the model
model1 = logit.fit()

# Printing Logistic Regression model results
model1.summary2()
Optimization terminated successfully.
Current function value: 0.480402
Iterations 6
Model:                  Logit                            Pseudo R-squared:  0.197    
Dependent Variable:     Creditability                    AIC:               712.5629
Date:                   2019-09-19 09:55                 BIC:               803.5845
No. Observations:       700                              Log-Likelihood:   -336.28
Df Model:               19                               LL-Null:          -418.79
Df Residuals:           680                              LLR p-value:       2.6772e-25
Converged:              1.0000                           Scale:             1.0000
No. Iterations:         6.0000

We will calculate the model accuracy on the test dataset using the accuracy_score function from sklearn.

# Checking the accuracy with test data
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,predicted_df['Predicted_Class']))
0.74

We can see the accuracy of 74%.

Model Evaluation

Model evaluation metrics are used to assess the goodness of fit between model and data, to compare different models in the context of model selection, and to estimate how accurate predictions are expected to be.

What is a Confusion Matrix?

A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix. The confusion matrix shows the ways in which your classification model is confused when it makes predictions.

Confusion Matrix in Machine Learning

Confusion Matrix gives insight not only into the errors being made by your classifier but more importantly the types of errors that are being made. It is this breakdown that overcomes the limitation of using classification accuracy alone.

How to Calculate a Confusion Matrix

Below is the process for calculating a confusion Matrix:

  1. You need a test dataset or a validation dataset with expected outcome values.
  2. Make a prediction for each row in your test dataset.
  3. From the expected outcomes and predictions count:
  • The number of correct predictions for each class.
  • The number of incorrect predictions for each class, organized by the class that was predicted.

These numbers are then organized into a table or a matrix as follows:

  • Predicted down the side: Each row of the matrix corresponds to a predicted class.
  • Expected across the top: Each column of the matrix corresponds to an actual class.

The counts of correct and incorrect classification are then filled into the table.
The total number of correct predictions for a class goes into the predicted row and the expected column for that class value, that is, onto the diagonal.

In the same way, the total number of incorrect predictions for a class goes into the row for the class that was predicted and the column for the class that was expected.

2-Class Confusion Matrix Case Study

Let us consider we have a two-class classification problem of predicting whether a photograph contains a man or a woman. We have a test dataset of 10 records with expected outcomes and a set of predictions from our classification algorithm.

2-Class Confusion Matrix Case Study in Machine Learning

| Expected | Predicted |
| --- | --- |
| Man | Woman |
| Man | Man |
| Woman | Woman |
| Man | Man |
| Woman | Man |
| Woman | Woman |
| Woman | Woman |
| Man | Man |
| Man | Woman |
| Woman | Woman |

Let’s start off and calculate the classification accuracy for this set of predictions.

The algorithm made 7 of the 10 predictions correctly, which gives an accuracy of 70%:

accuracy = total correct predictions / total predictions made * 100
accuracy = 7 / 10 * 100 = 70%

But what are the types of errors made?
We can determine that by turning our results into a confusion matrix:
First, we must calculate the number of correct predictions for each class.

  • men classified as men: 3
  • women classified as women: 4

Now, we can calculate the number of incorrect predictions for each class, organized by the predicted value:

  • men classified as women: 2
  • women classified as men: 1

We can now arrange these values into the 2-class confusion matrix:


| | men (expected) | women (expected) |
| --- | --- | --- |
| men (predicted) | 3 | 1 |
| women (predicted) | 2 | 4 |

From the above table we learn that:

  • The total actual men in the dataset is the sum of the values on the men column.
  • The total actual women in the dataset is the sum of values in the women's column.
  • The correct values are organized in a diagonal line from top left to bottom-right of the matrix.
  • More errors were made by predicting men as women than predicting women as men.
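The counts above can be reproduced from the ten records of the case study with a few lines of plain Python. This is a sketch rather than a library implementation: rows are indexed by the predicted class and columns by the expected class, matching the table above, and the variable names are illustrative.

```python
from collections import Counter

# The ten records from the case study, in order.
expected = ["man", "man", "woman", "man", "woman",
            "woman", "woman", "man", "man", "woman"]
predicted = ["woman", "man", "woman", "man", "man",
             "woman", "woman", "man", "woman", "woman"]

# Count every (predicted, expected) combination.
counts = Counter(zip(predicted, expected))

classes = ["man", "woman"]
# Rows: predicted class. Columns: expected (actual) class.
matrix = [[counts[(p, e)] for e in classes] for p in classes]

accuracy = sum(p == e for p, e in zip(predicted, expected)) / len(expected)

for label, row in zip(classes, matrix):
    print(label, row)
print("accuracy:", accuracy)
```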

Two-Class Problems Are Special

In a two-class problem, we are often looking to discriminate between observations with a specific outcome and normal observations, such as a disease state or event versus a no-disease state or no-event. In this way, we can assign the event row as “positive” and the no-event row as “negative”. We can then assign the event column of predictions as “true” and the no-event as “false”.

This gives us:

  • “true positive” for correctly predicted event values.
  • “false positive” for incorrectly predicted event values.
  • “true negative” for correctly predicted no-event values.
  • “false negative” for incorrectly predicted no-event values.

We can summarize this in the confusion matrix as follows:


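| | event (expected) | no-event (expected) |
| --- | --- | --- |
| event (predicted) | true positive | false positive |
| no-event (predicted) | false negative | true negative |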

This can help in calculating more advanced classification metrics such as precision, recall, specificity and sensitivity of our classifier. 

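Taking “men” as the event class in the man/woman example above (3 true positives, 1 false positive, 2 false negatives and 4 true negatives):

Sensitivity / recall = 3 / (3 + 2) = 0.6
Specificity = 4 / (4 + 1) = 0.8
Precision = 3 / (3 + 1) = 0.75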

The code mentioned below shows the implementation of confusion matrix in Python with respect to the example used earlier:

# Confusion Matrix
from sklearn.metrics import confusion_matrix
# Store the result under a different name so the imported function is not overwritten
cm = confusion_matrix(y_test,
                      predicted_df['Predicted_Class']).ravel()
cm
array([ 37,  63,  15, 185])

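For binary labels, ravel() returns the counts in the order true negatives, false positives, false negatives, true positives. The results therefore tell us that 37 (true negatives) and 185 (true positives) are the numbers of correct predictions, while 63 (false positives) and 15 (false negatives) are the numbers of incorrect predictions.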
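As a worked example, the four counts printed above can be turned into the usual classification metrics. The sketch assumes scikit-learn's ordering of the flattened binary matrix (true negatives, false positives, false negatives, true positives); the variable names are illustrative.

```python
# Counts from the flattened confusion matrix, in the order TN, FP, FN, TP.
tn, fp, fn, tp = 37, 63, 15, 185

sensitivity = tp / (tp + fn)            # share of actual positives that were found
specificity = tn / (tn + fp)            # share of actual negatives that were found
precision = tp / (tp + fp)              # share of predicted positives that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("sensitivity:", sensitivity)
print("specificity:", specificity)
print("precision:", precision)
print("accuracy:", accuracy)
```

The accuracy computed this way agrees with the 0.74 reported by accuracy_score earlier, and the low specificity shows that most of the errors are false positives.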

Receiver Operating Characteristic (ROC)

The receiver operating characteristic (ROC), or the ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, or as recall in machine learning. The false positive rate is also known as the fall-out and can be calculated as (1 - specificity). The ROC curve is thus the sensitivity as a function of fall-out.

Receiver Operating Characteristic (ROC) in Machine Learning

There are a number of methods of evaluating whether a logistic model is a good model. One such way is sensitivity and specificity. Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as classification function:

Sensitivity (also known as the true positive rate or recall) measures the proportion of actual positives which are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition), and is complementary to the false negative rate. It shows how good a test is at detecting the positives. A test can cheat and maximize this by always returning “positive”.

 Sensitivity= true positives/ (true positive + false negative)

Specificity (also called the true negative rate) measures the proportion of negatives which are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition), and is complementary to the false positive rate. It shows how good a test is at avoiding false alarms. A test can cheat and maximize this by always returning “negative”.

Specificity= true negatives/ (true negative + false positives)

Precision measures the proportion of predicted positives that are actually positive. It is usually reported together with recall, which in information retrieval is the percentage of all relevant documents that is returned by the search. The two measures are sometimes combined in the F1 score (or F-measure) to provide a single measurement for a system. Precision shows how many of the positively classified instances were relevant. A test can cheat and maximize this by returning positive only for the one result it is most confident in.

Precision = true positives / (true positives + false positives)

The precision-recall curve shows the trade-off between precision and recall for different thresholds. The choice of threshold is largely driven by the precision and recall it yields. Ideally, we want both precision and recall to be 1, but this is seldom the case. In case of a precision-recall trade-off, we use the following arguments to decide upon the threshold:

  1. Low Precision/High Recall: In applications where we want to reduce the number of false negatives without necessarily reducing the number of false positives, we choose a decision threshold that gives a high value of recall, even at the cost of a low value of precision. For example, in a cancer diagnosis application, we do not want any affected patient to be classified as not affected, and we care less about a healthy patient being wrongly diagnosed with cancer. This is because the absence of cancer can be confirmed by further medical tests, but the presence of the disease cannot be detected in a patient who has already been cleared.
  2. High Precision/Low Recall: In applications where we want to reduce the number of false positives without necessarily reducing the number of false negatives, we choose a decision threshold that gives a high value of precision, even at the cost of a low value of recall. For example, if we are classifying whether customers will react positively or negatively to a personalised advertisement, we want to be very sure that the customer will react positively, because a negative reaction can cause a loss of potential sales from the customer.
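The trade-off can be seen by sweeping the threshold over a small set of scores. The labels and scores below are hypothetical, and returning a precision of 1.0 when nothing is predicted positive is a convention chosen for this sketch.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when scores at or above the threshold count as positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention for "no positives predicted"
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical true labels and predicted probabilities.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

for threshold in (0.3, 0.5, 0.85):
    print(threshold, precision_recall(labels, scores, threshold))
```

On this data, raising the threshold trades recall for precision: the lowest threshold finds every positive but with many false alarms, while the highest one makes no false alarms but misses most positives.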

The code mentioned below shows the implementation in Python with respect to the example used earlier:

from sklearn.metrics import classification_report

print(classification_report(y_test, predicted_df['Predicted_Class']))

precision-recall curve in Machine Learning

The f1-score tells you the accuracy of the classifier in classifying the data points in that particular class compared to all other classes. It is calculated by taking the harmonic mean of precision and recall. The support is the number of samples of the true response that lies in that class.
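The harmonic mean can be written out directly. As an illustration, the sketch below applies it to the positive class of the confusion matrix counts printed earlier (185 true positives, 63 false positives, 15 false negatives); the function name `harmonic_f1` is hypothetical.

```python
def harmonic_f1(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Counts for the positive class from the earlier confusion matrix.
tp, fp, fn = 185, 63, 15

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = harmonic_f1(precision, recall)
print(precision, recall, f1)
```

Because it is a harmonic mean, the F1 score is pulled towards the lower of the two values, so a model cannot score well by maximizing only one of them.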

y_pred_prob = model1.predict(x_test)

from sklearn.metrics import roc_curve
# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
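Positive rate in Machine Learning

# AUC
from sklearn.metrics import roc_auc_score
# Note: predicted class labels are passed here, so the score reflects a single
# thresholded operating point; passing y_pred_prob instead would give the area
# under the ROC curve plotted above.
roc_auc_score(y_test,predicted_df['Predicted_Class'])
0.6475

The Area Under the Curve, computed from the predicted class labels, is 0.6475.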
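The value 0.6475 can be checked by hand from the confusion matrix counts printed earlier (37 true negatives, 63 false positives, 15 false negatives, 185 true positives). When hard class labels are scored, the ROC curve has a single interior point, and the area under it equals the average of the true positive rate and the true negative rate. The sketch below only verifies that arithmetic.

```python
# Counts from the earlier confusion matrix, in the order TN, FP, FN, TP.
tn, fp, fn, tp = 37, 63, 15, 185

tpr = tp / (tp + fn)   # true positive rate (sensitivity)
fpr = fp / (fp + tn)   # false positive rate (1 - specificity)

# Area under the piecewise-linear curve (0,0) -> (fpr, tpr) -> (1,1):
# a triangle followed by a trapezoid.
auc_single_point = 0.5 * fpr * tpr + 0.5 * (1 - fpr) * (tpr + 1)

# The same number expressed as the mean of TPR and TNR.
balanced_accuracy = (tpr + (1 - fpr)) / 2

print(auc_single_point, balanced_accuracy)
```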

Hosmer Lemeshow Goodness-of-Fit

  • It measures the association between actual events and predicted probability.
  • How well our model fits depends on the difference between the model and the observed data. One approach for binary data is to implement a Hosmer Lemeshow goodness of fit test
  • In HL test, the null hypothesis states, the model fits the data well. Model appears to fit well if we have no significant difference between the model and the observed data (i.e. the p-value > 0.05, so not rejecting the Ho)
  • Or in other words, if the test is NOT statistically significant, that indicates the model is a good fit.
  • As with all measures of model fit, use this as just one piece of information in deciding how well this model fits. It doesn’t work well in very large or very small data sets, but is often useful nonetheless.
$$ G^2_{HL} = \sum_{j=1}^{n} \frac{(O_j - E_j)^2}{E_j \left(1 - E_j / n_j\right)} \sim \chi^2_s $$
  • χ² = chi-squared.
  • n_j = number of observations in the j-th group.
  • O_j = number of observed cases in the j-th group.
  • E_j = number of expected cases in the j-th group.
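The statistic itself is a short sum once the observations have been grouped. The sketch below assumes the grouping has already been done and uses hypothetical counts for three groups of ten observations each; it computes only the statistic, not the p-value, which would require a chi-squared distribution function.

```python
def hosmer_lemeshow_statistic(observed, expected, group_sizes):
    """Sum of (O_j - E_j)^2 / (E_j * (1 - E_j / n_j)) over all groups."""
    total = 0.0
    for o_j, e_j, n_j in zip(observed, expected, group_sizes):
        total += (o_j - e_j) ** 2 / (e_j * (1.0 - e_j / n_j))
    return total

# Hypothetical grouped data: observed events, expected events, group sizes.
observed = [1, 5, 8]
expected = [2.0, 5.0, 7.5]
group_sizes = [10, 10, 10]

statistic = hosmer_lemeshow_statistic(observed, expected, group_sizes)
print(statistic)
```

A small statistic means the observed counts stay close to the expected counts in every group, which is what a well-fitting model should produce.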

Gini Coefficient

  • The Gini coefficient is sometimes used in classification problems.
  • The Gini coefficient can be derived directly from the AUC of the ROC curve. It is the ratio of the area between the ROC curve and the diagonal line to the area of the triangle above the diagonal. The formula used is:
Gini = 2*AUC – 1
  • As a rule of thumb, a Gini above 60% is often taken to indicate a good model.
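Applied to the AUC value of 0.6475 reported earlier, the conversion is a one-liner. The function name `gini_from_auc` is illustrative.

```python
def gini_from_auc(auc):
    """Convert an AUC value into the corresponding Gini coefficient."""
    return 2.0 * auc - 1.0

gini = gini_from_auc(0.6475)
print(round(gini, 4))
```

A classifier that is no better than chance (AUC of 0.5) has a Gini of 0, and a perfect one (AUC of 1) has a Gini of 1.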

Akaike Information Criterion and Bayesian Information Criterion

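  • AIC and BIC play a role similar to adjusted R-squared in linear regression: they reward goodness of fit while penalizing the number of parameters, and lower values indicate a better model.
  • AIC = −2 ln(L) + 2k
  • BIC = −2 ln(L) + k ln(n)
  • Here L is the maximized likelihood of the model, k is the number of estimated parameters and n is the number of observations.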
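As a check, the AIC and BIC in the statsmodels summary shown earlier can be recomputed from its log-likelihood using the likelihood-based definitions, AIC = -2 ln(L) + 2k and BIC = -2 ln(L) + k ln(n). The parameter count k = 20 is an assumption inferred from the summary (700 observations minus 680 residual degrees of freedom), and the log-likelihood is the rounded value printed there.

```python
import math

# Values read from the model summary shown earlier.
log_likelihood = -336.28   # rounded Log-Likelihood
n_obs = 700                # No. Observations
k_params = 20              # assumption: n_obs minus Df Residuals (700 - 680)

aic = -2 * log_likelihood + 2 * k_params
bic = -2 * log_likelihood + k_params * math.log(n_obs)

print(round(aic, 2), round(bic, 2))
```

Both results agree with the reported 712.5629 and 803.5845 up to the rounding of the log-likelihood.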

Pros and Cons of Logistic Regression

Many of the pros and cons of the linear regression model also apply to the logistic regression model. Although logistic regression is widely used for many types of problems, its limitations mean that other predictive models often deliver better predictive results.

Pros

  • The logistic regression model not only acts as a classification model, but also gives you probabilities. This is a big advantage over other models where they can only provide the final classification. Knowing that an instance has a 99% probability for a class compared to 51% makes a big difference. Logistic Regression performs well when the dataset is linearly separable.
  • Logistic Regression not only gives a measure of how relevant a predictor (coefficient size) is, but also its direction of association (positive or negative). We see that Logistic regression is easier to implement, interpret and very efficient to train.

Cons

  • Logistic regression can suffer from complete separation. If there is a feature that would perfectly separate the two classes, the logistic regression model can no longer be trained. This is because the weight for that feature would not converge, because the optimal weight would be infinite. This is really a bit unfortunate, because such a feature is really very useful. But you do not need machine learning if you have a simple rule that separates both classes. The problem of complete separation can be solved by introducing penalization of the weights or defining a prior probability distribution of weights.
  • Logistic regression is less prone to overfitting than more flexible models, but it can still overfit on high-dimensional datasets. In that case, regularization techniques should be considered.
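The complete separation problem, and the effect of penalizing the weights, can be illustrated on a tiny dataset. The sketch below is a hypothetical example with one feature and no intercept: without a penalty the weight keeps growing the longer gradient descent runs, while a small L2 penalty (the value 0.1 is arbitrary) makes it settle at a finite value.

```python
import math

def sigmoid(z):
    # Numerically stable sigmoid for both signs of z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_weight(xs, ys, iterations, l2_penalty=0.0, learning_rate=0.5):
    """Gradient descent on a single weight, with an optional L2 penalty."""
    w = 0.0
    m = len(xs)
    for _ in range(iterations):
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / m
        grad += l2_penalty * w   # gradient of the penalty term (l2_penalty / 2) * w^2
        w -= learning_rate * grad
    return w

# Perfectly separable toy data: negative x is class 0, positive x is class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w_short = fit_weight(xs, ys, iterations=100)
w_long = fit_weight(xs, ys, iterations=10000)
w_penalized_short = fit_weight(xs, ys, iterations=1000, l2_penalty=0.1)
w_penalized_long = fit_weight(xs, ys, iterations=10000, l2_penalty=0.1)

print(w_short, w_long)                      # keeps growing with more iterations
print(w_penalized_short, w_penalized_long)  # settles at a finite value
```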

In this article we have seen what Logistic Regression is, how it works, when we should use it, comparison of Logistic and Linear Regression, the difference between the approach and usage of two estimation techniques: Maximum Likelihood Estimation and Ordinary Least Square Method, evaluation of model using Confusion Matrix and the advantages and disadvantages of Logistic Regression. We have also covered some basics of sigmoid function, cost function and gradient descent.



Priyankur Sarkar

Data Science Enthusiast

Priyankur Sarkar loves to play with data and get insightful results out of it, then turn those data insights and results in business growth. He is an electronics engineer with a versatile experience as an individual contributor and leading teams, and has actively worked towards building Machine Learning capabilities for organizations.

Join the Discussion

Your email address will not be published. Required fields are marked *

Suggested Blogs

Bagging and Random Forest in Machine Learning

In today’s world, innovations happen on a daily basis, rendering all the previous versions of that product, service or skill-set outdated and obsolete. In such a dynamic and chaotic space, how can we make an informed decision without getting carried away by plain hype? To make the right decisions, we must follow a set of processes; investigate the current scenario, chart down your expectations, collect reviews from others, explore your options, select the best solution after weighing the pros and cons, make a decision and take the requisite action. For example, if you are looking to purchase a computer, will you simply walk up to the store and pick any laptop or notebook? It’s highly unlikely that you would do so. You would probably search on Amazon, browse a few web portals where people have posted their reviews and compare different models, checking for their features, specifications and prices. You will also probably ask your friends and colleagues for their opinion. In short, you would not directly jump to a conclusion, but will instead make a decision considering the opinions and reviews of other people as well. Ensemble models in machine learning also operate on a similar manner. They combine the decisions from multiple models to improve the overall performance. The objective of this article is to introduce the concept of ensemble learning and understand algorithms like bagging and random forest which use a similar technique. What is Ensemble Learning? Ensemble methods aim at improving the predictive performance of a given statistical learning or model fitting technique. The general principle of ensemble methods is to construct a linear combination of some model fitting method, instead of using a single fit of the method. An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. 
Ensemble methods combine several decision trees classifiers to produce better predictive performance than a single decision tree classifier. The main principle behind the ensemble model is that a group of weak learners come together to form a strong learner, thus increasing the accuracy of the model.When we try to predict the target variable using any machine learning technique, the main causes of difference in actual and predicted values are noise, variance, and bias. Ensemble helps to reduce these factors (except noise, which is irreducible error). The noise-related error is mainly due to noise in the training data and can't be removed. However, the errors due to bias and variance can be reduced.The total error can be expressed as follows: Total Error = Bias + Variance + Irreducible Error A measure such as mean square error (MSE) captures all of these errors for a continuous target variable and can be represented as follows: Where, E stands for the expected mean, Y represents the actual target values and fˆ(x) is the predicted values for the target variable. It can be broken down into its components such as bias, variance and noise as shown in the following formula: Using techniques like Bagging and Boosting helps to decrease the variance and increase the robustness of the model. Combinations of multiple classifiers decrease variance, especially in the case of unstable classifiers, and may produce a more reliable classification than a single classifier. Ensemble Algorithm The goal of ensemble algorithms is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator. There are two families of ensemble methods which are usually distinguished: Averaging methods. The driving principle is to build several estimators independently and then to average their predictions. 
On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced.|Examples: Bagging methods, Forests of randomized trees. Boosting methods. Base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.Examples: AdaBoost, Gradient Tree Boosting.Advantages of Ensemble Algorithm Ensemble is a proven method for improving the accuracy of the model and works in most of the cases. Ensemble makes the model more robust and stable thus ensuring decent performance on the test cases in most scenarios. You can use ensemble to capture linear and simple as well nonlinear complex relationships in the data. This can be done by using two different models and forming an ensemble of two. Disadvantages of Ensemble Algorithm Ensemble reduces the model interpret-ability and makes it very difficult to draw any crucial business insights at the end It is time-consuming and thus might not be the best idea for real-time applications The selection of models for creating an ensemble is an art which is really hard to master Basic Ensemble Techniques Max Voting: Max-voting is one of the simplest ways of combining predictions from multiple machine learning algorithms. Each base model makes a prediction and votes for each sample. The sample class with the highest votes is considered in the final predictive class. It is mainly used for classification problems.  Averaging: Averaging can be used while estimating the probabilities in classification tasks. But it is usually used for regression problems. Predictions are extracted from multiple models and an average of the predictions are used to make the final prediction. Weighted Average: Like averaging, weighted averaging is also used for regression tasks. Alternatively, it can be used while estimating probabilities in classification problems. 
Base learners are assigned different weights, which represent the importance of each model in the prediction. Ensemble Methods Ensemble methods became popular as a relatively simple device to improve the predictive performance of a base procedure. There are different reasons for this: the bagging procedure turns out to be a variance reduction scheme, at least for some base procedures. On the other hand, boosting methods are primarily reducing the (model) bias of the base procedure. This already indicates that bagging and boosting are very different ensemble methods. From the perspective of prediction, random forests is about as good as boosting, and often better than bagging.  Bootstrap Aggregation or Bagging tries to implement similar learners on small sample populations and then takes a mean of all the predictions. It combines Bootstrapping and Aggregation to form one ensemble model Reduces the variance error and helps to avoid overfitting Bagging algorithms include: Bagging meta-estimator Random forest Boosting refers to a family of algorithms which converts weak learner to strong learners. Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. Boosting is focused on reducing the bias. It makes the boosting algorithms prone to overfitting. To avoid overfitting, parameter tuning plays an important role in boosting algorithms. Some examples of boosting are mentioned below: AdaBoost GBM XGBM Light GBM CatBoost Why use ensemble models? Ensemble models help in improving algorithm accuracy as well as the robustness of a model. Both Bagging and Boosting should be known by data scientists and machine learning engineers and especially people who are planning to attend data science/machine learning interviews. Ensemble learning uses hundreds to thousands of models of the same algorithm and then work hand in hand to find the correct classification. 
You may also consider the fable of the blind men and the elephant to understand ensemble learning, where each blind man found a feature of the elephant and they all thought it was something different. However, if they would work together and discussed among themselves, they might have figured out what it is. Using techniques like bagging and boosting leads to increased robustness of statistical models and decreased variance. Now the question becomes, between these different “B” words. Which is better? Which is better, Bagging or Boosting? There is no perfectly correct answer to that. It depends on the data, the simulation and the circumstances. Bagging and Boosting decrease the variance of your single estimate as they combine several estimates from different models. So the result may be a model with higher stability. If the problem is that the single model gets a very low performance, Bagging will rarely get a better bias. However, Boosting could generate a combined model with lower errors as it optimizes the advantages and reduces pitfalls of the single model. By contrast, if the difficulty of the single model is overfitting, then Bagging is the best option. Boosting for its part doesn’t help to avoid over-fitting; in fact, this technique is faced with this problem itself. For this reason, Bagging is effective more often than Boosting. In this article we will discuss about Bagging, we will cover Boosting in the next post. But first, let us look into the very important concept of bootstrapping. Bootstrap Sampling Sampling is the process of selecting a subset of observations from the population with the purpose of estimating some parameters about the whole population. Resampling methods, on the other hand, are used to improve the estimates of the population parameters. In machine learning, the bootstrap method refers to random sampling with replacement. This sample is referred to as a resample. 
This allows the model or algorithm to get a better understanding of the various biases, variances and features that exist in the resample. Taking a sample of the data allows the resample to contain different characteristics then it might have contained as a whole. This is demonstrated in figure 1 where each sample population has different pieces, and none are identical. This would then affect the overall mean, standard deviation and other descriptive metrics of a data set. In turn, it can develop more robust models. Bootstrapping is also great for small size data sets that can have a tendency to overfit. In fact, we recommended this to one company who was concerned because their data sets were far from “Big Data”. Bootstrapping can be a solution in this case because algorithms that utilize bootstrapping can be more robust and handle new data sets depending on the methodology chosen(boosting or bagging). The reason behind using the bootstrap method is because it can test the stability of a solution. By using multiple sample data sets and then testing multiple models, it can increase robustness. Perhaps one sample data set has a larger mean than another, or a different standard deviation. This might break a model that was overfit, and not tested using data sets with different variations. One of the many reasons bootstrapping has become very common is because of the increase in computing power. This allows for many times more permutations to be done with different resamples than previously. Bootstrapping is used in both Bagging and Boosting Let us assume we have a sample of ‘n’ values (x) and we’d like to get an estimate of the mean of the sample. mean(x) = 1/n * sum(x) Consider a sample of 100 values (x) and we’d like to get an estimate of the mean of the sample. We can calculate the mean directly from the sample as: We know that our sample is small and that the mean has an error in it. 
We can improve the estimate of our mean using the bootstrap procedure:

1. Create many (e.g. 1000) random sub-samples of the data set with replacement (meaning we can select the same value multiple times).
2. Calculate the mean of each sub-sample.
3. Calculate the average of all of our collected means and use that as our estimated mean for the data.

Example: Suppose we used 3 re-samples and got the mean values 2.3, 4.5 and 3.3. Taking the average of these, we could take the estimated mean of the data to be 3.367.

This process can be used to estimate other quantities like the standard deviation and even quantities used in machine learning algorithms, like learned coefficients.

While using Python, we do not have to implement the bootstrap method manually. The scikit-learn library provides an implementation that creates a single bootstrap sample of a dataset. The resample() scikit-learn function can be used for sampling. It takes as arguments the data array, whether or not to sample with replacement, the size of the sample, and the seed for the pseudorandom number generator used for the sampling. For example, let us create a bootstrap sample with replacement containing 4 observations, using a value of 1 for the pseudorandom number generator:

    boot = resample(data, replace=True, n_samples=4, random_state=1)

The bootstrap API does not provide an easy way to gather the out-of-bag observations that could be used as a test set to evaluate a fit model. In the univariate case, we can gather the out-of-bag observations using a simple Python list comprehension:
    # out of bag observations
    oob = [x for x in data if x not in boot]

Let us look at a small example and execute it.

    # scikit-learn bootstrap
    from sklearn.utils import resample
    # data sample
    data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    # prepare bootstrap sample
    boot = resample(data, replace=True, n_samples=4, random_state=1)
    print('Bootstrap Sample: %s' % boot)
    # out of bag observations
    oob = [x for x in data if x not in boot]
    print('OOB Sample: %s' % oob)

The output will include the observations in the bootstrap sample and those observations in the out-of-bag sample.

    Bootstrap Sample: [0.6, 0.4, 0.5, 0.1]
    OOB Sample: [0.2, 0.3]

Bagging

Bootstrap Aggregation, also known as Bagging, is a powerful ensemble method that was proposed by Leo Breiman in 1994 to prevent overfitting. The concept behind bagging is to combine the predictions of several base learners to create a more accurate output. Bagging is the application of the bootstrap procedure to a high-variance machine learning algorithm, typically decision trees.

Suppose there are N observations and M features:

1. A sample of observations is selected randomly with replacement (bootstrapping).
2. A subset of features is selected to create a model from the sample of observations and the subset of features. (Plain bagging can also use all features; selecting a subset is the idea random forests build on.)
3. The feature from the subset that gives the best split on the training data is selected.
4. This is repeated to create many models, and every model is trained in parallel.
5. The prediction is given based on the aggregation of predictions from all the models.

This approach can be used with machine learning algorithms that have a high variance, such as decision trees. A separate model is trained on each bootstrap sample of data, and the average output of those models is used to make predictions. This technique is called bootstrap aggregation, or bagging for short.
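To make the idea of training one model per bootstrap sample and averaging the outputs concrete, here is a minimal sketch that uses only the Python standard library. It is an illustration under stated assumptions, not how scikit-learn implements bagging: the base learner is a one-split regression tree (a "stump"), the toy data is synthetic, and the helper names (bootstrap_sample, fit_stump, fit_bagged_stumps, predict_bagged) are hypothetical.

```python
import random
import statistics


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows):
    """Fit a one-split regression tree (a 'stump') to (x, y) pairs.

    Returns (threshold, left_mean, right_mean).
    """
    xs = sorted({x for x, _ in rows})
    overall = statistics.fmean([y for _, y in rows])
    # Fallback when every x in the sample is identical: predict the mean.
    best = (float("inf"), xs[0], overall, overall)
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        left_mean = statistics.fmean(left)
        right_mean = statistics.fmean(right)
        # Sum of squared errors of this split.
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged_stumps(rows, n_models, seed=0):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_models)]


def predict_bagged(models, x):
    """Aggregate by averaging the individual predictions (regression)."""
    return statistics.fmean([predict_stump(m, x) for m in models])


# Synthetic toy data (an assumption for illustration): y is roughly 2x plus noise.
data_rng = random.Random(42)
rows = [(i / 4, 2 * (i / 4) + data_rng.gauss(0, 0.5)) for i in range(40)]

models = fit_bagged_stumps(rows, n_models=25)
print(predict_bagged(models, 1.0), predict_bagged(models, 9.0))
```

Each stump sees a slightly different resample, so the split thresholds differ from model to model; averaging them gives a smoother prediction than any single stump. For classification, the averaging step would be replaced by a majority vote.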
Variance means that an algorithm’s performance is sensitive to the training data, with high variance suggesting that the more the training data is changed, the more the performance of the algorithm will vary. The performance of high-variance machine learning algorithms like unpruned decision trees can be improved by training many trees and taking the average of their predictions. Results are often better than a single decision tree.

What Bagging does is help reduce variance in models that might be very accurate, but only on the data they were trained on. This is also known as overfitting. Overfitting is when a function fits the data too well. Typically this is because the fitted equation is so complicated that it accounts for every data point and outlier. Bagging gets around this by creating its own variance amongst the data, sampling with replacement while it tests multiple hypotheses (models). In turn, this reduces the noise by utilizing multiple samples that would most likely be made up of data with various attributes (median, average, etc.).

Once each model has developed a hypothesis, the models use voting for classification or averaging for regression. This is where the “Aggregating” in “Bootstrap Aggregating” comes into play. Each hypothesis has the same weight as all the others. When we later discuss boosting, this is one of the places the two methodologies differ. Essentially, all these models run at the same time and vote on the final prediction. This helps to decrease variance, i.e. reduce the overfit.

Advantages

Bagging takes advantage of ensemble learning, wherein multiple weak learners combined can outperform a single strong learner.
It helps reduce variance and thus helps us avoid overfitting.

Disadvantages

There is a loss of interpretability of the model.
There can be a problem of high bias if the base models are not chosen properly.
While bagging gives us more accuracy, it is computationally expensive and may not be desirable depending on the use case.

There are many bagging algorithms, of which perhaps the most prominent is Random Forest.

Decision Trees

Decision trees are simple but intuitive models. Using a top-down approach, a root node creates binary splits until a particular criterion is fulfilled. This binary splitting of nodes results in a predicted value based on the interior nodes that lead to the terminal (final) nodes. For a classification problem, a decision tree will output a predicted target class for each terminal node produced. We have covered the decision tree algorithm in detail for both classification and regression in another article.

Limitations to Decision Trees

Decision trees tend to have high variance when they utilize different training and test sets of the same data, since they tend to overfit on training data. This leads to poor performance when new and unseen data is added, which limits the usage of decision trees in predictive modeling. However, using ensemble methods, models that use decision trees as a foundation can be created to produce powerful results.

Bootstrap Aggregating Trees

We have already discussed bootstrap aggregating (or bagging). With it, we can create an ensemble (forest) of trees where multiple training sets are generated by sampling data instances with replacement. Once the training sets are created, a CART model can be trained on each subsample.

Features of Bagged Trees

Reduces variance by averaging the ensemble's results.
The resulting model uses the entire feature space when considering node splits.
Bagging allows the trees to grow without pruning, which yields deep trees that individually have high variance but low bias; averaging them can help improve predictive power.

Limitations to Bagging Trees

The main limitation of bagging trees is that it uses the entire feature space when creating splits in the trees.
Suppose some variables within the feature space strongly indicate certain predictions; there is then a risk of having a forest of correlated trees, which limits how much variance the averaging can remove.

Why a Forest is better than One Tree?

The main objective of a machine learning model is to generalize properly to new and unseen data. When we have a flexible model, overfitting can take place. A flexible model is said to have high variance because the learned parameters (such as the structure of the decision tree) will vary with the training data. On the other hand, an inflexible model is said to have high bias as it makes assumptions about the training data. An inflexible model may not have the capacity to fit even the training data, and in both cases (high variance and high bias) the model is not able to generalize properly to new and unseen data. You can go through the article on one of the foundational concepts in machine learning, the bias-variance tradeoff, which will help you understand the balance between a model so flexible that it memorizes the training data and a model so inflexible that it cannot learn the training data.

The main reason a decision tree is prone to overfitting when we do not limit the maximum depth is that it has unlimited flexibility, which means it can keep growing until there is one leaf node for every single observation. Instead of limiting the depth of the tree, which reduces variance and increases bias, we can combine many decision trees into a single ensemble model known as the random forest.

What is Random Forest algorithm?

Random forest is like a bootstrapping algorithm with a decision tree (CART) model. Suppose we have 1000 observations in the complete population with 10 variables. Random forest will try to build multiple CART models with different samples and different initial variables. For instance, it will take a random sample of 100 observations and then choose 5 initial variables randomly to build a CART model.
It will go on repeating the process, say, about 10 times and then make a final prediction on each of the observations. The final prediction is a function of each individual prediction; it can simply be the mean of each prediction.

The random forest is a model made up of many decision trees. Rather than just simply averaging the prediction of trees (which we could call a “forest”), this model uses two key concepts that give it the name random:

Random sampling of training data points when building trees
Random subsets of features considered when splitting nodes

How the Random Forest Algorithm Works

The basic steps involved in performing the random forest algorithm are mentioned below:

1. Pick N random records from the dataset.
2. Build a decision tree based on these N records.
3. Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
4. In case of a regression problem, for a new record, each tree in the forest predicts a value for Y (output). The final value can be calculated by taking the average of all the values predicted by all the trees in the forest. Or, in the case of a classification problem, each tree in the forest predicts the category to which the new record belongs. Finally, the new record is assigned to the category that wins the majority vote.

Using Random Forest for Regression

Here we have a problem where we have to predict the gas consumption (in millions of gallons) in 48 US states based on petrol tax (in cents), per capita income (dollars), paved highways (in miles) and the proportion of the population with a driving license. We will use the random forest algorithm via the Scikit-Learn Python library to solve this regression problem. First we import the necessary libraries and our dataset.
    import pandas as pd
    import numpy as np
    dataset = pd.read_csv('/content/petrol_consumption.csv')
    dataset.head()

       Petrol_tax  Average_income  paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
    0         9.0            3571            1976                         0.525                 541
    1         9.0            4092            1250                         0.572                 524
    2         9.0            3865            1586                         0.580                 561
    3         7.5            4870            2351                         0.529                 414
    4         8.0            4399             431                         0.544                 410

You will notice that the values in our dataset are not very well scaled. We will scale them before training the algorithm.

Preparing Data For Training

We will perform two tasks in order to prepare the data. Firstly, we will divide the data into ‘attributes’ and ‘label’ sets. The result will then be divided into training and test sets.

    X = dataset.iloc[:, 0:4].values
    y = dataset.iloc[:, 4].values

Now let us divide the data into training and testing sets:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Feature Scaling

The dataset is not yet scaled: the Average_income field has values in the range of thousands while Petrol_tax has values in the range of tens. It will be better if we scale our data. We will use Scikit-Learn's StandardScaler class to do so.

    # Feature Scaling
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)

Training the Algorithm

Now that we have scaled our dataset, let us train the random forest algorithm to solve this regression problem.

    from sklearn.ensemble import RandomForestRegressor
    regressor = RandomForestRegressor(n_estimators=20, random_state=0)
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)

The RandomForestRegressor class is used to solve regression problems via random forest. Its most important parameter is n_estimators, which defines the number of trees in the random forest. Here we start with n_estimators=20 and check the performance of the algorithm.
You can find details for all of the parameters of RandomForestRegressor in the scikit-learn documentation.

Evaluating the Algorithm

Let us evaluate the performance of the algorithm. For regression problems, the metrics used to evaluate an algorithm are mean absolute error, mean squared error, and root mean squared error.

    from sklearn import metrics
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

    Mean Absolute Error: 51.76500000000001
    Mean Squared Error: 4216.166749999999
    Root Mean Squared Error: 64.93201637097064

With 20 trees, the root mean squared error is 64.93, which is greater than 10 percent of the average petrol consumption, i.e. 576.77. This may indicate, among other things, that we have not used enough estimators (trees). Let us now change the number of estimators to 200. The results are as follows:

    Mean Absolute Error: 48.33899999999999
    Mean Squared Error: 3494.2330150000003
    Root Mean Squared Error: 59.112037818028234

The graph below shows the decrease in the value of the root mean squared error (RMSE) with respect to the number of estimators.

You will notice that the error values decrease as the number of estimators increases. You may consider 200 a good number for n_estimators, as the rate of decrease in error diminishes beyond that point. You may try playing around with other parameters to figure out a better result.

Using Random Forest for Classification

Now let us consider a classification problem: predicting whether a bank currency note is authentic or not based on four attributes, i.e. the variance, skewness and kurtosis of the wavelet-transformed image, and the entropy of the image. We will use the Random Forest Classifier to solve this binary classification problem. Let’s get started.
    import pandas as pd
    import numpy as np
    dataset = pd.read_csv('/content/bill_authentication.csv')
    dataset.head()

       Variance  Skewness  Kurtosis   Entropy  Class
    0   3.62160    8.6661   -2.8073  -0.44699      0
    1   4.54590    8.1674   -2.4586  -1.46210      0
    2   3.86600   -2.6383    1.9242   0.10645      0
    3   3.45660    9.5228   -4.0112  -3.59440      0
    4   0.32924   -4.4552    4.5718  -0.98880      0

Similar to the data we used previously for the regression problem, this data is not scaled. Let us prepare the data for training.

Preparing Data For Training

The following code divides the data into attributes and labels:

    X = dataset.iloc[:, 0:4].values
    y = dataset.iloc[:, 4].values

The following code divides the data into training and testing sets:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Feature Scaling

We will do the same thing as we did for the previous problem.

    # Feature Scaling
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)

Training the Algorithm

Now that we have scaled our dataset, let us train the random forest algorithm to solve this classification problem.

    from sklearn.ensemble import RandomForestClassifier
    classifier = RandomForestClassifier(n_estimators=20, random_state=0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)

For classification, we have used the RandomForestClassifier class of the sklearn.ensemble library. It takes n_estimators as a parameter, which defines the number of trees in our random forest. Similar to the regression problem, we have started with 20 trees here. You can find details for all of the parameters of RandomForestClassifier in the scikit-learn documentation.
Evaluating the Algorithm

For evaluating classification problems, the metrics used are accuracy, confusion matrix, precision, recall, and F1 values.

    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print(accuracy_score(y_test, y_pred))

The output will look something like this:

    [[155   2]
     [  1 117]]
                  precision    recall  f1-score   support
               0       0.99      0.99      0.99       157
               1       0.98      0.99      0.99       118
        accuracy                           0.99       275
       macro avg       0.99      0.99      0.99       275
    weighted avg       0.99      0.99      0.99       275
    0.9890909090909091

The accuracy achieved by our random forest classifier with 20 trees is 98.90%. Let us change the number of trees to 200.

    from sklearn.ensemble import RandomForestClassifier
    classifier = RandomForestClassifier(n_estimators=200, random_state=0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)

Output:

    [[155   2]
     [  1 117]]
                  precision    recall  f1-score   support
               0       0.99      0.99      0.99       157
               1       0.98      0.99      0.99       118
        accuracy                           0.99       275
       macro avg       0.99      0.99      0.99       275
    weighted avg       0.99      0.99      0.99       275
    0.9890909090909091

Unlike the regression problem, changing the number of estimators for this problem did not make any difference in the results. An accuracy of 98.9% is pretty good. In this case, we have seen that there is not much improvement if the number of trees is increased. You may try playing around with other parameters of the RandomForestClassifier class and see if you can improve on our results.

Advantages and Disadvantages of using Random Forest

As with any algorithm, there are advantages and disadvantages to using it. Let us look into the pros and cons of using Random Forest for classification and regression.

Advantages

The random forest algorithm is less sensitive to the peculiarities of any single training sample, as there are multiple trees and each tree is trained on a subset of the data.
The random forest algorithm is very stable. Introducing new data into the dataset does not affect it much, as the new data may impact one tree but is unlikely to impact all the trees.
The random forest algorithm works well when you have both categorical and numerical features.
The random forest algorithm can still perform well when the dataset has missing values, although some implementations require the missing values to be imputed first.

Disadvantages

A major disadvantage of random forests lies in their complexity. They require more computational resources because of the large number of decision trees joined together.
Due to their complexity, training time is longer compared to other algorithms.

Summary

In this article we have covered what ensemble learning is and discussed basic ensemble techniques. We also looked into bootstrap sampling, which involves iteratively resampling a dataset with replacement and allows the model or algorithm to get a better understanding of various features. Then we moved on to bagging, followed by random forest. We also implemented random forest in Python for both regression and classification and concluded that increasing the number of trees or estimators does not always make a difference in a classification problem, whereas in regression it had an impact.

We have covered most of the topics related to algorithms in our series of machine learning blogs.
Bagging and Random Forest in Machine Learning


Support Vector Machines in Machine Learning

While many classifiers exist that can classify linearly separable data such as logistic regression, Support Vector Machines can handle highly non-linear problems using a kernel trick which implicitly maps the input vectors to higher-dimensional feature spaces. The transformation rearranges the dataset in such a way that it is then linearly solvable. In this article we are going to look at how SVM works, learn about kernel functions, hyperparameters and pros and cons of SVM along with some of the real life applications of SVM. Support Vector Machines (SVMs), also known as support vector networks, are a family of extremely powerful models which use method based learning and can be used in classification and regression problems. They aim at finding decision boundaries that separate observations with differing class memberships. In other words, SVM is a discriminative classifier formally defined by a separating hyperplane.Method Based Learning There are several learning models namely:Association rules basedEnsemble method basedDeep Learning basedClustering method basedRegression Analysis basedBayesian method basedDimensionality reduction based Instance basedKernel method basedLet us understand what Kernel method based learning is all about.In simple terms, a kernel is a similarity function which is fed into a machine learning algorithm. It accepts two inputs and suggests the similarity. For example, suppose we want to classify images, the input data is a key-value pair (image, label). The image data is taken into consideration, features are computed, and a vector of features are fed into the Machine learning algorithm. But in the case of similarity functions, a kernel function can be defined which internally computes the similarity between images, and then feeds into the learning algorithm along with the images and label data. The outcome of this is a classifier. Perceptron frameworks or Support vector machines work with kernels and use vectors only. 
Here, the machine learning algorithms are expressed as dot products so that kernel functions can be used.

Kernels are generally preferred over explicit feature vectors. One key reason is ease of computation; another is that feature vectors need more storage space than dot products. You can write machine learning algorithms to use dot products and later map them to use kernels. This completely avoids the usage of feature vectors. It allows us to work with highly complex, efficient-to-compute, and yet high-performing kernels effortlessly, without really developing multi-dimensional vectors.

Kernel functions

Let us understand what kernel functions are. The figure shown below uses a simple 1-dimensional example. Assume the given points are as shown: a separator in 1D is a vertical line, and no single vertical line will separate the dataset.

Now, if we consider a 2-dimensional representation, as shown in the figure below, there is a hyperplane (an arbitrary line in 2 dimensions) which separates the red and blue points, so they can be separated using Support Vector Machines.

As we keep increasing the dimensionality of the space, it becomes easier to separate the data. The mapping x -> (x, x^2) used here is the feature map; the kernel function computes the dot product between mapped points without constructing them explicitly. As the dimensionality grows, the computations become more complex, and the kernel trick needs to be applied to carry out these computations cheaply.

What is Support Vector Machine?

Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges. However, it is mostly used in classification problems. In this algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.
After that, we perform classification by locating the hyperplane which differentiates the two classes.

Let us create a dataset to understand support vector classification:

    # importing make_blobs from scikit-learn, plus NumPy and Matplotlib for plotting
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_blobs

    # creating dataset X containing n_samples points
    # and Y containing the two class labels
    X, Y = make_blobs(n_samples=500, centers=2, random_state=0, cluster_std=0.40)

    # plotting scatters
    plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')
    plt.show()

Support vector machine is based on the concept of decision planes that define decision boundaries. A decision plane is one that separates a set of objects with different class memberships. For example, in the figure mentioned below, there are objects which belong to either class Green or Red. The separating line defines a boundary on the right side of which all objects are Green and to the left of which all objects are Red. Any new object (white circle) falling to the right is labeled, i.e., classified, as Green (or classified as Red should it fall to the left of the separating line).

Support vector machines not only draw a line between two classes, but consider a region about the line of some given width. Here’s an example of what it can look like:

    # creating line space between -1 and 3.5
    xfit = np.linspace(-1, 3.5)

    # plotting scatter
    plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')

    # plot candidate lines between the two sets of data, each with its own margin width d
    for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
        yfit = m * xfit + b
        plt.plot(xfit, yfit, '-k')
        plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none', color='#AAAAAA', alpha=0.4)

    plt.xlim(-1, 3.5)
    plt.show()

In another scenario, it is clear that a full separation of the Green and Red objects would require a curve (which is more complex than a line).
Classification tasks based on drawing separating lines to distinguish between objects of different class memberships are known as hyperplane classifiers. Support Vector Machines are particularly suited to handle such tasks.

The figure below shows the basic idea behind Support Vector Machines. Here you will see that the original objects (left side of the schematic) are mapped, i.e. rearranged, using a set of mathematical functions called kernels. This process of rearranging objects is known as mapping or transformation. You will notice that the right side of the schematic is linearly separable. All we need to do is find an optimal line that will separate the red and green objects.

What is a hyperplane?

The goal of Support Vector Machine is to find the hyperplane which separates these two objects or classes. Let us consider another figure which shows some of the possible hyperplanes which can help in separating or dividing the dataset. Choosing the best hyperplane is the goal. The best hyperplane is the one that leaves the maximum margin for both classes. The margin is the distance between the hyperplane and the closest point in the classification.

Let us consider two hyperplanes among all of them and check the margins, represented by M1 and M2. You will notice that margin M1 > M2, so the hyperplane that separates best is the new plane between the green and blue planes.

How do we find the right hyperplane?

Now, let us represent the new plane by a linear equation:

f(x) = ax + b

Let us consider that this equation delivers values ≥ 1 for the green triangle class and ≤ -1 for the gold star class. The distance of this plane from the closest points in both the classes is at least one; for the closest points the modulus is one.

f(x) ≥ 1 for triangles and f(x) ≤ -1 for stars, with |f(x)| = 1 for the closest points of each class

The distance between the hyperplane and the point can be computed using the following equation.
M1 = |f(x)| / ||a|| = 1 / ||a||

The total margin is 1 / ||a|| + 1 / ||a|| = 2 / ||a||. In order to maximize the separability, we will have to minimize the ||a|| value. The vector a is known as the weight vector. Minimizing its norm subject to the constraints above is a non-linear optimization task. One of the methods is to use the Karush-Kuhn-Tucker (KKT) conditions, using the Lagrange multipliers λi.

What is a support vector in SVM?

Let's take an example of two points in the two attributes X and Y. We need to find a boundary between these two points that has the maximum distance from both of them. This requirement is represented in the graph depicted next. The optimal point is depicted using the red circle.

The maximum margin weight vector is parallel to the line from (1, 1) to (2, 3), i.e. parallel to (1, 2). The decision boundary is perpendicular to this line, lies halfway between the two points, and passes through (1.5, 2). So, the boundary is x1 + 2x2 − 5.5 = 0 and the geometric margin is computed as √5.

Following are the steps to compute SVMs. With w = (a, 2a), the conditions for the points (1,1) and (2,3) can be represented as shown here:

a + 2a + ω0 = -1 for the point (1,1)
2a + 6a + ω0 = 1 for the point (2,3)

The weights can be computed as follows: subtracting the first equation from the second gives 5a = 2, so a = 2/5 and ω0 = −11/5.

These are the support vectors: (1, 1) and (2, 3).

Lastly, the final equation is as follows: f(x) = (2/5)x1 + (4/5)x2 − 11/5.

Large Margin Intuition

In logistic regression, the output of the linear function is taken and the value is squashed within the range [0,1] using the sigmoid function. If the value is greater than a threshold value, say 0.5, label 1 is assigned, else label 0. In the case of support vector machines, the output of the linear function is taken, and if it is greater than or equal to 1 we identify it with one class, while if it is less than or equal to -1 it is identified with the other class. Since the threshold values are changed to 1 and -1 in SVM, we obtain this reinforcement range of values ([-1,1]) which acts as the margin.
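The two-point example above can be checked numerically. The following is a small sketch using only the Python standard library; the variable names are chosen here for illustration, and the labels -1 and +1 for the points (1, 1) and (2, 3) follow the two constraint equations in the text.

```python
import math

# The two points from the example, labelled -1 and +1 respectively.
p_neg, p_pos = (1.0, 1.0), (2.0, 3.0)

# With w = (a, 2a), the constraints are
#   a*1 + 2a*1 + w0 = -1   for (1, 1)
#   a*2 + 2a*3 + w0 = +1   for (2, 3)
# Subtracting the first from the second leaves 5a = 2.
a = 2 / 5
w = (a, 2 * a)
w0 = -1 - (w[0] * p_neg[0] + w[1] * p_neg[1])


def f(point):
    """Decision function f(x) = w . x + w0."""
    return w[0] * point[0] + w[1] * point[1] + w0


norm_w = math.hypot(*w)   # ||w||
margin = 2 / norm_w       # total margin between the two classes

print("w =", w, "w0 =", round(w0, 4))
print("f(1,1) =", round(f(p_neg), 4), " f(2,3) =", round(f(p_pos), 4))
print("f(1.5,2) =", round(f((1.5, 2.0)), 4))
print("margin =", round(margin, 4), " sqrt(5) =", round(math.sqrt(5), 4))
```

Both points sit exactly on the margin (f = -1 and f = +1), the midpoint (1.5, 2) lies on the decision boundary, and the margin 2 / ||w|| equals the distance between the two points, √5. Multiplying f(x) by 5/2 gives the boundary in the form x1 + 2x2 − 5.5 = 0 quoted in the text.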
Cost Function and Gradient Updates

In the SVM algorithm, we maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is called the hinge loss. In its standard form, for a label y in {-1, +1} and prediction f(x):

c(x, y, f(x)) = 0 if y · f(x) ≥ 1, otherwise 1 − y · f(x)

Figure: Hinge loss function (the function on the left can be represented as the function on the right)

If the predicted value and the actual value are of the same sign and the point lies on or outside the margin (y · f(x) ≥ 1), the cost is 0. If not, we calculate the loss value. We also add a regularization parameter to the cost function. The objective of the regularization parameter is to balance margin maximization and loss. After adding the regularization parameter, the cost function looks as below.

Figure: Loss function for SVM

Now that we have the loss function, we take partial derivatives with respect to the weights to find the gradients. Using the gradients, we can update our weights.

Figure: Gradients

When there is no misclassification, i.e. our model correctly predicts the class of our data point, we only have to update the weights using the gradient from the regularization term.

Figure: Gradient update (no misclassification)

When there is a misclassification, i.e. our model makes a mistake on the prediction of the class of our data point, we include the loss along with the regularization term to perform the gradient update.

Figure: Gradient update (misclassification)

Let us start with the code and import the necessary libraries:

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import cross_val_score, GridSearchCV
    from sklearn import metrics
    from sklearn.preprocessing import MinMaxScaler
    pd.set_option('display.max_columns', None)

Read the Wisconsin Breast Cancer dataset from the current directory into an object 'data' using the pandas.read_csv function:

    data = pd.read_csv('wisconsin.csv')

After reading the data, we have prepared the data as per requirement. Feature scaling is a method used to standardize the range of independent variables or features of data.
The min-max scaling (or min-max normalization) shrinks the range of each feature so that it lies between 0 and 1 (or -1 to 1 if there are negative values).

    # predictor (feature matrix) and target (labels) come from the data preparation step
    sclr = MinMaxScaler()
    predictor_sc = sclr.fit_transform(predictor)
    predictor_sc.shape

Split the scaled data into train and test sets:

    x_train_sc, x_test_sc, y_train, y_test = train_test_split(predictor_sc, target, test_size=0.30, random_state=101)
    print("Scaled train and test split")
    print("x_train ", x_train_sc.shape)
    print("x_test ", x_test_sc.shape)
    print("y_train ", y_train.shape)
    print("y_test ", y_test.shape)

    Scaled train and test split
    x_train  (398, 30)
    x_test  (171, 30)
    y_train  (398,)
    y_test  (171,)

But what happens when there is no clear hyperplane?

Support Vector Machines can help you find a separating hyperplane, but only if it exists. There are certain cases when it is not possible to define a hyperplane. This can happen due to noise in the data. Another possible reason is a non-linear boundary. The first graph below depicts noise and the next one shows a non-linear boundary.

For problems which arise due to noise in the data, the best way is to reduce the margin itself and introduce slack.

The non-linear boundary problem can be solved if we introduce a kernel. Some of the kernel functions that can be introduced are mentioned below:

A radial basis function is a real-valued function whose value is dependent on the distance between the input and some fixed point.
In machine learning, the radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms. The RBF kernel on two samples x and x', represented as feature vectors in some input space, is defined as:

K(x, x') = exp(-||x - x'||² / (2σ²))

Applying SVM with default hyperparameters

Let us get back to the example and apply SVM with default hyperparameters after data pre-processing.

Linear Kernel

    from sklearn import svm
    svm2 = svm.SVC(kernel='linear')
    svm2

    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)

    model2 = svm2.fit(x_train_sc, y_train)
    y_pred2 = svm2.predict(x_test_sc)
    print('Accuracy Score:')
    print(metrics.accuracy_score(y_test, y_pred2))

    Accuracy Score:
    0.9707602339181286

Gaussian Kernel

    svm3 = svm.SVC(kernel='rbf')
    svm3

    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)

    model3 = svm3.fit(x_train_sc, y_train)
    y_pred3 = svm3.predict(x_test_sc)
    print('Accuracy Score:')
    print(metrics.accuracy_score(y_test, y_pred3))

    Accuracy Score:
    0.935672514619883

Polynomial Kernel

    svm4 = svm.SVC(kernel='poly')
    svm4

    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)

    model4 = svm4.fit(x_train_sc, y_train)
    y_pred4 = svm4.predict(x_test_sc)
    print('Accuracy Score:')
    print(metrics.accuracy_score(y_test, y_pred4))

    Accuracy Score:
    0.6198830409356725

How to tune Parameters of SVM?

Kernel: The kernel in a support vector machine is responsible for transforming the input data into the required format.
Some of the kernels used in support vector machines are linear, polynomial and radial basis function (RBF). To create a non-linear decision boundary, we use the RBF and polynomial kernels; for complex applications, you should use more advanced kernels to separate classes that are non-linear in nature. With this transformation, you can obtain accurate classifiers.

Regularization: Regularization is controlled through scikit-learn's C parameter. C is a penalty parameter on misclassification error. It tells the optimizer how much error is tolerable and so controls the trade-off between misclassified points and the width of the margin around the decision boundary. With a smaller C value you obtain a hyperplane with a larger margin, and with a larger C value you obtain a hyperplane with a smaller margin.

Gamma: A lower value of gamma creates a looser fit of the training dataset, whereas a high value of gamma fits the training data more tightly. With a high value of gamma, only the points close to the decision boundary influence the separating plane. With a low value of gamma, points far away from it are considered as well.

Do we need to tune parameters always?

You do not need to tune parameters in all cases. There are built-in functions in the sklearn toolkit which can be used.

Tuning Hyperparameters

The 'C' and 'gamma' hyperparameters

C is the parameter for the soft margin cost function, which controls the influence of each individual support vector. This process involves trading error penalty for stability. A small C tends to emphasize the margin while ignoring the outliers in the training data (soft margin), while a large C may tend to overfit the training data (hard margin).
Thus a very large value of C can cause the model to overfit, and a very small value of C can cause it to underfit. The value of C must therefore be chosen so that the model generalises well to unseen data.

The gamma parameter is inversely related to the variance of the RBF kernel (Gaussian function), gamma = 1 / (2σ²), and the kernel is used as a similarity measure between two points. A small gamma value defines a Gaussian function with a large variance. In this case, two points can be considered similar even if they are far from each other. On the other hand, a large gamma value defines a Gaussian function with a small variance, and in this case two points are considered similar only if they are close to each other.

Taking kernel as linear and tuning C hyperparameter

    C_range = list(range(1, 26))
    acc_score = []
    for c in C_range:
        svc = svm.SVC(kernel='linear', C=c)
        scores = cross_val_score(svc, predictor_sc, target, cv=10, scoring='accuracy')
        acc_score.append(scores.mean())
    print(acc_score)

    [0.9772210699161695, 0.9772210699161695, 0.9806995938121164, 0.9824539797770286, 0.9789754558810818, 0.9789452078472042, 0.9806995938121164, 0.9789452078472041, 0.9789452078472041, 0.9789452078472041, 0.9806995938121164, 0.9789452078472041, 0.9789452078472041, 0.9772210699161695, 0.9772210699161695, 0.9772210699161695, 0.9772210699161695, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574, 0.9754666839512574]

Let us visualize the above points:

    import matplotlib.pyplot as plt
    %matplotlib inline
    C_Val_list = list(range(1, 26))
    plt.plot(C_Val_list, acc_score)
    plt.xticks(np.arange(0, 27, 2))
    plt.xlabel('Value of C for SVC')
    plt.ylabel('Cross-Validated Accuracy')

From the plot we can see that the accuracy peaks at a little over 98% somewhere between C=4 and C=5 and then drops.

    # Taking a close look at the cross-validation accuracy in the range C(4,5)
    C_range = list(np.arange(4, 5, 0.2))
    acc_score = []
    for c in C_range:
        svc = svm.SVC(kernel='linear', C=c)
        scores = cross_val_score(svc, predictor_sc, target, cv=10, scoring='accuracy')
        acc_score.append(scores.mean())
    print(acc_score)

    [0.9824539797770286, 0.9806995938121164, 0.9789754558810818, 0.9789754558810818, 0.9789754558810818]

The accuracy score is highest for C=4.

Taking kernel as gaussian and tuning gamma hyperparameter

    gamma_range = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
    acc_score = []
    for g in gamma_range:
        svc = svm.SVC(kernel='rbf', gamma=g)
        scores = cross_val_score(svc, predictor_sc, target, cv=10, scoring='accuracy')
        acc_score.append(scores.mean())
    print(acc_score)

    [0.6274274047186933, 0.6274274047186933, 0.9195035001296346, 0.9561651974764496, 0.9806995938121164, 0.9420026359000951, 0.6274274047186933]

Let us visualize the above points:

    gamma_range = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]
    # plotting the value of gamma for SVM versus the cross-validated accuracy
    plt.plot(gamma_range, acc_score)
    plt.xlabel('Value of gamma for SVC ')
    plt.xticks(np.arange(0.0001, 100, 5))
    plt.ylabel('Cross-Validated Accuracy')

    Text(0,0.5,'Cross-Validated Accuracy')

For gamma between 5 and 100 the kernel performs very poorly. Let us take a closer look at the cross-validated accuracy for gamma values between 0 and 5.

    gamma_range = list(np.arange(0.1, 5, 0.1))
    acc_score = []
    for g in gamma_range:
        svc = svm.SVC(kernel='rbf', gamma=g)
        scores = cross_val_score(svc, predictor_sc, target, cv=10, scoring='accuracy')
        acc_score.append(scores.mean())
    print(acc_score)

    [0.9561651974764496, 0.9718952553798289, 0.9754051075965776, 0.9737122979863452, 0.9806995938121164, 0.9806995938121164, 0.9806995938121164, 0.9806995938121164, 0.9806995938121164, 0.9806995938121164, 0.9789754558810818, 0.9754969319851352, 0.9754969319851352, 0.9754969319851352, 0.9754969319851352, 0.9737727940541007, 0.9737727940541007, 0.9737727940541007, 0.9737727940541007, 0.9720184080891883, 0.9720184080891883, 0.9720184080891883, 0.9720184080891883, 0.9720184080891883, 0.9720184080891883, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9702326938034741, 0.9666925935528475, 0.9666925935528475, 0.9684167314838821, 0.9684167314838821, 0.9684167314838821, 0.9701711174487941, 0.9701711174487941, 0.96838540316308, 0.9649068792671333, 0.9649068792671333, 0.9649068792671333, 0.9649068792671333, 0.9649068792671333, 0.9649068792671333, 0.963152493302221, 0.963152493302221]

    gamma_range = list(np.arange(0.1, 5, 0.1))
    plt.plot(gamma_range, acc_score)
    plt.xlabel('Value of gamma for SVC ')
    #plt.xticks(np.arange(0.0001,5,5))
    plt.ylabel('Cross-Validated Accuracy')

    Text(0,0.5,'Cross-Validated Accuracy')

The cross-validated accuracy for the rbf kernel is highest, and remains constant, between gamma=0.5 and gamma=1.

Taking polynomial kernel and tuning degree hyperparameter

    degree = [2, 3, 4, 5, 6]
    acc_score = []
    for d in degree:
        svc = svm.SVC(kernel='poly', degree=d)
        scores = cross_val_score(svc, predictor_sc, target, cv=10, scoring='accuracy')
        acc_score.append(scores.mean())
    print(acc_score)

    [0.8350974418805635, 0.6450652493302222, 0.6274274047186933, 0.6274274047186933, 0.6274274047186933]

    plt.plot(degree, acc_score)
    plt.xlabel('degrees for SVC ')
    plt.ylabel('Cross-Validated Accuracy')

    Text(0,0.5,'Cross-Validated Accuracy')

The score is highest for the second-degree polynomial, and the accuracy drops as the degree of the polynomial increases. Increasing the polynomial degree makes the model more complex without improving its accuracy here.

Advantages and Disadvantages of Support Vector Machine

Advantages of SVM

SVM classifiers offer good accuracy and perform faster prediction compared to the Naïve Bayes algorithm.
SVM guarantees optimality: because the problem is a convex optimization, the solution will always be a global minimum, not a local minimum.
SVM can be accessed conveniently, be it from Python or Matlab.
SVM can be used for both linearly separable as well as non-linearly separable data.
Linearly separable data calls for a hard margin, whereas non-linearly separable data calls for a soft margin.
SVM also extends to semi-supervised learning, where it can be applied to both labelled and unlabelled data. It only requires an additional condition on the minimization problem; this variant is known as the Transductive SVM.
Feature mapping used to add considerably to the computational cost of training a model. With the help of the kernel trick, SVM can carry out the feature mapping using a simple dot product.
SVM works well with a clear margin of separation and in high-dimensional spaces.

Disadvantages of SVM

SVM does not handle text structure directly. Sequential information is lost, which leads to poor performance.
SVM is not suitable for large datasets because of its high training time; it also takes more time to train than Naïve Bayes.
SVM works poorly with overlapping classes and is also sensitive to the type of kernel used.
In cases where the number of features for each data point exceeds the number of training data samples, the SVM underperforms.

Applications of SVM in Real World

Support vector machines are supervised learning algorithms. The main goal of using SVM is to classify unseen data correctly. SVMs can be used to solve various real-world problems:

Face detection – SVM can be used to classify parts of the image as face and non-face and create a square boundary around the face.
Text and hypertext categorization – SVM allows text and hypertext categorization for both inductive and transductive models. It uses training data to classify documents into different categories. It categorizes based on the score generated, which is then compared with a threshold value.
Classification of images – SVMs enhance search accuracy for image classification. In comparison to traditional query-based searching techniques, SVM provides better accuracy.
Bioinformatics – This includes classification of proteins and classification of cancer. SVM is used for classifying genes, classifying patients on the basis of their genes, and other biological problems.
Protein fold and remote homology detection – SVM algorithms are applied for protein remote homology detection.
Handwriting recognition – SVMs are widely used to recognize handwritten characters.
Generalized predictive control (GPC) – You can use SVM-based GPC to control chaotic dynamics with useful parameters.

Summary

In this article, we looked at the Support Vector Machine algorithm in detail. We discussed the concept behind support vector machines, how they work, and how to implement them in Python. We also looked into how to tune their parameters to build efficient models. Lastly, we covered the advantages and disadvantages of SVM along with various real-world applications of support vector machines.
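As a supplement to the tuning section, the manual loops over C and gamma can also be expressed with GridSearchCV, which the article imports but does not use. The sketch below rests on several assumptions that are not in the original: it loads scikit-learn's bundled breast cancer data with load_breast_cancer as a stand-in for 'wisconsin.csv', it wraps MinMaxScaler and SVC in a Pipeline so that scaling is fitted inside each cross-validation fold, and the grid values and cv=5 are chosen only to keep the run short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for the article's 'wisconsin.csv'.
X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101
)

# Scale and classify in one estimator so each CV fold fits its own scaler.
pipe = Pipeline([("scale", MinMaxScaler()), ("svc", SVC())])

# Hypothetical grid: one sub-grid per kernel, since gamma only applies to rbf.
param_grid = [
    {"svc__kernel": ["linear"], "svc__C": [1, 4, 10]},
    {"svc__kernel": ["rbf"], "svc__C": [1, 4, 10], "svc__gamma": [0.1, 0.5, 1.0]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(x_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(x_test, y_test)
print(best_params)
print(test_accuracy)
```

Because the search runs only on the training split, the held-out test accuracy is not used for model selection, which differs from the cross-validation over the full scaled data shown in the article.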

What is LDA: Linear Discriminant Analysis for Machine Learning

Linear Discriminant Analysis, or LDA, is a dimensionality reduction technique. It is used as a pre-processing step in machine learning and in pattern classification applications. The goal of LDA is to project the features in a higher-dimensional space onto a lower-dimensional space in order to avoid the curse of dimensionality and also reduce resources and dimensional costs.

The original technique was developed in 1936 by Ronald A. Fisher and was named Linear Discriminant or Fisher's Discriminant Analysis. The original Linear Discriminant was described as a two-class technique. The multi-class version was later generalized by C. R. Rao as Multiple Discriminant Analysis. They are all simply referred to as Linear Discriminant Analysis.

LDA is a supervised classification technique that is considered a part of crafting competitive machine learning models. This category of dimensionality reduction is used in areas like image recognition and predictive analysis in marketing.

What is Dimensionality Reduction?

Dimensionality reduction techniques are important in applications of machine learning, data mining, bioinformatics, and information retrieval. The main aim is to remove redundant and dependent features by transforming the dataset onto a lower-dimensional space.

In simple terms, they reduce the dimensions (i.e. variables) in a particular dataset while retaining most of the information.

Multi-dimensional data comprises multiple features that are correlated with one another. With dimensionality reduction, you can plot multi-dimensional data in just 2 or 3 dimensions. It allows the data to be presented in an explicit manner which can be easily understood by a layman.

What are the limitations of Logistic Regression?

Logistic Regression is a simple and powerful linear classification algorithm. However, it has some disadvantages which have led to alternative classification algorithms like LDA.
Some of the limitations of Logistic Regression are as follows:

Two-class problems – Logistic Regression is traditionally used for two-class (binary) classification problems. Though it can be extended to multi-class classification, this is rarely done. Linear Discriminant Analysis, on the other hand, is considered a better choice whenever multi-class classification is required; for binary classification, both logistic regression and LDA are applied.
Unstable with well-separated classes – Logistic Regression can lack stability when the classes are well-separated. This is where LDA comes in.
Unstable with few examples – If there are few examples from which the parameters are to be estimated, logistic regression becomes unstable. Linear Discriminant Analysis is a better option because it tends to be stable even in such cases.

How to have a practical approach to an LDA model?

Consider a situation where you have plotted the relationship between two variables, with each color representing a different class: one is shown in red and the other in blue.

If you want to reduce the number of dimensions to 1, you can just project everything onto the x-axis as shown below:

[Figure: projection of both classes onto the x-axis]

This approach neglects any helpful information provided by the second feature. However, you can use LDA to plot it. The advantage of LDA is that it uses information from both features to create a new axis, which in turn minimizes the variance within each class and maximizes the distance between the two classes.

How does LDA work?

LDA focuses primarily on projecting the features in a higher-dimensional space to lower dimensions. You can achieve this in three steps:

Firstly, calculate the separability between classes, which is the distance between the means of the different classes. This is called the between-class variance.
Secondly, calculate the distance between the mean and the samples of each class.
It is also called the within-class variance.
Finally, construct the lower-dimensional space which maximizes the between-class variance and minimizes the within-class variance. P is the lower-dimensional space projection; this objective is also called Fisher's criterion.

How are LDA models represented?

The representation of LDA is pretty straightforward. The model consists of the statistical properties of your data, calculated for each class. In the case of multiple variables, the same properties are calculated over the multivariate Gaussian: these are the means and the covariance matrix.

Predictions are made by plugging the statistical properties into the LDA equation. The properties are estimated from your data. Finally, the model values are saved to file to create the LDA model.

How do LDA models learn?

An LDA model makes the following assumptions about your data:

Each variable in the data is shaped like a bell curve when plotted, i.e. it is Gaussian.
The values of each variable vary around the mean by the same amount on average, i.e. each attribute has the same variance.

With the help of these assumptions, the LDA model is able to estimate the mean and variance from your data for each class.

The mean value of each input for each of the classes is calculated by dividing the sum of values by the total number of values:

Mean = Sum(x) / Nk

where
Mean = mean value of x for class k.
Nk = number of instances in class k.
Sum(x) = sum of the values of input x over the instances in class k.

The variance is computed across all the classes as the average of the squared difference of each value from the mean:

Σ² = Sum((x - M)²) / (N - k)

where
Σ² = variance across all inputs x.
N = number of instances.
k = number of classes.
Sum((x - M)²) = sum of the values of all (x - M)².
M = mean for input x (the mean of the class that x belongs to).

How does an LDA model make predictions?

LDA models use Bayes' Theorem to estimate probabilities.
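Before looking at predictions in detail, here is a minimal sketch of the two estimates defined in the previous section, the per-class mean and the variance pooled across classes with the N - k denominator. It covers a single input variable only. The function names class_means and pooled_variance and the toy values are hypothetical, chosen purely for illustration.

```python
import numpy as np

def class_means(x, y):
    """Mean of x within each class: Sum(x) / Nk."""
    return {int(k): float(x[y == k].mean()) for k in np.unique(y)}

def pooled_variance(x, y):
    """Sum((x - M)^2) / (N - k), where M is the mean of each value's own class."""
    classes = np.unique(y)
    means = class_means(x, y)
    squared_dev = sum(((x[y == k] - means[int(k)]) ** 2).sum() for k in classes)
    return float(squared_dev / (len(x) - len(classes)))

# Toy single-feature data with two classes of three instances each.
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 1, 1, 1])

means = class_means(x, y)          # class 0 -> 2.0, class 1 -> 7.0
variance = pooled_variance(x, y)   # (2 + 2) / (6 - 2) = 1.0
print(means, variance)
```

Each class contributes its own squared deviations to a single shared variance, which reflects the assumption that every class varies around its mean by the same amount.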
They make predictions based upon the probability that a new input belongs to each class. The class with the highest probability is taken as the output class, and this is the prediction the LDA model makes.

The prediction is made simply by the use of Bayes' Theorem, which estimates the probability of the output class given the input. It makes use of the probability of each class and the probability of the data belonging to each class:

P(Y=k|X=x) = (PIk * fk(x)) / sum(PIl * fl(x))

where
x = input.
k = output class.
PIk = Nk/n, the base probability of each class observed in the training data. It is also called the prior probability in Bayes' Theorem.
fk(x) = estimated probability of x belonging to class k.
The sum in the denominator runs over all classes l.

A Gaussian distribution function is used for f(x). Plugging it into the equation above and simplifying gives the following equation:

Dk(x) = x * (mean / Σ²) - (mean² / (2 * Σ²)) + ln(PIk)

Dk(x) is called the discriminant function for class k given input x, where mean is the mean of x for class k. The mean, Σ² and PIk are all estimated from the data, and the class with the largest value of Dk(x) is taken as the output classification.

How to prepare data for LDA?

Some suggestions you should keep in mind while preparing your data to build your LDA model:

LDA is mainly used in classification problems where you have a categorical output variable. It allows both binary classification and multi-class classification.
The standard LDA model assumes a Gaussian distribution of the input variables. You should check the univariate distribution of each attribute and transform it into a more Gaussian-looking distribution.
For example, for an exponential distribution use the log and root functions, and for skewed distributions use Box-Cox.
Outliers can skew the primitive statistics used to separate classes in LDA, so it is preferable to remove them.
Since LDA assumes that each input variable has the same variance, it is always better to standardize your data before using an LDA model, so that it has a mean of 0 and a standard deviation of 1.

How to implement an LDA model from scratch?

You can implement a Linear Discriminant Analysis model from scratch using Python. Let's start by importing the libraries that are required for the model:

    from sklearn.datasets import load_wine
    import pandas as pd
    import numpy as np
    np.set_printoptions(precision=4)
    from matplotlib import pyplot as plt
    import seaborn as sns
    sns.set()
    from sklearn.preprocessing import LabelEncoder
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix

We will work with the wine dataset, which is available from the UCI machine learning repository. The scikit-learn library in Python provides a wrapper function for loading it:

    wine_info = load_wine()
    X = pd.DataFrame(wine_info.data, columns=wine_info.feature_names)
    y = pd.Categorical.from_codes(wine_info.target, wine_info.target_names)

The wine dataset comprises 178 rows of 13 columns each:

    X.shape

    (178, 13)

The attributes of the wine dataset comprise various characteristics such as the alcohol content of the wine, magnesium content, color intensity, hue and many more:

    X.head()

The wine dataset contains three different kinds of wine:

    wine_info.target_names

    array(['class_0', 'class_1', 'class_2'], dtype='