What is Logistic Regression in Machine Learning

Every machine learning algorithm performs best under a given set of conditions. To ensure good performance, we must know which algorithm to use depending on the problem at hand; you cannot just use one particular algorithm for all problems. For example, the linear regression algorithm cannot be applied to a categorical dependent variable. This is where Logistic Regression comes in.

Logistic Regression is a popular statistical model used for binary classification, that is, for predictions of the type this or that, yes or no, A or B, and so on. Logistic regression can also be extended to multiclass classification, but here we will focus on its simplest application. It is one of the most frequently used machine learning algorithms for binary classification, mapping the input to an output of 0 or 1. For example,

  • 0: negative class
  • 1: positive class

Some examples of classification are mentioned below:

  • Email: spam / not spam
  • Online transactions: fraudulent / not fraudulent
  • Tumor: malignant / not malignant

Let us look at the issues we encounter in Linear Regression.

Issue 1 of Linear Regression

As you can see in the graph below, the prediction would miss malignant tumors, because the fitted line becomes less steep once an additional data point is added at the extreme right.

[Figure: Linear Regression issue - the fitted line flattened by an extreme data point]

Issue 2 of Linear Regression

  • The hypothesis can output values larger than 1 or smaller than 0
  • Hence, we use logistic regression, whose output is bounded between 0 and 1

What is Logistic Regression?

Logistic Regression is the appropriate regression analysis to conduct when the dependent variable is binary. Like all other types of regression, Logistic Regression is a predictive technique. It is used to evaluate the relationship between one dependent binary variable and one or more independent variables. It outputs a probability between 0 and 1, which is then mapped to one of the two discrete classes.

A simple example of Logistic Regression is: Does calorie intake, weather, and age have any influence on the risk of having a heart attack? The question can have a discrete answer, either “yes” or “no”.

Logistic Regression Hypothesis

The logistic regression classifier can be derived by analogy to the linear regression hypothesis which is:

hθ(x) = θᵀx = θ0 + θ1x1 + θ2x2 + … + θnxn   (linear regression hypothesis)

However, the logistic regression hypothesis generalizes from the linear regression hypothesis in that it uses the logistic function:

g(z) = 1 / (1 + e^(-z))   (logistic function)

The result is the logistic regression hypothesis:

hθ(x) = g(θᵀx) = 1 / (1 + e^(-θᵀx))   (logistic regression hypothesis)

The function g(z) is the logistic function, also known as the sigmoid function.

The logistic function has asymptotes at 0 and 1, and it crosses the y-axis at 0.5.

[Figure: the logistic (sigmoid) function, with asymptotes at 0 and 1 and crossing the y-axis at 0.5]

How does Logistic Regression work?

Logistic Regression uses a more complex hypothesis than Linear Regression: instead of a plain linear function, it passes the linear combination of the inputs through the ‘sigmoid function’, also known as the ‘logistic function’.

The hypothesis of logistic regression limits its output to values between 0 and 1. A linear function fails to do this, since it can produce values greater than 1 or less than 0, which is not possible under the logistic regression hypothesis.

The sigmoid function maps any real value to a value between 0 and 1. In machine learning, we use the sigmoid to map predictions to probabilities.

Formula:

f(x) = 1 / (1 + e^(-x))

Where,

f(x) = output between 0 and 1 (probability estimate)
x = input to the function
e = base of natural log
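
As a quick illustration, here is a minimal NumPy sketch of the sigmoid function (the function and variable names are illustrative, not part of any library API):

import numpy as np

def sigmoid(z):
    # Maps any real value to the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))     # 0.5, where the curve crosses the y-axis
print(sigmoid(10))    # close to 1
print(sigmoid(-10))   # close to 0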

Decision Boundary

The prediction function returns a probability score between 0 and 1. If you want to map this to a discrete class (true/false, yes/no), you have to select a threshold value, above which observations are classified into class 1 and below which they are classified into class 0.

p ≥ 0.5, class = 1
p < 0.5, class = 0

For example, suppose the threshold value is 0.5 and your prediction function returns 0.7, it will be classified as positive. If your predicted value is 0.2, which is less than the threshold value, it will be classified as negative. For logistic regression with multiple classes we could select the class with the highest predicted probability.
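
A minimal sketch of this thresholding step, assuming a NumPy array of predicted probabilities:

import numpy as np

probabilities = np.array([0.7, 0.2, 0.5, 0.91])
threshold = 0.5
predicted_class = (probabilities >= threshold).astype(int)
print(predicted_class)   # [1 0 1 1]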

[Figure: decision boundary at the 0.5 threshold]

Our aim should be to maximize the likelihood that a random data point gets classified correctly, which is called Maximum Likelihood Estimation. Maximum Likelihood Estimation is a general approach to estimating parameters in statistical models. The likelihood can be maximized using an optimization algorithm. Newton’s Method is one such algorithm which can be used to find maximum (or minimum) of many different functions, including the likelihood function. Other than Newton’s Method, you can also use Gradient Descent.

Cost Function

We covered the cost function earlier in the blog on Linear Regression. In brief, a cost function is created for optimization purposes, so that we can minimize it and build a model with minimum error.

The cost function for Logistic Regression is:

  • Cost(hθ(x),y) = −log(hθ(x))   if y = 1
  • Cost(hθ(x),y) = −log(1−hθ(x))   if y = 0

The above functions can be written together as:

Cost(hθ(x), y) = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))
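
The overall cost J(θ) is the average of this per-example cost over the m training examples. A minimal sketch, assuming h holds the predicted probabilities hθ(x) and y holds the 0/1 labels:

import numpy as np

def logistic_cost(h, y):
    # h: predicted probabilities h_theta(x); y: true labels (0 or 1)
    m = len(y)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))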

Gradient Descent

After finding out the cost function for Logistic Regression, our job should be to minimize it i.e. min J(θ). The cost function can be reduced by using Gradient Descent.

The general form of gradient descent:

Repeat until convergence:  θj := θj − α · ∂J(θ)/∂θj   (updating all j simultaneously)

The derivative part can be solved using calculus so the equation comes to:

Repeat until convergence:  θj := θj − (α/m) · Σ i=1..m (hθ(x(i)) − y(i)) · xj(i)
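
Putting the hypothesis, cost and update rule together, here is a minimal gradient descent sketch for logistic regression (illustrative only; X is assumed to already include a column of ones for the intercept):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)            # current predictions
        gradient = (X.T @ (h - y)) / m    # derivative of the cost J(theta)
        theta -= alpha * gradient         # simultaneous update of all theta_j
    return theta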

When to use Logistic Regression?

Logistic Regression is used when the input needs to be separated into “two regions” by a linear boundary. The data points are separated by a straight line (a linear decision boundary), as shown:

[Figure: data points separated into two regions by a linear decision boundary]

Based on the number of categories, Logistic regression can be classified as:

  1. binomial: target variable can have only 2 possible types: “0” or “1” which may represent “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc.
  2. multinomial: target variable can have 3 or more possible types which are not ordered(i.e. types have no quantitative significance) like “disease A” vs “disease B” vs “disease C”.
  3. ordinal: it deals with target variables with ordered categories. For example, a test score can be categorized as: “very poor”, “poor”, “good”, “very good”. Here, each category can be given a score like 0, 1, 2, 3.

Let us explore the simplest form of Logistic Regression, i.e Binomial Logistic Regression. It  can be used while solving a classification problem, i.e. when the y-variable takes on only two values. Such a variable is said to be a “binary” or “dichotomous” variable. “Dichotomous” basically means two categories such as yes/no, defective/non-defective, success/failure, and so on. “Binary” refers to the 0's and 1’s.

Linear vs Logistic Regression


  • Outcome: In linear regression, the outcome (dependent variable) is continuous; it can take any one of an infinite number of possible values. In logistic regression, the outcome (dependent variable) has only a limited number of possible values.
  • The dependent variable: Linear regression is used when the response variable is continuous, for instance weight, height or number of hours. Logistic regression is used when the response variable is categorical in nature, for instance yes/no, true/false, red/green/blue, 1st/2nd/3rd/4th.
  • The independent variables: In linear regression, the independent variables can be correlated with each other. In logistic regression, the independent variables should not be correlated with each other (no multicollinearity).
  • Equation: Linear regression gives an equation of the form Y = mX + C, i.e. an equation of degree 1. Logistic regression gives an equation of the form p = e^(mX + C) / (1 + e^(mX + C)), i.e. a sigmoid of a linear combination of the inputs.
  • Coefficient interpretation: In linear regression, the coefficient interpretation of the independent variables is quite straightforward (holding all other variables constant, a unit increase in a variable changes the dependent variable by its coefficient). In logistic regression, the interpretation depends on the family (binomial, Poisson, etc.) and link (log, logit, inverse-log, etc.) used, and coefficients are typically interpreted on the log-odds scale.
  • Error minimization technique: Linear regression uses the ordinary least squares method to minimize the errors and arrive at the best possible fit, while logistic regression uses the maximum likelihood method; the logistic loss penalizes large errors toward an asymptotic constant rather than quadratically.


[Figure: graphical comparison of linear and logistic regression fits]

How is OLS different from MLE?

Linear regression is estimated using Ordinary Least Squares (OLS) while logistic regression is estimated using Maximum Likelihood Estimation (MLE) approach.

Ordinary Least Squares (OLS), also called linear least squares, is a method to approximately determine the unknown parameters of a linear regression model. The ordinary least squares estimate is obtained by minimizing the total squared vertical distances between the observed responses in the dataset and the responses predicted by the linear approximation (represented by the line of best fit, or regression line). The resulting estimator can be expressed with a simple formula.

For example, let’s say you have a system with more equations than unknown parameters. The ordinary least squares method may be used because it is the most standard approach to finding an approximate solution to such an overdetermined system. In other words, it is your overall solution for minimizing the sum of the squares of the errors in your equations. The data that best fits ordinary least squares minimizes the sum of squared residuals, where a residual is the difference between an observed value and the value predicted by the model.

Maximum likelihood estimation, or MLE, is a method used in estimating the parameters of a statistical model, and for fitting a statistical model to data. If you want to find the height measurement of every basketball player in a specific location, maximum likelihood estimation can be used. If you could not afford to measure all of the basketball players’ heights, the maximum likelihood estimation can come in very handy. Using the maximum likelihood estimation, you can estimate the mean and variance of the height of your subjects. The MLE would set the mean and variance as parameters in determining the specific parametric values in a given model.

To sum it up, maximum likelihood estimation finds the parameter values under which the given, fixed set of data is most probable according to the assumed probability model, for example the mean and variance of a normal distribution. MLE gives us a unified approach to estimation. In some cases, however, maximum likelihood estimation cannot be used, for example when the model’s assumptions do not hold or the likelihood cannot be maximized reliably.
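
For the height example above, a minimal sketch of the MLE estimates under a normal model (the sample mean, and the variance computed with a 1/n divisor), using a hypothetical sample:

import numpy as np

heights = np.array([180.3, 195.1, 188.7, 201.2, 184.5])  # hypothetical heights in cm
mu_mle = heights.mean()          # MLE of the mean
var_mle = heights.var(ddof=0)    # MLE of the variance (divides by n, not n - 1)
print(mu_mle, var_mle)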

Building Logistic Regression Model 

To build a logistic regression model we can use the statsmodels library or the built-in logistic regression implementation in the sklearn library; here we use statsmodels to fit the model and sklearn to evaluate it.

# Importing Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Reading German Credit Data
raw_data = pd.read_csv("/content/German_Credit_data.csv")
raw_data.head()
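
The model-building step below assumes that a train/test split (x_train, y_train, x_test, y_test) has already been prepared. A minimal sketch of one possible split, assuming ‘Creditability’ is the 0/1 target column in the German credit data:

from sklearn.model_selection import train_test_split

# Assumed: 'Creditability' is the 0/1 target, remaining columns are predictors
y = raw_data['Creditability']
x = raw_data.drop(columns=['Creditability'])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)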

Building Logistic Regression Base Model after data preparation:

import statsmodels.api as sm
#Build Logit Model
logit = sm.Logit(y_train,x_train)

# fit the model
model1 = logit.fit()

# Printing Logistic Regression model results
model1.summary2()
Optimization terminated successfully.
Current function value: 0.480402
Iterations 6
Model:                  Logit                            Pseudo R-squared:  0.197    
Dependent Variable:     Creditability                    AIC:               712.5629
Date:                   2019-09-19 09:55                 BIC:               803.5845
No. Observations:       700                              Log-Likelihood:   -336.28
Df Model:               19                               LL-Null:          -418.79
Df Residuals:           680                              LLR p-value:       2.6772e-25
Converged:              1.0000                           Scale:             1.0000
No. Iterations:         6.0000
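
The evaluation steps that follow use a data frame of predicted classes, predicted_df['Predicted_Class']. A minimal sketch of how it might be constructed, assuming a 0.5 cut-off on the predicted probabilities from model1:

# Predicted probabilities on the test set
y_pred_prob = model1.predict(x_test)

# Hypothetical construction of predicted_df with a 0.5 cut-off
predicted_df = pd.DataFrame({'Predicted_Prob': y_pred_prob})
predicted_df['Predicted_Class'] = (predicted_df['Predicted_Prob'] >= 0.5).astype(int)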

We will calculate the model accuracy on the test dataset using the ‘accuracy_score’ function.

# Checking the accuracy with test data
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,predicted_df['Predicted_Class']))
0.74

We can see the accuracy of 74%.

Model Evaluation

Model evaluation metrics are used to assess how well a model fits the data, to compare different models in the context of model selection, and to estimate how accurate the model’s predictions are expected to be.

What is a Confusion Matrix?

A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix. The confusion matrix shows the ways in which your classification model is confused when it makes predictions.

[Figure: confusion matrix layout]

Confusion Matrix gives insight not only into the errors being made by your classifier but more importantly the types of errors that are being made. It is this breakdown that overcomes the limitation of using classification accuracy alone.

How to Calculate a Confusion Matrix

Below is the process for calculating a confusion Matrix:

  1. You need a test dataset or a validation dataset with expected outcome values.
  2. Make a prediction for each row in your test dataset.
  3. From the expected outcomes and predictions count:
  • The number of correct predictions for each class.
  • The number of incorrect predictions for each class, organized by the class that was predicted.

These numbers are then organized into a table or a matrix as follows:

  • Predicted down the side: each row of the matrix corresponds to a predicted class.
  • Expected across the top: each column of the matrix corresponds to an actual (expected) class.

The counts of correct and incorrect classifications are then filled into the table.
The number of correct predictions for a class goes into the cell where the predicted row and the actual column both refer to that class, i.e. the diagonal of the matrix.

In the same way, the number of incorrect predictions for a class goes into the row of the class that was predicted and the column of the class that was actually expected.

2-Class Confusion Matrix Case Study

Let us consider we have a two-class classification problem of predicting whether a photograph contains a man or a woman. We have a test dataset of 10 records with expected outcomes and a set of predictions from our classification algorithm.

Expected     Predicted
Man          Woman
Man          Man
Woman        Woman
Man          Man
Woman        Man
Woman        Woman
Woman        Woman
Man          Man
Man          Woman
Woman        Woman

Let’s start off and calculate the classification accuracy for this set of predictions.

The algorithm made 7 of the 10 predictions correctly, giving an accuracy of 70%:

accuracy = total correct predictions / total predictions made * 100
accuracy = 7 / 10 * 100 = 70%

But what are the types of errors made?
We can determine that by turning our results into a confusion matrix:
First, we must calculate the number of correct predictions for each class.

  • men classified as men: 3
  • women classified as women: 4

Now, we can calculate the number of incorrect predictions for each class, organized by the predicted value:

  • men classified as women: 2
  • women classified as men: 1

We can now arrange these values into the 2-class confusion matrix:


                   actual men   actual women
predicted men           3             1
predicted women         2             4

From the above table we learn that:

  • The total number of actual men in the dataset is the sum of the values in the “actual men” column (3 + 2 = 5).
  • The total number of actual women in the dataset is the sum of the values in the “actual women” column (1 + 4 = 5).
  • The correct values are organized in a diagonal line from top left to bottom-right of the matrix.
  • More errors were made by predicting men as women than predicting women as men.

Two-Class Problems Are Special

In a two-class problem, we are often looking to discriminate observations with a specific outcome from normal observations, such as a disease state or an event versus a no-disease state or no-event. We can treat the event class as “positive” and the no-event class as “negative”, and label each prediction as “true” or “false” depending on whether it matches the actual class.

This gives us:

  • “true positive” for correctly predicted event values.
  • “false positive” for incorrectly predicted event values.
  • “true negative” for correctly predicted no-event values.
  • “false negative” for incorrectly predicted no-event values.

We can summarize this in the confusion matrix as follows:


                   event (man)   no-event (woman)
predicted men        TP = 3          FP = 1
predicted women      FN = 2          TN = 4

This can help in calculating more advanced classification metrics such as precision, recall, specificity and sensitivity of our classifier. 

Sensitivity / Recall = TP / (TP + FN) = 3 / (3 + 2) = 0.6
Specificity = TN / (TN + FP) = 4 / (4 + 1) = 0.8
Precision = TP / (TP + FP) = 3 / (3 + 1) = 0.75
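
These values can be cross-checked with scikit-learn on the ten expected/predicted labels from the case study, treating ‘man’ as the positive class:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

expected  = ['man', 'man', 'woman', 'man', 'woman', 'woman', 'woman', 'man', 'man', 'woman']
predicted = ['woman', 'man', 'woman', 'man', 'man', 'woman', 'woman', 'man', 'woman', 'woman']

tn, fp, fn, tp = confusion_matrix(expected, predicted, labels=['woman', 'man']).ravel()
print(tp, fp, fn, tn)                                          # 3 1 2 4
print(recall_score(expected, predicted, pos_label='man'))      # 0.6
print(precision_score(expected, predicted, pos_label='man'))   # 0.75
print(tn / (tn + fp))                                          # specificity = 0.8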

The code mentioned below shows the implementation of the confusion matrix in Python with respect to the German credit data example used earlier:

# Confusion Matrix
from sklearn.metrics import confusion_matrix

# Use a separate name (cm) so the imported function is not shadowed
cm = confusion_matrix(y_test, predicted_df['Predicted_Class']).ravel()
cm
array([ 37,  63,  15, 185])

For binary 0/1 labels, sklearn’s flattened confusion matrix is ordered (tn, fp, fn, tp). So 37 true negatives and 185 true positives are the correct predictions, while 63 false positives and 15 false negatives are the incorrect predictions.

Receiver Operating Characteristic (ROC)

The receiver operating characteristic (ROC), or the ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity or the sensitivity index d', known as "d-prime" in signal detection and biomedical informatics, or recall in machine learning. The false-positive rate is also known as the fall-out and can be calculated as (1 - specificity). The ROC curve is thus the sensitivity as a function of fall-out.

[Figure: ROC curve - true positive rate plotted against false positive rate]

There are a number of methods of evaluating whether a logistic model is a good model. One such way is sensitivity and specificity. Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as classification function:

Sensitivity / Recall (also known as the true positive rate, or the recall) measures the proportion of actual positives which are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition), and is complementary to the false negative rate. It shows how good a test is at detecting the positives. A test can cheat and maximize this by always returning “positive”.

 Sensitivity= true positives/ (true positive + false negative)

Specificity (also called the true negative rate) measures the proportion of negatives which are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition), and is complementary to the false positive rate. It shows how good a test is at avoiding false alarms. A test can cheat and maximize this by always returning “negative”.

Specificity= true negatives/ (true negative + false positives)

Precision measures how many of the predicted positive values were actually positive. Precision is often used together with recall, the percentage of all relevant items that are returned. The two measures are sometimes combined in the F1 Score (or f-measure) to provide a single measurement for a system. Precision shows how many of the positively classified items were relevant. A test can cheat and maximize this by returning a positive prediction only for the one result it is most confident in.

Precision = true positives / (true positives + false positives)

The precision-recall curve shows the trade-off between precision and recall for different thresholds. The choice of threshold is driven mainly by the relative importance of precision and recall. Ideally, we want both precision and recall to be 1, but this is seldom the case. In case of a precision-recall tradeoff, we use the following arguments to decide upon the threshold (a code sketch for inspecting the tradeoff follows the list):

  1. Low Precision/High Recall: In applications where we want to reduce the number of false negatives without necessarily reducing the number of false positives, we choose a decision threshold that gives a high value of Recall even at a low value of Precision. For example, in a cancer diagnosis application, we do not want any affected patient to be classified as not affected, even if that means some healthy patients are wrongly flagged. This is because the absence of cancer can be confirmed by further medical tests, but the presence of the disease cannot be detected in a candidate who has already been rejected.
  2. High Precision/Low Recall: In applications where we want to reduce the number of false positives without necessarily reducing the number of false negatives, we choose a decision threshold that gives a high value of Precision even at a low value of Recall. For example, if we are classifying whether customers will react positively or negatively to a personalised advertisement, we want to be absolutely sure that the customer will react positively to the advertisement, because otherwise a negative reaction can cause a loss of potential sales from the customer.
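
A minimal sketch for inspecting this tradeoff, assuming y_test and the predicted probabilities y_pred_prob from model1 (as sketched earlier):

from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)

# precision and recall have one more element than thresholds, so drop the last value
plt.plot(thresholds, precision[:-1], label='Precision')
plt.plot(thresholds, recall[:-1], label='Recall')
plt.xlabel('Threshold')
plt.legend()
plt.title('Precision-Recall Tradeoff')
plt.show()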

The code mentioned below shows the implementation in Python with respect to the example used earlier:

from sklearn.metrics import classification_report

print(classification_report(y_test, predicted_df['Predicted_Class']))

[Output of the classification report: precision, recall, f1-score and support for each class]

The f1-score tells you how well the classifier identifies the data points of a particular class relative to all other classes; it is calculated as the harmonic mean of precision and recall. The support is the number of samples of the true response that lie in that class.

y_pred_prob = model1.predict(x_test)

from sklearn.metrics import roc_curve
# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
[Figure: ROC curve for the model]

# AUC
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test,predicted_df['Predicted_Class'])
0.6475

Area Under the Curve is 0.6475

Hosmer Lemeshow Goodness-of-Fit

  • The Hosmer-Lemeshow (HL) test measures the agreement between actual events and predicted probabilities.
  • How well our model fits depends on the difference between the model and the observed data. One approach for binary data is to perform a Hosmer-Lemeshow goodness-of-fit test.
  • In the HL test, the null hypothesis states that the model fits the data well. The model appears to fit well if there is no significant difference between the model and the observed data (i.e. the p-value > 0.05, so we do not reject H0).
  • In other words, if the test is NOT statistically significant, that indicates the model is a good fit.
  • As with all measures of model fit, use this as just one piece of information in deciding how well this model fits. It doesn’t work well in very large or very small data sets, but is often useful nonetheless.
G²HL = Σ (j = 1 to g) [ (Oj − Ej)² / ( Ej (1 − Ej / nj) ) ]  ~  χ² with g − 2 degrees of freedom
  • χ² = the chi-squared distribution.
  • g = number of groups.
  • nj = number of observations in the jth group.
  • Oj = number of observed cases in the jth group.
  • Ej = number of expected cases in the jth group.
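
A minimal sketch of the Hosmer-Lemeshow test, assuming y_test holds the observed 0/1 outcomes and y_pred_prob the predicted probabilities; the grouping into deciles of risk and the degrees of freedom (number of groups minus 2) follow the usual convention:

import numpy as np
import pandas as pd
from scipy import stats

def hosmer_lemeshow(y_true, y_prob, g=10):
    # Group observations into g bins by predicted probability (deciles of risk)
    df = pd.DataFrame({'y': np.asarray(y_true), 'p': np.asarray(y_prob)})
    df['bin'] = pd.qcut(df['p'], q=g, duplicates='drop')
    grouped = df.groupby('bin', observed=True)
    obs = grouped['y'].sum()      # observed events per group (Oj)
    exp = grouped['p'].sum()      # expected events per group (Ej)
    n = grouped['y'].count()      # group sizes (nj)
    hl_stat = (((obs - exp) ** 2) / (exp * (1 - exp / n))).sum()
    dof = len(obs) - 2
    p_value = 1 - stats.chi2.cdf(hl_stat, dof)
    return hl_stat, p_value

hl, p = hosmer_lemeshow(y_test, y_pred_prob)
print(hl, p)   # p > 0.05 suggests no evidence of poor fit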

Gini Coefficient

  • The Gini coefficient is sometimes used in classification problems.
  • The Gini coefficient can be derived directly from the AUC of the ROC curve. Gini is the ratio of the area between the ROC curve and the diagonal line to the area of the upper triangle. The formula used is:
Gini = 2*AUC − 1
  • A Gini above 60% indicates a good model.
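
Applying the formula to the AUC computed earlier (0.6475):

auc = 0.6475
gini = 2 * auc - 1   # = 0.295, i.e. a Gini of about 29.5%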

Akaike Information Criterion and Bayesian Information Criterion

  • AIC and BIC are model-selection criteria that, like adjusted R-squared in linear regression, penalize model complexity; lower values indicate a better model.
  • AIC = −2·ln(L) + 2k, where L is the maximized likelihood and k is the number of estimated parameters.
  • BIC = −2·ln(L) + k·ln(n), where n is the number of observations.
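
These formulas can be checked against the statsmodels summary shown earlier (Log-Likelihood = −336.28, k = 20 estimated parameters, n = 700 observations):

import numpy as np

llf, k, n = -336.28, 20, 700
aic = -2 * llf + 2 * k           # 712.56, matching the AIC in the summary
bic = -2 * llf + k * np.log(n)   # 803.58, matching the BIC in the summary
print(aic, bic)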

Pros and Cons of Logistic Regression

Many of the pros and cons of the linear regression model also apply to the logistic regression model. Although logistic regression is widely used for solving various types of problems, its performance can fall short because of its limitations, and other predictive models may provide better results in such cases.

Pros

  • The logistic regression model not only acts as a classification model, but also gives you probabilities. This is a big advantage over other models where they can only provide the final classification. Knowing that an instance has a 99% probability for a class compared to 51% makes a big difference. Logistic Regression performs well when the dataset is linearly separable.
  • Logistic Regression not only gives a measure of how relevant a predictor is (coefficient size), but also its direction of association (positive or negative). Logistic regression is also easy to implement and interpret, and very efficient to train.

Cons

  • Logistic regression can suffer from complete separation. If there is a feature that would perfectly separate the two classes, the logistic regression model can no longer be trained, because the weight for that feature would not converge; the optimal weight would be infinite. This is unfortunate, because such a feature would be very useful. But you do not need machine learning if you have a simple rule that separates both classes. The problem of complete separation can be solved by introducing penalization of the weights or by defining a prior probability distribution over the weights.
  • Logistic regression is less prone to overfitting than more flexible models, but it can still overfit in high-dimensional datasets; in that case, regularization techniques should be considered.

In this article we have seen what Logistic Regression is, how it works, when to use it, how it compares with Linear Regression, the difference between the approach and usage of the two estimation techniques (Maximum Likelihood Estimation and Ordinary Least Squares), model evaluation using the Confusion Matrix, and the advantages and disadvantages of Logistic Regression. We have also covered the basics of the sigmoid function, the cost function and gradient descent.

If you are inspired by the opportunities provided by machine learning, enrol in our  Data Science and Machine Learning Courses for more lucrative career options in this landscape.

Priyankur Sarkar

Data Science Enthusiast

Priyankur Sarkar loves to play with data and get insightful results out of it, then turn those data insights and results in business growth. He is an electronics engineer with a versatile experience as an individual contributor and leading teams, and has actively worked towards building Machine Learning capabilities for organizations.

Join the Discussion

Your email address will not be published. Required fields are marked *

Suggested Blogs

Top Data Science Trends in 2020

Industry experts are of the view that 2020 will be a huge year for data science and AI. The expected growth rate for AI market will be at 118.6 billion by 2025. The focus areas in the overall AI market will include everything from natural language processing to robotic process automation. Since the beginning of digital era, data has been growing at the speed of light! There will only be a surge in this growth. New data will not only generate more innovative use cases but also spearhead a revolution of innovation.About 77% of the devices today have AI incorporated into them. Smart devices, Netflix recommendations, Amazon’s Alexa Google Home have transformed the way we live in the digital age.  The renowned AI-powered virtual nurses “Molly” and “Angel”, have taken healthcare to new heights and robots have already been performing various surgical procedures.Dynamic technologies like data science and AI have some intriguing data science trends to watch out for, in 2020. Check out the top 6 data science trends in 2020 any data science enthusiast should know:1. Advent of Deep LearningSimply put, deep learning is a machine learning technique that trains computers to think and act like humans i.e., by example. Ever since, deep learning models have proven their efficacy by exceeding human limitations and performance. Deep learning models are usually trained using a large set of labelled data and multi-layered neural network architectures.What’s new for Deep Learning in 2020?In 2020, deep learning will be quite significant. Its capacity to foresee and understand human behaviour and how enterprises can utilize this knowledge to stay ahead of their competitors will come in handy.2. Spotlight on Augmented AnalyticsAlso hailed as the future of Business Intelligence, Augmented analytics employs machine learning/artificial intelligence (ML/AI) techniques to automate data preparation, insight discovery and sharing, data science and ML model development, management and deployment. This can be greatly beneficial for companies to improve their offerings and customer experience. The global augmented analytics market size is projected to reach $29,856 million by 2025. Its growth rate is expected to be at a CAGR of 28.4% from 2018 to 2025.What’s New for Augmented Analytics in 2020?This year, augmented analytics platforms will help enterprises leverage social component. The use of interactive dashboards and visualizations in augmented analytics will help stakeholders share important insights and create a crystal-clear narrative that echoes the company’s mission.3. Impact of IoT, ML and AI2020 will see the rise of AI/ML, 5G, cybersecurity and IoT. The rise in automation will create opportunities for new skills to be explored. Upskilling in emerging new technologies will make professionals competent in the dynamic tech space today. As per a survey by IDC, over 75% of organizations will invest in reskilling programs or their workforce to bridge the rising skill gap by 2025.What’s new for IoT, ML and AI in 2020?It has been estimated that over 24 billion devices will be connected to the Internet of Things this year. This means industries can create a world of difference by developing smart devices that make a difference to the way we live.4. Better Mobile Analytics StrategiesMobile analytics deals with measuring and analysing data created across various mobile platform sites and applications alone. It helps businesses keep track of the behaviour of their users on mobile sites and apps. 
This technology will aid in boosting the cross-channel marketing initiatives of an enterprise, while optimizing the mobile experience and growing user engagement and retention.

What's new for Mobile Analytics in 2020?

With the ever-increasing number of mobile phone users globally, there will be a heightened focus on mobile app marketing and app analytics. Currently, mobile advertising ranks first in digital advertising worldwide. This has made mobile analytics essential, as businesses today can track in-app traffic, potential security threats, as well as levels of customer satisfaction.

5. Enhanced Levels of Customization

Access to real-time data and customer behaviour has made it possible to cater to each customer's specific needs. As customer expectations soar, companies will have to buckle up to deliver more personalized, relevant and superior customer experiences. The use of data and AI will make this possible.

What's new for User Experience in 2020?

User experience can be interpreted easily using data derived from conversions, pageviews, and other user actions. These insights will help user experience professionals make better decisions while providing users exactly what they need.

6. Better Cybersecurity

2019 brought to light the grim reality of data privacy and security breaches. With over 24 billion devices estimated to be connected to the internet this year, enterprises will deploy stringent measures to protect data privacy and prevent security breaches. Industry experts are of the view that combining cybersecurity with AI-enabled technology will deliver attack surface coverage that is up to 20x more effective than traditional methods.

What's new for Cybersecurity in 2020?

Since data growth shows no signs of slowing, new threats will keep looming on the horizon. In 2020, cybersecurity professionals will have to gear up and come up with new and improved ways to secure data. AI-supported cybersecurity measures will therefore be deployed to prevent malicious malware attacks and ensure better security.

The Way Ahead in 2020

Data science will be one of the fastest-growing technologies in 2020. Its wide range of applications across industries will pave the way for more innovative trends like the ones above.

Growing Applications of Artificial Intelligence in Healthcare

Artificial intelligence (AI) deals with building smart devices/machines that are capable of performing tasks autonomously, without human intervention. Globally, it's one of the most popular new-age skills today, with a wide range of applications across sectors like finance, healthcare, space exploration and manufacturing.

AI-driven Healthcare market

According to a report by MarketsandMarkets, AI in the healthcare domain is projected to reach USD 36.15 billion by 2025. Driving this growth are the need to lower healthcare-related costs, the need to process large amounts of data, the widespread adoption of precision medicine, and a dip in hardware costs in this domain.

According to a recent report, there will be 58 million AI jobs by 2022. Frost & Sullivan claimed in their study that about $6.7 billion will be generated globally using artificial intelligence systems in healthcare by 2021. Hence, the demand for professionals trained in AI with expertise in healthcare is on the rise. Below are some applications of AI in healthcare.

AI-based diagnostic technology

Disease diagnosis is the most integral part of healthcare, and accuracy in diagnosis is imperative to the right treatment plan. According to a recent Harvard Medical Practice study, faulty diagnosis accounted for 17% of preventable errors among hospitalized patients. This is where the accuracy and sophistication of AI will come in handy. AI-supported diagnostic machines are being increasingly used in hospitals today. Their ability to analyze large amounts of data from medical images has paved the way for early diagnosis of diseases. Early diagnosis of cancer, stroke, heart attack and tumors can help doctors develop an exhaustive treatment plan for patients before the disease progresses.

AI and machine learning have been commonly employed to detect the following conditions:
- Lung cancer and strokes
- Heart disease and its severity
- Skin lesions and their classification
- Diabetic retinopathy

AI-Powered Biomarkers

Conventional medicine used to follow a 'one-size-fits-all' ethos. Precision medicine, however, advocates that the diagnosis of certain diseases and the medicines prescribed to a patient vary according to their genetic makeup, lifestyle and environmental influences. This is where AI-based biomarkers have a role to play. These markers are intelligent enough to track real-time audio/visual information on a patient's vital health parameters. Doctors can then look at the data collected and devise a comprehensive treatment plan exclusive to the patient.

Virtual nursing assistance and remote monitoring

Nurses and other healthcare assistants are the backbone of any healthcare network. AI-based virtual assistants and chatbots support these care providers by reducing their workload. Specifically, AI-based virtual assistants aid in monitoring patients post-discharge and make many outpatient services hassle-free. AI-enabled wearable devices serve as virtual health assistants that remind patients to follow their diet/medicine chart, while round-the-clock remote monitoring of their vital signs sends real-time alerts to care providers. This AI-based tracking approach prevents unnecessary visits to the physician as well.

AI-based drug discovery

In the United States, only 5 in 5,000 drugs make it from preclinical testing to human testing, and the chance of a new drug reaching the market is just 1 in 5,000! When AI-driven computing is applied to drug testing and research, there is a greater degree of accuracy.
The route of drug discovery (from testing to market availability) also becomes faster and more cost-efficient. Many pharmaceutical companies have adopted AI-based drug discovery to develop drugs that could support the treatment of cancer and other chronic diseases.

AI-enabled hospital care

For patients suffering from chronic diseases or even acute conditions, AI can help a great deal in simplifying care delivery. Procedures like the monitoring of IV solutions, tracking a patient's medications, patient alert/feedback systems, performance assessment systems and patient movement tracking within hospitals can all be managed with AI assistance. Robot-assisted surgeries are highly accurate and have the added advantage of reducing human error. A 2017 report by Accenture claimed that AI-assisted clinical health applications could help save a whopping $40 billion in robot-assisted surgery and $18 billion in administrative workflow assistance.

AI is here to stay

Despite being present in healthcare for many years, AI applications still have a long way to go. One of the major challenges in AI-based healthcare is the lack of skilled AI experts with domain knowledge in life sciences.

Exciting opportunities for trained AI professionals

The industry is growing rapidly at a pace of 40% per year. However, there is a dire shortage of talent, with fewer than 10,000 professionals possessing the right skills to create fully functional artificial intelligence systems. Aspiring AI engineers thus have a plethora of opportunities to redefine healthcare as well as land exciting job roles.

What are Decision Trees in Machine Learning (Classification And Regression)

Introduction to Machine Learning and its types

Machine Learning is an interdisciplinary field of study and a sub-domain of Artificial Intelligence. It gives computers the ability to learn and infer from a huge amount of homogeneous data, without having to be programmed explicitly.

Types of Machine Learning: Machine Learning can broadly be classified into three types:

- Supervised Learning: If the available dataset has predefined features and labels on which the machine learning models are trained, the type of learning is known as Supervised Machine Learning. Supervised models can broadly be classified into two sub-parts, Classification and Regression, which are discussed in detail below.
- Unsupervised Learning: If the available dataset has predefined features but lacks labels, machine learning algorithms perform operations on the data to assign labels to it or to reduce its dimensionality. The most common Unsupervised Learning models are Principal Component Analysis (PCA) and Clustering.
- Reinforcement Learning: Reinforcement Learning is a more advanced type of learning in which the model learns from "experience". Here, features and labels are not clearly defined: the model is given a "situation" and is rewarded or penalized based on the "outcome". The model thus learns to optimize the situation to maximize the rewards, improving the outcome with experience.

Classification

Classification is the process of determining/predicting the category to which a data point belongs. It is the process by which a supervised learning algorithm learns to draw inferences from the features of a given dataset and predict which class, group or category a particular data point belongs to.

Example of Classification: Let's assume that we are given a few images of handwritten digits (0-9). The problem statement is to "teach" the machine to classify correctly which image corresponds to which digit. The machine has to be trained such that, when given any such handwritten digit as input, it correctly identifies which digit the image represents. This is called classification of handwritten digits.

Looking at another example which is not image-based, suppose we have 2D data (features x1 and x2) plotted on a graph, where red and green dots represent two different classes or categories of data. The main goal of the classifier is that, given one such "dot" of unknown class, it should be able to correctly decide, based on the dot's features, whether it belongs to the red or the green class. This can be pictured as a line going through the middle of the plot that correctly classifies the majority of the dots (a minimal code sketch of this idea follows the applications below).

Applications of Classification: Listed below are some real-world applications of classification algorithms.

- Face Recognition: Face recognition finds its applications in our smartphones and any other place with biometric security. Face recognition is essentially face detection followed by classification: the classification algorithm determines whether the face in the image matches the registered user or not.
- Medical Image Classification: Given patient data, a well-trained model is often used to classify whether the patient has a malignant tumor (cancer), heart ailments, fractures, etc.
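To make the 2D red/green example concrete, here is a minimal, illustrative sketch of fitting a linear classifier to two clusters of synthetic 2D points. The generated data, the choice of LogisticRegression and the sample point are assumptions made purely for demonstration and are not part of the original example.

# Illustrative sketch: two classes of 2D points separated by a linear boundary.
# The synthetic data and the choice of LogisticRegression are assumptions for demonstration only.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Generate two well-separated clusters of 2D points (the "red" and "green" dots).
X, y = make_blobs(n_samples=200, centers=2, n_features=2, random_state=42)

# Fit a linear classifier; its decision boundary plays the role of the line through the middle.
clf = LogisticRegression().fit(X, y)

# Classify a new, unseen "dot" based on its features (x1, x2).
print(clf.predict([[0.0, 5.0]]))   # predicted class label for the new point
print(clf.score(X, y))             # fraction of training dots classified correctly

Any other classifier, including a decision tree, could be dropped in place of LogisticRegression here; the point is only that the model learns a boundary that separates the two classes.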
Regression

Regression is also a type of supervised learning. Unlike classification, it does not predict the class of the given data; instead, it predicts values based on the "features" it encounters.

Example of Regression: For this, consider a dataset of California housing prices. The dataset has several columns, each representing a "feature" based on which the machine learning algorithm predicts the housing price. The primary goal of the regression algorithm is that, given the features of a house, it should be able to correctly estimate the price of that house. This is called a regression problem. It is similar to curve fitting and is often confused with it.

Applications of Regression: Listed below are some real-world applications of regression algorithms.

- Stock Market Prediction: Regression algorithms are used to predict the future price of stocks based on past features like the time of day, festival periods, etc. Stock market prediction also falls under a subdomain of study called Time Series Analysis.
- Object Detection: Object detection is the process of locating a given object in an image or video. It returns the pixel coordinates describing the location of the object in the image; these coordinates are determined using regression algorithms alongside classification.

Classification vs Regression:
- Classification assigns specific classes to the data based on its features; the prediction is discrete or categorical in nature.
- Regression predicts values based on the features of the dataset; the prediction is continuous in nature.

Introduction to the building blocks of Decision Trees

In order to get started with decision trees, it is important to understand their basic building blocks. Hence, we start building the concepts slowly with some basic theory.

1. Entropy

Definition: Entropy is a commonly used concept in Information Theory and is a measure of the "purity" of an arbitrary collection of information.

Mathematical Equation: For a collection S containing positive and negative examples, the entropy of S is

Entropy(S) = −p₊ log₂(p₊) − p₋ log₂(p₋)

where p₊ and p₋ are the proportions of positive and negative examples in S. In a more generalized (multi-class) form, entropy is given by

Entropy(S) = −Σᵢ pᵢ log₂(pᵢ)

where pᵢ is the proportion of examples in S belonging to class i.

Example: As an example, take a sample S containing 14 data samples, 9 positive and 5 negative, denoted [9+, 5−]. The entropy of this sample is

Entropy(S) = −(9/14) log₂(9/14) − (5/14) log₂(5/14) ≈ 0.940

2. Information Gain

Definition: With the knowledge of entropy, we can calculate the amount of relevant information gained by knowing the value of an attribute; this quantity is known as Information Gain.

Mathematical Equation:

Gain(S, A) = Entropy(S) − Σᵥ (|Sᵥ| / |S|) · Entropy(Sᵥ)

Here, Gain(S, A) is the information gain of an attribute A relative to a sample S, the sum runs over all values v in Values(A) (the set of all possible values of A), and Sᵥ is the subset of S for which A has value v.

Example: As an example, let's assume S is the collection of 14 training examples above. We will consider the attribute Wind, whose possible values are Weak and Strong. In addition to the previous example, assume that of the 9 positive and 5 negative samples, 6 positive and 2 negative samples have Wind = Weak, and the remaining have Wind = Strong.
Thus, under these circumstances, the information gained by the attribute Wind works out as follows:

Entropy(S_Weak) = −(6/8) log₂(6/8) − (2/8) log₂(2/8) ≈ 0.811
Entropy(S_Strong) = −(3/6) log₂(3/6) − (3/6) log₂(3/6) = 1.000
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.000 ≈ 0.048

Decision Tree

Introduction: With the basic building blocks out of the way, let's try to understand what exactly a decision tree is. As the name suggests, it is a tree that is developed based on decisions taken by the algorithm in accordance with the data it has been trained on. In simple words, a decision tree uses the features in the given data to perform supervised learning and develop a tree-like data structure whose branches are built in such a way that, given the feature set, the tree can predict the expected output fairly accurately.

Example: Let us look at the structure of a decision tree using an example dataset called the "PlayTennis" dataset. The target of the model is to predict whether the weather conditions are suitable to play tennis or not. The dataset contains certain information (features) for each day: the feature attributes Outlook, Temperature, Humidity and Wind, and the target attribute PlayTennis. Each of these attributes can take certain values; for example, the attribute Outlook takes the values Sunny, Rain and Overcast. The decision tree learned from this dataset tests one attribute at each internal node and, given values for each of the attributes, gives a clear answer as to whether the weather is suitable for tennis or not.

Algorithm: With the overall intuition of decision trees in place, let us look at the formal ID3 algorithm:

ID3(Samples, Target_attribute, Attributes):
- Create a root node Root for the tree.
- If all the Samples are positive, return the single-node tree Root with label = +.
- If all the Samples are negative, return the single-node tree Root with label = −.
- If Attributes is empty, return the single-node tree Root with label = the most common value of Target_attribute among the Samples.
- Otherwise:
  - A ← the attribute from Attributes that best classifies the Samples.
  - The decision attribute for Root ← A.
  - For each possible value vi of A:
    - Add a new tree branch below Root, corresponding to the test A = vi.
    - Let Samples_vi be the subset of Samples that have value vi for A.
    - If Samples_vi is empty, add below the new branch a leaf node with label = the most common value of Target_attribute in the Samples.
    - Else, add below the new branch the subtree ID3(Samples_vi, Target_attribute, Attributes − {A}).
- Return Root.

Connecting the dots: Since the overall idea of decision trees has been explained, let's figure out how entropy and information gain fit into this process. Entropy is used to calculate information gain, which identifies the attribute of a given dataset that provides the highest amount of information. The attribute providing the most information is considered to contribute most towards the outcome of the classifier and is therefore given higher priority in the tree. For example, in the PlayTennis dataset, if we calculate the information gain for the two attributes Humidity and Wind, we find that Humidity plays a more important role in deciding whether to play tennis or not; hence, Humidity is considered the better classifier of the two. A small code sketch of these entropy and information-gain calculations is shown below.
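As a companion to the worked numbers above, here is a minimal Python sketch (not taken from the original article) that computes the entropy of the [9+, 5−] sample and the information gain of the Wind attribute from the counts stated in the example.

import math

def entropy(pos, neg):
    """Entropy of a sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # 0 * log2(0) is treated as 0
            p = count / total
            result -= p * math.log2(p)
    return result

# Sample S = [9+, 5-]
e_s = entropy(9, 5)        # ~0.940

# Wind = Weak -> [6+, 2-], Wind = Strong -> [3+, 3-]
e_weak = entropy(6, 2)     # ~0.811
e_strong = entropy(3, 3)   # 1.0

# Gain(S, Wind) = Entropy(S) - sum over values of (|Sv|/|S|) * Entropy(Sv)
gain_wind = e_s - (8/14) * e_weak - (6/14) * e_strong
print(round(e_s, 3), round(gain_wind, 3))  # 0.94 0.048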
The detailed calculation for any other attribute, such as Humidity, follows the same procedure: compute the entropy of each subset produced by the attribute's values, weight it by the subset's size, and compare the resulting gains.

Applications of Decision Tree

With the basic idea out of the way, let's look at where decision trees can be used:

- Selecting a flight to travel: Decision trees are very good at classification and hence can be used to select which flight would yield the best "bang for the buck". There are a lot of parameters to consider, such as whether the flight is connecting or non-stop, or how reliable the service record of the given airline is, etc.
- Selecting alternative products: In companies, it is often important to determine which product will be more profitable at launch. Given sales attributes such as market conditions, competition, price, availability of raw materials, demand, etc., a decision tree classifier can be used to determine which of the products would maximize the profits.
- Sentiment Analysis: Sentiment analysis is the determination of the overall opinion of a given piece of text, and is especially used to determine whether a writer's comment towards a given product/service is positive, neutral or negative. Decision trees are very versatile classifiers and are used for sentiment analysis in many Natural Language Processing (NLP) applications.
- Energy Consumption: It is very important for electricity supply boards to correctly predict the amount of energy consumption in the near future for a particular region, so that unused power can be diverted towards areas with higher demand and a regular, uninterrupted supply can be maintained throughout the grid. Decision trees are often used to determine which regions are expected to require more or less power in the upcoming time frame.
- Fault Diagnosis: In the engineering domain, one of the widely used applications of decision trees is the detection of faults. In the case of load-bearing rotatory machines, it is important to determine which component(s) have failed and which ones can directly or indirectly be affected by the failure. This is determined from a set of measurements; unfortunately, there are numerous measurements to take and some of them are not relevant to detecting the fault. A decision tree classifier can be used to quickly determine which of these measurements are relevant to the determination of the fault.

Advantages of Decision Tree

Listed below are some of the advantages of decision trees:

- Comprehensive: A significant advantage of a decision tree is that it forces the algorithm to take into consideration all possible outcomes of a decision and traces each path to a conclusion.
- Specific: The output of decision trees is very specific and reduces uncertainty in the prediction. Hence, they are considered really good classifiers.
- Easy to use: Decision trees are among the simplest, yet most versatile, algorithms in machine learning. They are based on simple math and no complex formulas, and they are easy to visualize, understand and explain.
- Versatile: A lot of business problems can be solved using decision trees. They find applications in engineering, management, medicine, etc., basically any situation where data is available and a decision needs to be taken under uncertain conditions.
- Resistant to data abnormalities: Data is never perfect and there are always abnormalities in a dataset; some of the most common are outliers, missing data and noise. While most machine learning algorithms fail with even a minor set of abnormalities, decision trees are quite resilient and can handle a fair percentage of such abnormalities without the results being altered.
- Visualization of the decision taken: Often in machine learning, data scientists struggle to reason about why a certain model produces a certain set of outputs. For most algorithms, it is not possible to clearly determine and visualize the actual process of classification that leads to the final outcome. Decision trees, however, are very easy to visualize: once the tree is trained, it can be plotted and the programmer can see exactly how and why a conclusion was reached. It is also easy to explain the outcome to a non-technical team with the "tree" type visualization. This is why many organizations prefer decision trees over other machine learning algorithms.

Limitations of Decision Tree

Listed below are some of the limitations of decision trees (a small tuning sketch follows this list):

- Sensitivity to hyperparameter tuning: Decision trees are very sensitive to hyperparameter tuning. Hyperparameters are the parameters that are under the control of the programmer and can be tuned to get better performance out of a given model. Unfortunately, the output of a decision tree can vary drastically if the hyperparameters are poorly tuned.
- Overfitting: Decision trees are prone to overfitting, a situation in which the model learns the training data too well and hence performs well on the training dataset but fails on the testing dataset. Decision trees tend to overfit if the breadth and depth of the tree are set too high for a simple dataset.
- Underfitting: Similarly, decision trees are also prone to underfitting, where the model is too simple to learn the dataset effectively. A decision tree suffers from underfitting if the depth of the tree or the number of nodes is set too low, which does not allow the model to fit the data properly.
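As a rough illustration of how these hyperparameters could be tuned rather than guessed, here is a hedged sketch using scikit-learn's GridSearchCV with cross-validation. The Iris dataset and the parameter grid below are assumptions chosen purely for demonstration and are not part of the original article.

# Illustrative sketch: tuning decision-tree hyperparameters with cross-validation.
# The Iris dataset and the parameter grid below are assumptions for demonstration only.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate values for two hyperparameters that control tree complexity.
param_grid = {
    "max_depth": [2, 3, 5, 10, None],
    "min_samples_leaf": [1, 2, 5, 10],
}

# 5-fold cross-validated grid search over the candidate settings.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # hyperparameter combination with the best CV score
print(search.best_score_)    # corresponding mean cross-validation accuracy

Cross-validated search of this kind is one common way to steer a tree between the overfitting and underfitting extremes described above.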
Code Examples

With the theory out of the way, let's look at the practical implementation of decision tree classifiers and regressors.

1. Classification

To demonstrate classification, a diabetes dataset from Kaggle (pima-indians-diabetes.csv) has been used; it can be downloaded from Kaggle. The initial step for any data science application is to look at the data. In this dataset, the label column is the target value that the model is expected to predict from the remaining parameters.

Load the libraries: We will be using pandas to load and manipulate the data, and sklearn to apply machine learning models to it.

# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier        # Decision Tree classifier
from sklearn.model_selection import train_test_split   # train/test split function
from sklearn import metrics                             # metrics module for accuracy calculation

Load the data: pandas is used to read the data from the CSV.

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)

Feature selection: The relevant features are selected for the classification.

# split dataset into features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = pima[feature_cols]   # Features
y = pima.label           # Target variable

Splitting the data: The dataset needs to be split into training and testing data. The training data is used to train the model, while the testing data is used to test the model's performance on unseen data.

# Split dataset into training set and test set (70% training, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

Building the decision tree: These few lines initialize, train and predict on the dataset.

# Create Decision Tree classifier object
clf = DecisionTreeClassifier()
# Train Decision Tree Classifier
clf = clf.fit(X_train, y_train)
# Predict the response for the test dataset
y_pred = clf.predict(X_test)

The model's accuracy is evaluated using sklearn's metrics module.

# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Output: Accuracy: 0.6753246753246753

The trained classifier corresponds to a decision tree, which can itself be drawn; a minimal sketch of how to do this is shown below.
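The original article displayed the learned tree as an image. As a hedged substitute, the sketch below shows one way the classifier trained above could be visualised with scikit-learn's plot_tree; re-fitting with max_depth=3 is an assumption added only to keep the drawing readable.

# Illustrative sketch: visualising the trained decision tree with matplotlib.
# Assumes DecisionTreeClassifier, feature_cols, X_train and y_train from the classification example above.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Re-fit with a small max_depth (an assumption) so the plotted tree stays readable.
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)

plt.figure(figsize=(12, 6))
plot_tree(clf, feature_names=feature_cols, class_names=["0", "1"], filled=True)
plt.show()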
2. Regression

For the regression example, instead of a real-world dataset, we will generate a NumPy array that simulates a scatter plot resembling a sine wave with a few randomly added noise elements.

# Import the necessary modules and libraries
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

This time we create two regression models to experiment with and see what overfitting looks like for a decision tree. Hence, we initialize two Decision Tree regression objects and train them on the given data.

# Fit regression models of different depths
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(X, y)
regr_2.fit(X, y)

After fitting the models, we predict on a custom test dataset and plot the results to see how they performed.

# Predict
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)

# Plot the results
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

In the resulting plot, we can clearly see that for this simple dataset the model with max_depth=5 (green) starts to overfit and learns the patterns of the noise along with the sine wave. Such models do not generalize well. Meanwhile, the model with max_depth=2 (blue) fits the dataset in a better way compared to the other one.

Conclusion

In this article, we tried to build an intuition, starting from the basics of the theory behind the working of a decision tree classifier. However, covering every aspect in detail is beyond the scope of this article, so it is suggested to go through a dedicated textbook to dive deeper into the specifics. Moving on, the code snippets introduce the "Hello World" of how to use both real-world data and artificially generated data to train a decision tree model and predict with it. This should give any novice a balanced theoretical and practical idea about the workings of Classification and Regression Trees and their implementation.