Regression is a part of machine learning that helps in solving tasks which can’t be explicitly programmed.

Machine learning uses several families of techniques: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

Supervised learning is one of the most popular learning methods, since it is easy to understand and relatively easy to implement and get relevant outputs from.

Consider this example: How does a child learn? It is taught how to walk, run, talk, and it is made to understand the difference between walking and running.

Supervised learning works in a similar way: human supervision is involved in the form of labelled data and feedback given to the model (whether it predicted correctly and, if not, what the right prediction should have been).

Once the algorithm has been fully trained on such data, it can predict outputs with good accuracy for never-before-seen inputs that resemble the data on which the model was trained. It is also understood as a task-oriented approach, since it focuses on a single task and is trained on a huge number of examples until it predicts the output accurately.

Supervised learning problems can be divided into regression and classification problems. Regression techniques include linear regression, while classification techniques include logistic regression (which, despite its name, is used for classification), multi-class classification, decision trees, and much more.

A regression problem is one in which the model yields a real, continuous value. The simplest model used to predict continuous variables is Linear Regression.

Linear Regression refers to an approach/algorithm that helps establish a linear relationship between the dependent and the independent variable.

As the name indicates, the relationship it models is linear. In its simplest form (simple linear regression) it involves 2 variables, so the fitted model can be drawn as a straight line on a 2-dimensional graph. These variables have continuous values (in contrast to the 0s and 1s predicted in logistic regression). The word ‘regression’ refers to finding the relationship between two variables, one of which is the dependent variable and the other the independent variable.

In simple words, it goes like this: suppose we are given a basic linear equation, say y = 3x - 1. Here ‘y’ is the dependent variable (since it depends on the value of x) and ‘x’ is the independent variable. This means that whenever ‘x’ changes, the value of ‘y’ changes according to the above-mentioned linear equation. Different values of ‘x’ are supplied, and each one gives a corresponding value of ‘y’. The values of ‘x’ and ‘y’ are shown in the table below:

| X | Y |
|---|---|
| 1 | 2 |
| 2 | 5 |
| 3 | 8 |
| 4 | 11 |
| 5 | 14 |
| 6 | 17 |
| 7 | 20 |
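The table can be reproduced with a few lines of Python. This is only a small sketch; the function name `linear` and its parameter names are chosen here for illustration and are not part of any library.

```python
def linear(x, slope=3, intercept=-1):
    """Evaluate y = slope * x + intercept (the defaults match y = 3x - 1)."""
    return slope * x + intercept


# The same x values as in the table
xs = list(range(1, 8))
ys = [linear(x) for x in xs]

for x, y in zip(xs, ys):
    print(x, y)
```

Changing `slope` or `intercept` gives a different straight line, which is exactly what a linear regression model has to estimate from data.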

These values are plotted on a graph, and we try to fit a straight line through all of these points (or as close to most of them as possible). The line is chosen so that the vertical distances between the points and the line are, overall, as small as possible. With real, noisy data some points don’t lie on the straight line; these are the points whose vertical distance from the line is not zero.

The idea is to find the straight line that has the minimum overall vertical distance from all the points in the graph. Below is an example illustrating the same:

When many points lie far from the fitted line compared to the ones that lie on or close to it, the ‘prediction error’ is considered to be high. The ‘error’ of a point refers to the vertical distance between the line and that point.

From the above graph, it can be observed that points 1, 2, 3 and 4, counting from the bottom left corner, don’t really fit the line.
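To make the fitting step concrete, here is a minimal sketch that fits a straight line by least squares using `numpy.polyfit`. The noisy `y` values are hypothetical numbers invented for this illustration (scattered around y = 3x - 1); they are not the points from the graph.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
# Hypothetical noisy observations scattered around y = 3x - 1
y = np.array([2.5, 4.4, 8.3, 10.6, 14.2, 17.1, 19.7])

# Degree-1 polynomial fit = least-squares straight line; returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

predicted = slope * x + intercept
# Vertical distance of each point from the fitted line (the per-point 'error')
residuals = y - predicted

print("slope:", slope, "intercept:", intercept)
print("residuals:", residuals)
```

The fitted slope and intercept come out close to 3 and -1, and the residuals show which points sit furthest from the line.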

When such a linear regression model is trained, a ‘cost function’ is calculated to measure how well it fits; a common choice is the ‘Root Mean Squared Error’, or RMSE for short. RMSE is based on the differences between the predicted values and the actual values. These differences are squared so that negative and positive values don’t cancel out, the squares are averaged (i.e. summed and divided by the total number of observations), and the square root of this average is taken.

The result is a single number that is used to understand how well the regression algorithm has predicted the output for the given input values and how close the predictions are to the actual outputs. The ‘cost function’ needs to be minimal, thereby corresponding to a minimum difference between the actual values and the predicted values.
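The RMSE steps described above (difference, square, average, square root) can be sketched as follows. The function name `rmse` and the sample numbers are assumptions made for this example only.

```python
import numpy as np


def rmse(actual, predicted):
    """Root Mean Squared Error between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = predicted - actual          # differences (may be negative)
    squared = errors ** 2                # squaring removes the sign
    return np.sqrt(np.mean(squared))     # average, then square root


# Hypothetical values: errors are 1, 0, -1, 2 -> squares 1, 0, 1, 4 -> mean 1.5
actual = [2, 5, 8, 11]
predicted = [3, 5, 7, 13]
print(rmse(actual, predicted))
```

A perfect prediction gives an RMSE of 0, and the value grows as the predictions move away from the actual outputs.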

Logistic Regression is a supervised classification algorithm that is used to differentiate between different events or values, for example filtering spam emails, classifying a transaction as legitimate or fraudulent, and much more. The variable in question is classified as 0 or 1, True or False, Yes or No, depending on the input.

It is a regression model in the sense that it predicts the probability of a data item belonging to a certain category. Logistic Regression uses a ‘sigmoid’ function, which is defined below:

$$g(z) = \frac{1}{1 + e^{-z}}$$

**Note:** The outcome of a Logistic Regression lies between the values 0 and 1; it can’t be greater than 1 and can’t be less than 0.

Logistic regression becomes a classification method when a decision threshold comes into play.
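As a small illustration of the sigmoid function and the decision threshold, consider the sketch below. The helper name `classify` and the threshold value of 0.5 are assumptions chosen for the example; other thresholds are equally valid depending on the problem.

```python
import numpy as np


def sigmoid(z):
    """Map any real value to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def classify(z, threshold=0.5):
    """Turn sigmoid outputs into class labels 0/1 using a decision threshold."""
    return (sigmoid(z) >= threshold).astype(int)


z = np.array([-4.0, 0.0, 4.0])
probs = sigmoid(z)       # probabilities, always between 0 and 1
labels = classify(z)     # class labels after applying the threshold

print(probs)
print(labels)
```

Raising the threshold makes the model more reluctant to predict class 1, and lowering it does the opposite.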

Other types of regression include:

- Polynomial regression
- Stepwise regression
- Ridge regression
- Lasso regression
- ElasticNet regression

The sigmoid function/logistic function looks like below:


Logistic Regression can be implemented from scratch, without using the scikit-learn module.

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.optimize as opt


def data_loading(path, header):
    marks_data_frame = pd.read_csv(path, header=header)
    return marks_data_frame


def sigmoid(x):
    # Activation function that maps a real value to a value between 0 and 1
    return 1 / (1 + np.exp(-x))


def total_input(theta, x):
    # Computes the weighted sum of inputs
    return np.dot(x, theta)


def probability(theta, x):
    # Returns the probability after the weighted sum goes through the sigmoid function
    return sigmoid(total_input(theta, x))


def cost_function(theta, x, y):
    # Cost function for all the training samples is computed
    m = x.shape[0]
    total_cost = -(1 / m) * np.sum(
        y * np.log(probability(theta, x))
        + (1 - y) * np.log(1 - probability(theta, x))
    )
    return total_cost


def gradient(theta, x, y):
    # Computes the gradient of the cost function at the point theta
    m = x.shape[0]
    return (1 / m) * np.dot(x.T, sigmoid(total_input(theta, x)) - y)


def fit(x, y, theta):
    # fmin_tnc works with a 1-D starting point and 1-D targets
    opt_weights = opt.fmin_tnc(func=cost_function, x0=theta.flatten(),
                               fprime=gradient, args=(x, y.flatten()))
    return opt_weights[0]


def predict(x):
    # Uses the fitted module-level `parameters`
    theta = parameters[:, np.newaxis]
    return probability(theta, x)


def accuracy(x, actual_classes, prob_threshold=0.5):
    predicted_classes = (predict(x) >= prob_threshold).astype(int)
    predicted_classes = predicted_classes.flatten()
    accuracy = np.mean(predicted_classes == actual_classes)
    return accuracy * 100


if __name__ == "__main__":
    # load data from the file
    data = data_loading("path to marks.csv file", None)

    # X = feature values, all columns except the last one
    X_data = data.iloc[:, :-1]
    # y = target values, last column of data frame
    y_data = data.iloc[:, -1]

    # filter out the applicants who were eligible
    admitted = data.loc[y_data == 1]
    # filter out the applicants who weren't eligible
    not_admitted = data.loc[y_data == 0]

    # plot the insights
    plt.scatter(admitted.iloc[:, 0], admitted.iloc[:, 1], s=10, label='Eligible')
    plt.scatter(not_admitted.iloc[:, 0], not_admitted.iloc[:, 1], s=10, label='Not eligible')
    plt.legend()
    plt.show()

    # add a column of ones for the intercept term
    X_data = np.c_[np.ones((X_data.shape[0], 1)), X_data]
    y_data = y_data.to_numpy()[:, np.newaxis]
    theta = np.zeros((X_data.shape[1], 1))

    parameters = fit(X_data, y_data, theta)

    # decision boundary: parameters[0] + parameters[1]*x1 + parameters[2]*x2 = 0
    x_values = [np.min(X_data[:, 1] - 5), np.max(X_data[:, 2] + 5)]
    y_values = - (parameters[0] + np.dot(parameters[1], x_values)) / parameters[2]

    plt.plot(x_values, y_values, label='Decision Boundary')
    plt.xlabel('Marks in 1st Exam')
    plt.ylabel('Marks in 2nd Exam')
    plt.legend()
    plt.show()

    print(accuracy(X_data, y_data.flatten()))
```

88.88888888888889

Logistic Regression implemented using scikit-learn module

It is fitted using MLE (Maximum Likelihood Estimation), which is an iterative process. The weights of the independent variables are given initial (for example, random) values and are then updated repeatedly until an optimal set of weights is reached, after which further updates produce little to no change in the output.
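The iterative idea can be sketched with plain gradient descent on the negative log-likelihood. This is an illustration under stated assumptions, not what scikit-learn does internally (its `LogisticRegression` uses dedicated solvers, `lbfgs` by default). The toy data, the function names, the learning rate and the zero starting weights (used instead of random ones so the result is reproducible) are all chosen for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def negative_log_likelihood(weights, X, y):
    # Average negative log-likelihood; eps guards against log(0)
    p = sigmoid(X @ weights)
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))


def fit_by_gradient_descent(X, y, learning_rate=0.1, n_iter=2000):
    weights = np.zeros(X.shape[1])   # starting weights (zeros here, could be random)
    history = []
    for _ in range(n_iter):
        p = sigmoid(X @ weights)
        grad = X.T @ (p - y) / len(y)        # gradient of the average NLL
        weights = weights - learning_rate * grad
        history.append(negative_log_likelihood(weights, X, y))
    return weights, history


# Hypothetical toy data: a bias column of ones plus one feature.
# The classes overlap slightly so that the optimal weights stay finite.
X = np.array([[1, 0.5], [1, 1.0], [1, 1.5], [1, 2.0],
              [1, 2.5], [1, 3.0], [1, 3.5], [1, 4.0]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

weights, history = fit_by_gradient_descent(X, y)
print("weights:", weights)
print("first and last cost:", history[0], history[-1])
```

The cost falls quickly at first and then flattens out, which is the "little to no change" behaviour described above.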

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def data_loading(path, header):
    marks_data_frame = pd.read_csv(path, header=header)
    return marks_data_frame


if __name__ == "__main__":
    # load data from the file
    data = data_loading("path-to-marks.csv file", None)

    # X = feature values, all columns except the last one
    X_data = data.iloc[:, :-1]
    # y = target values, last column of the data frame
    y_data = data.iloc[:, -1]

    # filter out applicants who are eligible
    admitted = data.loc[y_data == 1]
    # filter out applicants who aren't eligible
    not_admitted = data.loc[y_data == 0]

    # plot the insights
    plt.scatter(admitted.iloc[:, 0], admitted.iloc[:, 1], s=10, label='Eligible')
    plt.scatter(not_admitted.iloc[:, 0], not_admitted.iloc[:, 1], s=10, label='Not eligible')
    plt.legend()
    plt.show()

    model = LogisticRegression()
    model.fit(X_data, y_data)
    predicted_classes = model.predict(X_data)
    accuracy = accuracy_score(y_data, predicted_classes)
    parameters = model.coef_
```


**Applications of logistic regression**

- Weather forecasting
- Stock prediction
- Election poll results

In this post, we understood what Logistic Regression means, and saw its Python implementation both from scratch and using the scikit-learn library.
