What is Linear Regression in Machine Learning

Machine Learning, a subset of Artificial Intelligence (AI), plays a dominant role in our daily lives. Data science engineers and developers in many domains use machine learning algorithms to make their tasks simpler. For example, machine learning algorithms enable Google Maps to find the fastest route to our destinations, allow Tesla to make driverless cars, help Amazon generate almost 35% of its annual income, let AccuWeather forecast the weather for 3.5 million locations weeks in advance, and let Facebook automatically detect faces and suggest tags.

In statistics and machine learning, linear regression is one of the most popular and well understood algorithms. Most data science enthusiasts begin their journey with linear regression. In this article, we will look at how the linear regression algorithm works and how it can be used efficiently in your machine learning projects to build better models.

Linear Regression is a machine learning algorithm in which the result is predicted from known parameters that are correlated with the output. It is used to predict values within a continuous range rather than to classify them into categories. The known parameters are used to fit a straight line with a constant slope, which is then used to predict the unknown result.

What is a Regression Problem?

The majority of machine learning algorithms fall under the supervised learning category. This is the process where an algorithm learns to predict a result from previously entered values and the results generated from them. Suppose we have an input variable 'x' and an output variable 'y', where y is a function of x (y = f(x)). Supervised learning reads the values of the input variable 'x' and the resulting variable 'y' so that it can later predict an accurate value of 'y' from a new value of 'x'.
A regression problem is one where the resulting variable is a real or continuous value. It tries to draw the line of best fit through the data gathered from a number of points.

For example, which of these is a regression problem?

- How much gas will I spend if I drive for 100 miles?
- What is the nationality of a person?
- What is the age of a person?
- Which is the closest planet to the Sun?

Predicting the amount of gas to be spent and the age of a person are regression problems. Predicting nationality is categorical and the closest planet to the Sun is discrete.

What is Linear Regression?

Let's say we have a dataset which contains information about the relationship between 'number of hours studied' and 'marks obtained'. A number of students have been observed and their hours of study along with their marks are recorded. This will be our training data. Our goal is to design a model that can predict the marks if the number of hours studied is provided. Using the training data, a regression line is obtained which gives the minimum error. This linear equation is then applied to new data. That is, if we give the number of hours studied by a student as an input, our model should be able to predict their marks with minimum error.

Hypothesis of Linear Regression

The linear regression model can be represented by the following equation:

$$Y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

where,

- Y is the predicted value
- θ₀ is the bias term
- θ₁, …, θn are the model parameters
- x₁, x₂, …, xn are the feature values

The above hypothesis can also be represented by

$$Y = \theta^T x$$

where θ is the model's parameter vector including the bias term θ₀, and x is the feature vector with x₀ = 1.

For a model with a single predictor this becomes:

Y(pred) = b0 + b1*x

The values b0 and b1 must be chosen so that the error is minimum.
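As a rough illustration of the one-predictor hypothesis, the sketch below evaluates Y(pred) = b0 + b1*x for two candidate pairs of values and compares their sum of squared errors. The toy data and the helper names `predict` and `sum_squared_error` are made up for this example and do not come from any dataset used in the article.

```python
import numpy as np

def predict(x, b0, b1):
    """Evaluate the one-predictor hypothesis Y(pred) = b0 + b1 * x."""
    return b0 + b1 * np.asarray(x, dtype=float)

def sum_squared_error(x, y, b0, b1):
    """Sum of squared differences between observed y and predicted y."""
    residuals = np.asarray(y, dtype=float) - predict(x, b0, b1)
    return float(np.sum(residuals ** 2))

# Hypothetical toy data: hours studied and marks obtained.
# The marks lie exactly on the line marks = 1 + 2 * hours.
hours = [1, 2, 3, 4]
marks = [3, 5, 7, 9]

good_fit = sum_squared_error(hours, marks, b0=1.0, b1=2.0)
poor_fit = sum_squared_error(hours, marks, b0=0.0, b1=1.0)
print(f"error with b0=1, b1=2: {good_fit}")
print(f"error with b0=0, b1=1: {poor_fit}")
```

The pair of values with the smaller error is the better candidate for the regression line; fitting the model means searching for the pair with the smallest error of all.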
If the sum of squared errors is taken as the metric to evaluate the model, then the goal is to obtain the line that minimizes this error:

$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where y_i is the observed value and ŷ_i = b0 + b1*x_i is the predicted value. If we don't square the errors, the positive and negative errors will cancel each other out.

For a model with one predictor,

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

Exploring 'b1'

- If b1 > 0, then x (predictor) and y (target) have a positive relationship. That is, an increase in x will increase y.
- If b1 < 0, then x (predictor) and y (target) have a negative relationship. That is, an increase in x will decrease y.

Exploring 'b0'

- If the data does not include x = 0, then the prediction at x = 0, which is b0 alone, is meaningless. For example, suppose we have a dataset that relates height (x) and weight (y). Taking x = 0 (that is, height as 0) leaves only the b0 term in the equation, which has no meaning because in reality height and weight can never be zero. This is the result of applying the model beyond the range of the data.
- If the data includes x = 0, then 'b0' is the average of all predicted values when x = 0. But setting all the predictor variables to zero is often impossible.
- The value of b0 guarantees that the residuals have mean zero. If there is no 'b0' term, the regression is forced to pass through the origin, and both the regression coefficient and the predictions will be biased.

How does Linear Regression work?

Let's look at a scenario where linear regression might be useful: losing weight. Suppose there is a connection between how many calories you take in and how much you weigh; regression analysis can help you understand that connection. Regression analysis provides a relation which can be visualized as a graph in order to make predictions about your data.
For example, if you've been putting on weight over the last few years, it can predict how much you'll weigh in ten years if you continue to consume the same amount of calories and burn them at the same rate.

The goal of regression analysis is to create a trend line based on the data you have gathered. This then allows you to determine whether factors other than the amount of calories consumed affect your weight, such as the number of hours you sleep, work pressure, level of stress, the type of exercise you do and so on. Before taking these factors into account, we need to examine them and determine whether there is a correlation between them. Linear Regression can then be used to draw a trend line which confirms or denies the relationship between attributes. If the data is collected over a long period, more of it is available and the result can be evaluated more accurately. By the end of this article we will build a model that determines a line which best fits the data.

How do we determine the best fit line?

The best fit line is the line for which the error between the predicted values and the observed values is minimum. It is also called the regression line, and the errors are also known as residuals. A residual can be visualized as the vertical line from an observed data value to the regression line.

When to use Linear Regression?

Linear Regression's power lies in its simplicity, which means that it can be used to solve problems across various fields. First, the data collected from the observations needs to be plotted.
If the points lie close to a straight line, so that the differences between the predicted and observed values are small, we can use linear regression for the problem.

Assumptions in linear regression

If you are planning to use linear regression for your problem, there are some assumptions you need to consider:

- The relation between the dependent and independent variables should be almost linear.
- The data is homoscedastic, meaning the variance of the residuals should be roughly the same across all values of the independent variables.
- The result obtained from one observation should not be influenced by the result obtained from the previous observation.
- The residuals should be normally distributed. This assumption means that the probability density function of the residual values is normal at each value of the independent variable.

You can determine whether your data meets these conditions by plotting it and then doing a bit of digging into its structure.

Few properties of Regression Line

Here are a few features a regression line has:

- The regression line passes through the mean of the independent variable (x) as well as the mean of the dependent variable (y).
- The regression line minimizes the sum of the "Square of Residuals". That's why this method of Linear Regression is known as "Ordinary Least Squares (OLS)". We will discuss Ordinary Least Squares in more detail later on.
- B1 is the change in Y for a change in x of one unit. In other words, if we increase the value of 'x' by one unit, the predicted value of Y changes by B1.

Finding a Linear Regression line

Let's say we want to predict 'y' from 'x' given in the following table (Table 1) and assume they are related as "y = B0 + B1*x".

    x     y     Predicted 'y'
    1     2     B0 + B1*1
    2     1     B0 + B1*2
    3     3     B0 + B1*3
    4     6     B0 + B1*4
    5     9     B0 + B1*5
    6     11    B0 + B1*6
    7     13    B0 + B1*7
    8     15    B0 + B1*8
    9     17    B0 + B1*9
    10    20    B0 + B1*10

where,

    Std. Dev. of x               3.02765
    Std. Dev. of y               6.617317
    Mean of x                    5.5
    Mean of y                    9.7
    Correlation between x & y    0.989938

If the Residual Sum of Squares (RSS) is differentiated with respect to B0 and B1 and the results equated to zero, we get the following equations:

B1 = Correlation * (Std. Dev. of y / Std. Dev. of x)

B0 = Mean(Y) - B1 * Mean(X)

Putting the values from Table 1 into the above equations,

B1 = 0.989938 * (6.617317 / 3.02765) ≈ 2.1636

B0 = 9.7 - 2.1636 * 5.5 ≈ -2.2

Hence, the least squares regression equation becomes:

Y = -2.2 + 2.1636*x

    x     Y - Actual    Y - Predicted
    1     2             -0.04
    2     1             2.13
    3     3             4.29
    4     6             6.45
    5     9             8.62
    6     11            10.78
    7     13            12.95
    8     15            15.11
    9     17            17.27
    10    20            19.44

As there are only 10 data points, the fit is not perfect, but the correlation between the predicted and actual values is very high: when plotted, the two lines move almost together.

Model Performance

After the model is built, if the difference between the predicted and actual values is not much, it is considered to be a good model and can be used to make future predictions. How much counts as "not much" depends entirely on the task you want to perform and on how much variation in the data can be tolerated. Here are a few metrics we can use to measure the error in the model.

R-Square (R2)

Total Sum of Squares (TSS): A sum of squares is a measure of how a data set varies around a central number (like the mean). The Total Sum of Squares tells how much variation there is in the dependent variable.

TSS = Σ (Y - Mean[Y])²

Residual Sum of Squares (RSS): The residual sum of squares tells you how much of the dependent variable's variation your model did not explain. It is the sum of the squared differences between the actual Y and the predicted Y, f(X).

RSS = Σ (Y - f(X))²

(TSS - RSS) measures the amount of variability in the response that is explained by performing the regression. R2 is this explained amount as a fraction of the total:

R2 = (TSS - RSS) / TSS = 1 - RSS / TSS

Properties of R2

- For a least squares fit with an intercept, R2 on the training data always ranges from 0 to 1.
- An R2 of 0 means that the model explains none of the variation in the dependent variable, that is, there is no linear correlation between the dependent and the independent variable.
- An R2 of 1 means the dependent variable can be predicted from the independent variable without any error.
- An R2 between 0 and 1 indicates the extent to which the dependent variable is predictable. An R2 of 0.20 means that 20% of the variance in Y is predictable from X; an R2 of 0.40 means that 40% is predictable; and so on.

Root Mean Square Error (RMSE)

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). The formula for calculating RMSE is:

$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (\text{Predicted}_i - \text{Actual}_i)^2}$$

where N is the total number of observations.

When standardized observations are used as RMSE inputs, there is a direct relationship with the correlation coefficient. For example, if the correlation coefficient is 1, the RMSE will be 0, because all of the points lie on the regression line (and therefore there are no errors).

Mean Absolute Percentage Error (MAPE)

RMSE is expressed in the units of the target variable, which makes it hard to compare across tasks. Analysts therefore often prefer MAPE, which gives the error as a percentage and makes it easier to see how different models perform. The formula for calculating MAPE can be written as:

$$\text{MAPE} = \frac{100\%}{N}\sum_{i=1}^{N} \left|\frac{\text{Actual}_i - \text{Predicted}_i}{\text{Actual}_i}\right|$$

where N is the total number of observations.

Feature Selection

Feature selection is the automatic selection of the attributes in your data that are most relevant to the predictive model you are working on. It seeks to reduce the number of attributes in the dataset by leaving out the features which are not required for the model construction. The attribute is not deleted from the data itself; rather, it is ignored, and the model works with the features which do affect it.

Feature selection aids your mission to create an accurate predictive model. It helps you by choosing features that will give you as good or better accuracy whilst requiring less data. Feature selection methods can be used to identify and remove unnecessary, irrelevant and redundant attributes that do not contribute to the accuracy of the model or may even decrease it.
Having fewer attributes is desirable because it reduces the complexity of the model, and a simpler model is easier to understand, explain and work with.

Feature Selection Algorithms:

- Filter Method: This method involves assigning scores to individual features and ranking them. The features that have little to no impact are removed from consideration while constructing the model.
- Wrapper Method: The wrapper method is similar to the filter method, except that it considers attributes in groups: a combination of attributes is taken and checked for its impact on the model, and if it has none, another combination is tried.
- Embedded Method: The embedded method learns which features affect the model while the model is being constructed and takes into consideration only those features, which often makes it the most accurate of the three. The most common embedded feature selection methods are regularization methods.

Cost Function

The cost function helps to figure out the best possible values for the parameters of the line of best fit. As we want to reduce the error of the resulting value, we turn the problem of finding the line into the problem of minimizing the error between the predicted value and the actual value.

$$J = \frac{1}{n}\sum_{i=1}^{n} (\text{pred}_i - y_i)^2$$

Here, J is the cost function.

The function is written in this form to measure the difference between the predicted values and the observed values. We square the difference for each data point, sum the squares over all the data points and divide by the total number of data points. This cost function J is also called the Mean Squared Error (MSE) function. Using this MSE function, we look for parameter values at which the MSE settles at its minimum.

Gradient Descent

Gradient Descent is an optimization algorithm that helps machine learning models find a path to a minimum value using repeated steps.
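To make the cost function and the idea of repeated steps more concrete, here is a minimal sketch of gradient descent minimizing the MSE for a model with one predictor. The learning rate, the number of steps, the toy data and the function names are assumptions chosen for illustration; real problems usually need these values tuned.

```python
import numpy as np

def mse(x, y, b0, b1):
    """Mean squared error of the line b0 + b1 * x against observed y."""
    return float(np.mean((b0 + b1 * x - y) ** 2))

def gradient_descent(x, y, learning_rate=0.05, n_steps=5000):
    """Repeatedly step b0 and b1 against the gradient of the MSE."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_steps):
        error = (b0 + b1 * x) - y                # prediction minus actual
        grad_b0 = (2.0 / n) * np.sum(error)      # partial derivative w.r.t. b0
        grad_b1 = (2.0 / n) * np.sum(error * x)  # partial derivative w.r.t. b1
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

# Hypothetical noise-free data generated from y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x

b0, b1 = gradient_descent(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, MSE = {mse(x, y, b0, b1):.2e}")
```

On this toy data the estimates approach the values used to generate it. A learning rate that is too large makes the steps overshoot and diverge, while one that is too small makes convergence slow.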
Gradient descent is used to minimize a function, that is, to find the inputs for which the function gives its lowest output. The function being minimized is called the Loss Function. The loss function shows us how much error the machine learning model produces compared to the actual results. Our aim should be to lower the cost function as much as possible, and one way of achieving this is gradient descent. When the equations are too complex to solve directly, the partial derivative of the cost function with respect to each parameter indicates how to adjust that parameter towards its optimal value. You may refer to the article on Gradient Descent for Machine Learning.

Simple Linear Regression

Optimization is a big part of machine learning, and almost every machine learning algorithm has an optimization technique at its core. Gradient Descent is one such optimization algorithm, used to find the values of the coefficients of a function that minimize the cost function. Gradient Descent is best applied when the solution cannot be obtained by analytical methods (linear algebra) and must be found by an optimization technique.

Residual Analysis: Simple linear regression models the relationship between the magnitude of one variable and that of a second. For example, as x increases, y also increases; or as x increases, y decreases. Correlation is another way to measure how two variables are related. The models produced by simple linear regression try to predict the actual result, but most often they deviate from it. Residual analysis is used to calculate by how much the estimated value deviates from the actual result.

Null Hypothesis and p-value: During feature selection, the null hypothesis is used to find which attributes do not affect the result of the model. Hypothesis tests are used to test the validity of a claim that is made about a particular attribute of the model. The claim that is on trial is called the null hypothesis.
A p-value helps to determine the significance of the results. The p-value is a number between 0 and 1 and is interpreted in the following way:

- A small p-value (less than 0.05) indicates strong evidence against the null hypothesis, so the null hypothesis is rejected.
- A large p-value (greater than 0.05) indicates weak evidence against the null hypothesis, so we fail to reject the null hypothesis.
- A p-value very close to the cut-off (about 0.05) is considered marginal (it could go either way). In this case, the p-value should be reported to the readers so that they can draw their own conclusions.

Ordinary Least Squares

Ordinary Least Squares (OLS), also known as ordinary least squares regression or least squared errors regression, is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function so as to minimize the sum of the squares of the differences between the observed values of the dependent variable and the values predicted by the function. There are two types of relationships that may occur: linear and curvilinear. A linear relationship is a straight line drawn through the central tendency of the points, whereas a curvilinear relationship is a curved line. The association between the variables is depicted using a scatter plot. The relationship can be positive or negative, and it can also vary in strength.

The advantage of Ordinary Least Squares regression is that it is easy to interpret and is well supported by standard linear algebra routines. It can be applied to problems with many independent variables and scales efficiently to thousands of data points.
In Linear Regression, OLS is used to estimate the unknown parameters by creating a model which minimizes the sum of the squared errors between the observed data and the predicted data.

Let us simulate some data and look at how the predicted values (Yₑ) differ from the actual values (Y):

    import pandas as pd
    import numpy as np
    from matplotlib import pyplot as plt

    # Generate 'random' data
    np.random.seed(0)
    X = 2.5 * np.random.randn(100) + 1.5   # Array of 100 values with mean = 1.5, stddev = 2.5
    res = 0.5 * np.random.randn(100)       # Generate 100 residual terms
    y = 2 + 0.3 * X + res                  # Actual values of Y

    # Create pandas dataframe to store our X and y values
    df = pd.DataFrame(
        {'X': X,
         'y': y}
    )

    # Show the first five rows of our dataframe
    df.head()

              X         y
    0  5.910131  4.714615
    1  2.500393  2.076238
    2  3.946845  2.548811
    3  7.102233  4.615368
    4  6.168895  3.264107

To estimate y using the OLS method, we need to calculate xmean and ymean, the covariance of X and y (xycov), and the variance of X (xvar) before we can determine the values for alpha and beta.

    # Calculate the mean of X and y
    xmean = np.mean(X)
    ymean = np.mean(y)

    # Calculate the terms needed for the numerator and denominator of beta
    df['xycov'] = (df['X'] - xmean) * (df['y'] - ymean)
    df['xvar'] = (df['X'] - xmean)**2

    # Calculate beta and alpha
    beta = df['xycov'].sum() / df['xvar'].sum()
    alpha = ymean - (beta * xmean)
    print(f'alpha = {alpha}')
    print(f'beta = {beta}')

    alpha = 2.0031670124623426
    beta = 0.3229396867092763

Now that we have an estimate for alpha and beta, we can write our model as Yₑ = 2.003 + 0.323 X, and make predictions:

    ypred = alpha + beta * X

Let's plot our prediction ypred against the actual values of y, to get a better visual understanding of our model.

    # Plot regression against actual data
    plt.figure(figsize=(12, 6))
    plt.plot(X, ypred)       # regression line
    plt.plot(X, y, 'ro')     # scatter plot showing actual data
    plt.title('Actual vs Predicted')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.show()

The blue line in the resulting graph is our line of best fit, Yₑ = 2.003 + 0.323 X. If you observe the graph carefully, you will notice that there is a linear relationship between X and Y. Using this model, we can predict Y from any value of X. For example, for X = 8,

Yₑ = 2.003 + 0.323 (8) = 4.587

Regularization

Regularization is a form of regression that shrinks the coefficient estimates towards zero. This discourages the model from fitting patterns that do not represent the true properties of the relationship but have appeared in the training data by random chance. It does so by adding a penalty on the size of the coefficients to the quantity being minimized. Earlier we saw that to estimate the regression coefficients β in the least squares method, we must minimize the Residual Sum of Squares (RSS). Let the RSS equation in this case be:

$$\text{RSS} = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2$$

The general linear regression model can be expressed using a condensed formula:

$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

Here, β = [β0, β1, …, βp].

Minimizing the RSS adjusts the coefficients β based on the training data. If the training data contains noise, the estimated coefficients won't generalize well to future data. This is where regularization comes in and shrinks, or regularizes, these learned estimates towards zero.

Ridge regression

Ridge regression is very similar to least squares, except that the Ridge coefficients are estimated by minimizing a different quantity. In particular, the Ridge regression coefficients β are the values that minimize the following quantity:

$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = \text{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2$$

Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of the model. λ controls the relative impact of the two components: the RSS and the penalty term. If λ = 0, Ridge regression produces the same result as the least squares method. If λ → ∞, all estimated coefficients tend to zero. Ridge regression produces different estimates for different values of λ. The choice of λ is crucial and should be made with cross-validation.
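The effect of λ on the Ridge coefficients can be sketched with plain NumPy. This is a minimal illustration, not a production implementation: it assumes the predictors are centered so that the intercept β0 is not penalized, and it uses the closed-form solution β = (XᵀX + λI)⁻¹Xᵀy on the centered data. The function name `ridge_fit` and the simulated data are made up for the example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate; the intercept is left unpenalized."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # center the data
    n_features = X.shape[1]
    # Solve (Xc^T Xc + lam * I) coef = Xc^T yc
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

# Hypothetical data: y depends on two predictors plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=50)

coef_norms = []
for lam in (0.0, 10.0, 1000.0):
    intercept, coef = ridge_fit(X, y, lam)
    coef_norms.append(float(np.linalg.norm(coef)))
    print(f"lambda = {lam:7.1f}  intercept = {intercept:.3f}  coefficients = {coef}")
```

With λ = 0 the result coincides with ordinary least squares, and as λ grows the coefficients shrink towards zero, which is the behaviour described above.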
The penalty used by ridge regression is the squared L2 norm of the coefficient vector, which is why it is also known as L2 regularization.

The predictions generated by the Ordinary Least Squares method do not depend on the scale of the inputs: if an input variable is multiplied by a constant, the corresponding coefficient is divided by the same constant, so the product of the coefficient and the input variable remains the same. The same is not true for ridge regression, so we need to bring the input variables to the same scale before fitting. To standardize the variables, we subtract their means and divide by their standard deviations.

Lasso Regression

Least Absolute Shrinkage and Selection Operator (LASSO) regression also shrinks the coefficients by adding a penalty to the sum of squares of the residuals, but the lasso penalty has a slightly different effect. The lasso penalty is the sum of the absolute values of the coefficients, which corresponds to the L1 norm of the coefficient vector. Hence, the lasso estimate is defined as the β that minimizes:

$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = \text{RSS} + \lambda\sum_{j=1}^{p}|\beta_j|$$

Similar to ridge regression, the input variables need to be standardized. The lasso penalty makes the solution nonlinear, and unlike ridge regression there is no closed-form expression for the coefficients. Instead, the lasso solution is a quadratic programming problem, and efficient algorithms are available that compute the entire path of coefficients for different values of λ at the same computational cost as ridge regression.

The lasso penalty has the effect of gradually reducing some coefficients to exactly zero as the regularization increases. For this reason, the lasso can be used for the continuous selection of a subset of features.

Linear Regression with multiple variables

Linear regression with multiple variables is often called "multivariate linear regression", although strictly speaking that term refers to models with more than one dependent variable; "multiple linear regression" is the more precise name.
We now introduce notation for equations where we can have any number of input variables.

- x_j^(i) = value of feature j in the ith training example
- x^(i) = the input (features) of the ith training example
- m = the number of training examples
- n = the number of features

The multivariable form of the hypothesis function accommodating these multiple features is as follows:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n$$

In order to develop intuition about this function, we can think of θ0 as the basic price of a house, θ1 as the price per square meter, θ2 as the price per floor, etc. x1 will be the number of square meters in the house, x2 the number of floors, etc.

Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

$$h_\theta(x) = \begin{bmatrix}\theta_0 & \theta_1 & \dots & \theta_n\end{bmatrix}\begin{bmatrix}x_0 \\ x_1 \\ \vdots \\ x_n\end{bmatrix} = \theta^T x$$

This is a vectorization of our hypothesis function for one training example.

Remark: For convenience we assume x_0^(i) = 1 for i ∈ 1, …, m. This allows us to do matrix operations with θ and x, because the two vectors θ and x^(i) then match each other element-wise (that is, they have the same number of elements: n+1).

Multiple Linear Regression

How is it different?

In simple linear regression we use a single independent variable to predict the value of a dependent variable, whereas in multiple linear regression two or more independent variables are used to predict the value of a dependent variable. The difference between the two is the number of independent variables. In both cases there is only a single dependent variable.

Multicollinearity

Multicollinearity describes the strength of the relationship between the independent variables. It is a state of very high intercorrelations or inter-associations among them. It is therefore a type of disturbance in the data, and if it is present, the statistical inferences made about the data may not be reliable. The VIF (Variance Inflation Factor) is used to identify multicollinearity.
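As a rough sketch of how a VIF can be computed, the example below uses the standard definition VIF_j = 1 / (1 − R_j²), where R_j² is the R-squared obtained by regressing predictor j on all the other predictors. The definition is not spelled out in this article, and the function name `variance_inflation_factors` and the simulated predictors are assumptions made for illustration.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of X, using VIF_j = 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on the remaining predictors (with intercept)
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        rss = np.sum(residuals ** 2)
        tss = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - rss / tss
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Hypothetical predictors: c is nearly a copy of a, b is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], np.round(vifs, 2))))
```

In this setup the two nearly identical predictors get very large VIF values, while the unrelated predictor stays close to 1.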
As a common rule of thumb, if the VIF value of a variable is greater than 4, we exclude that variable from our model.

There are certain reasons why multicollinearity occurs:

- It is caused by an inaccurate use of dummy variables.
- It is caused by the inclusion of a variable which is computed from other variables in the data set.
- It can result from the repetition of the same kind of variable.
- It generally occurs when the variables are highly correlated with each other.

Multicollinearity can result in several problems:

- The partial regression coefficients may not be estimated precisely, and the standard errors are likely to be high.
- The signs as well as the magnitudes of the partial regression coefficients may change from one sample to another.
- It becomes tedious to assess the relative importance of the independent variables in explaining the variation in the dependent variable.

Iterative Models

Models should be tested and upgraded again and again for better performance. Multiple iterations allow the model to learn from its previous results and take them into consideration while performing the task again.

Making predictions with Linear Regression

Linear Regression can be used to predict the value of an unknown variable from a known variable with the help of a straight line (also called the regression line). The prediction should only be made if a significant correlation between the known and the unknown variable is found through both a correlation coefficient and a scatter plot.

The general procedure for using regression to make good predictions is the following:

- Research the subject area so that the model can be built based on the results produced by similar models. This research helps with the subsequent steps.
- Collect data for appropriate variables which have some correlation with the quantity being modelled.
- Specify and assess the regression model.
- Run repeated tests so that the model has more data to work with.

To test whether the model is good enough, observe whether:

- The scatter plot forms a linear pattern.
- The correlation coefficient r has a value above 0.5 or below -0.5. A positive value indicates a positive relationship and a negative value represents a negative relationship.

If the correlation coefficient shows a strong relationship between the variables but the scatter plot is not linear, the results can be misleading. Examples of how to use linear regression have been shown earlier.

Data preparation for Linear Regression

Step 1: Linear Assumption

The first step in data preparation is checking for variables which have some sort of linear correlation between the dependent and the independent variables.

Step 2: Remove Noise

This is the process of reducing the number of attributes in the dataset by eliminating the features which are of little to no use in the construction of the model.

Step 3: Remove Collinearity

Collinearity tells us the strength of the relationship between independent variables. If two or more variables are highly collinear, it does not make sense to keep all of them while evaluating the model, and hence we can keep just one.

Step 4: Gaussian Distributions

The linear regression model will produce more reliable results if the input and output variables have a Gaussian distribution. The central limit theorem states that a sample mean from an infinite population is approximately normal, or Gaussian, with mean the same as that of the underlying population and variance equal to the population variance divided by the sample size.
The approximation improves as the sample size gets large.

Step 5: Rescale Inputs

The linear regression model will produce more reliable predictions if the input variables are rescaled using standardization or normalization.

Linear Regression with statsmodels

We have already discussed the OLS method; now we will see how to use it in the statsmodels library. For this we will be using the popular advertising dataset. Here, we will only be looking at the TV variable and explore whether spending on TV advertising can predict the number of sales for the product. Let's start by importing this csv file as a pandas dataframe using read_csv():

    # Import and display first five rows of advertising dataset
    advert = pd.read_csv('advertising.csv')
    advert.head()

          TV  Radio  Newspaper  Sales
    0  230.1   37.8       69.2   22.1
    1   44.5   39.3       45.1   10.4
    2   17.2   45.9       69.3   12.0
    3  151.5   41.3       58.5   16.5
    4  180.8   10.8       58.4   17.9

Now we will use statsmodels' OLS function to initialize a simple linear regression model. It takes the formula y ~ X, where X is the predictor variable (TV advertising costs) and y is the output variable (Sales).
Then, we will fit the model by calling the OLS object's fit() method.

    import statsmodels.formula.api as smf

    # Initialise and fit linear regression model using `statsmodels`
    model = smf.ols('Sales ~ TV', data=advert)
    model = model.fit()

Once we have fit the simple regression model, we can predict the values of sales based on the equation we just derived using the .predict method, and also visualise our regression model by plotting sales_pred against the TV advertising costs to find the line of best fit.

    # Predict values
    sales_pred = model.predict()

    # Plot regression against actual data
    plt.figure(figsize=(12, 6))
    plt.plot(advert['TV'], advert['Sales'], 'o')           # scatter plot showing actual data
    plt.plot(advert['TV'], sales_pred, 'r', linewidth=2)   # regression line
    plt.xlabel('TV Advertising Costs')
    plt.ylabel('Sales')
    plt.title('TV vs Sales')
    plt.show()

In the resulting graph you will see that there is a positive linear relationship between TV advertising costs and Sales. You may also summarize this by saying that spending more on TV advertising predicts a higher number of sales.

Linear Regression with scikit-learn

Let us learn to implement linear regression models using sklearn. For this model as well, we will continue to use the advertising dataset, but now we will use two predictor variables to create a multiple linear regression model. Such a model has the form Yₑ = α + β₁X₁ + β₂X₂ + … + βₚXₚ, where p is the number of predictors.

In our example, we will be predicting Sales using the variables TV and Radio, i.e. our model can be written as:

Sales = α + β₁*TV + β₂*Radio

    from sklearn.linear_model import LinearRegression

    # Build linear regression model using TV and Radio as predictors
    # Split data into predictors X and output y
    predictors = ['TV', 'Radio']
    X = advert[predictors]
    y = advert['Sales']

    # Initialise and fit model
    lm = LinearRegression()
    model = lm.fit(X, y)
    print(f'alpha = {model.intercept_}')
    print(f'betas = {model.coef_}')

    alpha = 4.630879464097768
    betas = [0.05444896 0.10717457]

    model.predict(X)

Now that we have fit a multiple linear regression model to our data, we can predict sales from any combination of TV and Radio advertising costs. For example, suppose you want to know how many sales we would make if we invested $600 in TV advertising and $300 in Radio advertising. You can find out with:

    # Use the same column names as the training data
    new_X = pd.DataFrame({'TV': [600], 'Radio': [300]})
    print(model.predict(new_X))

    [69.4526273]

We get 69.45 as the output, which means that if we invest $600 in TV and $300 in Radio advertising, we can expect to sell approximately 69 units.

Summary

Let us sum up what we have covered in this article:

- How to understand a regression problem
- What linear regression is and how it works
- The Ordinary Least Squares method and Regularization
- Implementing Linear Regression in Python using the statsmodels and sklearn libraries

We have discussed a couple of ways to implement linear regression and build efficient models for certain business problems.
Rated 4.5/5 based on 4 customer reviews

What is Linear Regression in Machine Learning

Machine Learning, being a subset of Artificial Intelligence (AI), has been playing a dominant role in our daily lives. Data science engineers and developers working in various domains are widely using machine learning algorithms to make their tasks simpler and life easier. For example, certain machine learning algorithms enable Google Maps to find the fastest route to our destinations, allow Tesla to make driverless cars, help Amazon to generate almost 35% of their annual income, AccuWeather to get the weather forecast of 3.5 million locations weeks in advance, Facebook to automatically detect faces and suggest tags and so on.

In statistics and machine learning, linear regression is one of the most popular and well understood algorithms. Most data science enthusiasts and machine learning  fanatics begin their journey with linear regression algorithms. In this article, we will look into how linear regression algorithm works and how it can be efficiently used in your machine learning projects to build better models.

Linear Regression is one of the machine learning algorithms where the result is predicted by the use of known parameters which are correlated with the output. It is used to predict values within a continuous range rather than trying to classify them into categories. The known parameters are used to make a continuous and constant slope which is used to predict the unknown or the result.

What is a Regression Problem?

The majority of machine learning algorithms fall under the supervised learning category, in which an algorithm learns to predict a result from previously seen input values and the results recorded for them. Suppose we have an input variable ‘x’ and an output variable ‘y’, where y is a function of x (y = f(x)). Supervised learning reads the values of the input variable ‘x’ together with the resulting values of ‘y’ so that it can later predict ‘y’ accurately from a new value of ‘x’. A regression problem is one in which the output variable takes a real, continuous value. Linear regression tries to draw the line of best fit through the observed data points.

For example, which of these is a regression problem?

  • How much gas will I spend if I drive for 100 miles?
  • What is the nationality of a person?
  • What is the age of a person?
  • Which is the closest planet to the Sun?

Predicting the amount of gas to be spent and the age of a person are regression problems. Predicting nationality is categorical and the closest planet to the Sun is discrete.

What is Linear Regression?

Let’s say we have a dataset which contains information about the relationship between ‘number of hours studied’ and ‘marks obtained’. A number of students have been observed and their hours of study along with their marks are recorded. This will be our training data. Our goal is to design a model that can predict the marks if the number of hours studied is provided. Using the training data, we obtain the regression line that gives the minimum error. This linear equation is then applied to new data. That is, if we give the number of hours studied by a student as an input, our model should be able to predict their marks with minimum error.

Hypothesis of Linear Regression

The linear regression model can be represented by the following equation:

$$Y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

where,

Y is the predicted value

θ₀ is the bias term.

θ₁,…,θn are the model parameters

x₁, x₂,…,xn are the feature values.

The above hypothesis can also be written in vector form as

$$Y = h_\theta(x) = \theta^T x$$

where θ is the model’s parameter vector, including the bias term θ₀, and x is the feature vector with x₀ = 1.

For a single predictor x, the model reduces to

Y (pred) = b0 + b1*x

The values b0 and b1 must be chosen so that the error is minimum. If sum of squared error is taken as a metric to evaluate the model, then the goal is to obtain a line that best reduces the error.

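The sum of squared errors over n observations is

$$\text{Error} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where yᵢ is the actual value and ŷᵢ = b0 + b1*xᵢ is the predicted value for the i-th observation.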

If we don’t square the error, then the positive and negative points will cancel each other out.
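To make the cancellation concrete, here is a minimal sketch. The five data points are made up for illustration, and `np.polyfit` is used only as a convenient way to get a least-squares line.

```python
import numpy as np

# Hypothetical data, made up for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares line; for degree 1, np.polyfit returns [slope, intercept]
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

raw_sum = residuals.sum()             # positive and negative errors cancel out
squared_sum = (residuals ** 2).sum()  # squaring keeps every error non-negative

print(f"sum of residuals         = {raw_sum:.6f}")
print(f"sum of squared residuals = {squared_sum:.6f}")
```

The raw sum comes out at essentially zero even though the line does not pass through every point, so only the squared sum is useful as a measure of fit.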

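For a model with one predictor, the values of b0 and b1 that minimize the sum of squared errors are

$$b_0 = \bar{y} - b_1 \bar{x}$$

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where x̄ and ȳ are the means of x and y.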

Exploring ‘b1’

If b1 > 0, then x (predictor) and y(target) have a positive relationship. That is an increase in x will increase y.

If b1 < 0, then x (predictor) and y(target) have a negative relationship. That is an increase in x will decrease y.

Exploring ‘b0’

If the data do not include values near x=0, then b0 on its own has no meaningful interpretation. For example, suppose we have a dataset that relates height (x) and weight (y). Taking x=0 (that is, a height of 0) leaves only b0 in the equation, which is meaningless because in reality height and weight can never be zero. The problem arises from applying the model beyond the range of the data it was fitted on.

If the data do include x=0, then ‘b0’ is the average predicted value when x=0. But setting all the predictor variables to zero is often impossible.

Including b0 guarantees that the residuals have mean zero. If there is no ‘b0’ term, the regression line is forced to pass through the origin, and unless the true relationship really passes through the origin, both the regression coefficient and the predictions will be biased.
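The effect of dropping the intercept can be checked numerically. The sketch below uses simulated data with a true intercept of about 50 (an assumption made for illustration) and fits the line with and without a column of ones using `np.linalg.lstsq`.

```python
import numpy as np

# Simulated data with a clearly non-zero intercept (illustrative assumption)
rng = np.random.default_rng(0)
x = np.linspace(10, 20, 50)
y = 50 + 2 * x + rng.normal(0, 1, size=50)

# Fit with an intercept: the design matrix has a column of ones
A = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(A, y, rcond=None)
res_with = y - (b0 + b1 * x)

# Fit without an intercept: the line is forced through the origin
(slope_only,), *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
res_without = y - slope_only * x

print(f"with b0:    slope = {b1:.3f}, mean residual = {res_with.mean():.6f}")
print(f"without b0: slope = {slope_only:.3f}, mean residual = {res_without.mean():.6f}")
```

With the intercept, the residuals average to zero and the slope stays close to 2. Without it, the slope is inflated to compensate and the residuals no longer average to zero.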

How does Linear Regression work?

Let’s look at a scenario where linear regression might be useful: losing weight. Let us consider that there’s a connection between how many calories you take in and how much you weigh; regression analysis can help you understand that connection. Regression analysis will provide you with a relation which can be visualized into a graph in order to make predictions about your data. For example, if you’ve been putting on weight over the last few years, it can predict how much you’ll weigh in the next ten years if you continue to consume the same amount of calories and burn them at the same rate.

The goal of regression analysis is to create a trend line based on the data you have gathered. This then allows you to determine whether other factors apart from the amount of calories consumed affect your weight, such as the number of hours you sleep, work pressure, level of stress, type of exercises you do etc. Before taking such factors into account, we need to examine them and determine whether they are correlated with weight. Linear Regression can then be used to draw a trend line which confirms or denies the relationship between attributes. If the test is done over a long time duration, extensive data can be collected and the result can be evaluated more accurately. By the end of this article we will build a model which looks like the picture below, i.e., determine a line which best fits the data.

[Figure: a line of best fit drawn through the data points]

How do we determine the best fit line?

The best fit line is considered to be the line for which the error between the predicted values and the observed values is minimum. It is also called the regression line and the errors are also known as residuals. The figure shown below shows the residuals. It can be visualized by the vertical lines from the observed data value to the regression line.

[Figure: residuals shown as vertical lines from each observed data value to the regression line]

When to use Linear Regression?

Linear Regression’s power lies in its simplicity, which means that it can be used to solve problems across various fields. First, the data collected from the observations should be plotted to see whether the points fall roughly along a straight line. If they do, so that the differences between the predicted values and the observed results are small and show no pattern, we can use linear regression for the problem.

Assumptions in linear regression

If you are planning to use linear regression for your problem then there are some assumptions you need to consider:

  • The relation between the dependent and independent variables should be almost linear.
  • The data is homoscedastic, meaning the variance of the residuals should be roughly constant across all values of the independent variables.
  • The result obtained from one observation should not be influenced by the results obtained from previous observations; that is, the observations are independent.
  • The residuals should be normally distributed. This assumption means that the residual values follow a normal distribution at each value of the independent variable.

You can determine whether your data meets these conditions by plotting it and then doing a bit of digging into its structure.

Few properties of Regression Line

Here are a few features a regression line has:

  • The regression line passes through the mean of the independent variable (x) as well as the mean of the dependent variable (y).
  • The regression line minimizes the sum of the squares of the residuals. That’s why this method of fitting a linear regression is known as “Ordinary Least Squares (OLS)”. We will discuss Ordinary Least Squares in more detail later on.
  • B1 is the change in Y for a one-unit change in x. In other words, if we increase the value of ‘x’ by one unit, the predicted value of Y changes by B1.

Finding a Linear Regression line

Let’s say we want to predict ‘y’ from ‘x’ given in the following table and assume they are correlated as “y=B0+B1∗x”

x     y     Predicted 'y'
1     2     B0+B1∗1
2     1     B0+B1∗2
3     3     B0+B1∗3
4     6     B0+B1∗4
5     9     B0+B1∗5
6     11    B0+B1∗6
7     13    B0+B1∗7
8     15    B0+B1∗8
9     17    B0+B1∗9
10    20    B0+B1∗10

where,

Std. Dev. of x               3.02765
Std. Dev. of y               6.617317
Mean of x                    5.5
Mean of y                    9.7
Correlation between x & y    0.989938

If the Residual Sum of Square (RSS) is differentiated with respect to B0 & B1 and the results equated to zero, we get the following equation:

B1 = Correlation * (Std. Dev. of y/ Std. Dev. of x)

B0 = Mean(Y) – B1 * Mean(X)

Putting the values from the table above into these equations,

B1 = 0.989938 * (6.617317 / 3.02765) = 2.1636 ≈ 2.16

B0 = 9.7 – 2.1636 * 5.5 = -2.2

Hence, the least-squares regression equation becomes –

Y = -2.2 + 2.16*x

The predicted values below use the unrounded slope:

x     Y - Actual    Y - Predicted
1     2             -0.04
2     1             2.13
3     3             4.29
4     6             6.45
5     9             8.62
6     11            10.78
7     13            12.95
8     15            15.11
9     17            17.27
10    20            19.44

As there are only 10 data points, the fit is not perfect, but the correlation between the predicted and actual values is very high; both lines move almost together. Here is the graph for visualizing our predicted values:

[Figure: actual and predicted values of y plotted against x]
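The calculation in this section can be reproduced in a few lines of NumPy. This is a sketch that uses the ten (x, y) pairs from the table, the sample standard deviation (`ddof=1`), and `np.polyfit` as an independent cross-check; the variable names are chosen for illustration.

```python
import numpy as np

# The ten observations from the table
x = np.arange(1, 11)
y = np.array([2, 1, 3, 6, 9, 11, 13, 15, 17, 20])

r = np.corrcoef(x, y)[0, 1]   # correlation between x and y
sd_x = x.std(ddof=1)          # sample standard deviation of x
sd_y = y.std(ddof=1)          # sample standard deviation of y

b1 = r * sd_y / sd_x                # B1 = Correlation * (Std. Dev. of y / Std. Dev. of x)
b0 = y.mean() - b1 * x.mean()       # B0 = Mean(Y) - B1 * Mean(X)
y_pred = b0 + b1 * x

# Cross-check against NumPy's own least-squares fit
slope, intercept = np.polyfit(x, y, 1)

print(f"r = {r:.6f}, sd_x = {sd_x:.5f}, sd_y = {sd_y:.6f}")
print(f"B1 = {b1:.4f}, B0 = {b0:.4f}")
print(np.round(y_pred, 2))
```

Evaluating the fitted line at the mean of x returns the mean of y, which is the first property of the regression line listed earlier.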

Model Performance

After the model is built, if we see that the difference in the values of the predicted and actual data is not much, it is considered to be a good model and can be used to make future predictions. The amount that we consider “not much” entirely depends on the task you want to perform and to what percentage the variation in data can be handled. Here are a few metric tools we can use to calculate error in the model-

R – Square (R²)

$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$

Total Sum of Squares (TSS): A sum of squares is a measure of how a data set varies around a central number (like the mean). The Total Sum of Squares tells how much variation there is in the dependent variable.

TSS = Σ (Y – Mean[Y])²

Residual Sum of Squares (RSS): The residual sum of squares tells you how much of the dependent variable’s variation your model did not explain. It is the sum of the squared differences between the actual Y and the predicted Y.

RSS = Σ (Y – f[X])²

where f[X] is the value of Y predicted by the model.

(TSS – RSS) measures the amount of variability in the response that is explained by performing the regression.

Properties of R²

  • For a least-squares fit with an intercept, evaluated on the data used to fit it, R² ranges from 0 to 1.
  • R² of 0 means that the model explains none of the variation in the dependent variable; there is no linear relationship between the dependent and the independent variable.
  • R² of 1 means the dependent variable can be predicted from the independent variable without any error.
  • An R² between 0 and 1 indicates the extent to which the dependent variable is predictable. An R² of 0.20 means that 20% of the variance in Y is predictable from X; an R² of 0.40 means that 40% is predictable; and so on.
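The definitions of TSS and RSS translate directly into code. The helper below is a minimal sketch; the function name `r_squared` and the four sample values are assumptions made for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS, computed from actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total variation in Y
    rss = np.sum((y_true - y_pred) ** 2)         # variation the model did not explain
    return 1 - rss / tss

y = np.array([3.0, 5.0, 7.0, 9.0])

perfect = r_squared(y, y)                            # predictions equal the actual values
baseline = r_squared(y, np.full_like(y, y.mean()))   # always predicting the mean of Y
print(perfect, baseline)
```

A model that predicts every point exactly has RSS = 0 and therefore R² = 1, while a model that always predicts the mean has RSS = TSS and therefore R² = 0.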

Root Mean Square Error (RMSE)

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). The formula for calculating RMSE is:

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\text{Predicted}_i - \text{Actual}_i)^2}$$

Where N : Total number of observations

When standardized observations are used as RMSE inputs, there is a direct relationship with the correlation coefficient. For example, if the correlation coefficient is 1, the RMSE will be 0, because all of the points lie on the regression line (and therefore there are no errors).

Mean Absolute Percentage Error (MAPE)

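RMSE is expressed in the units of the dependent variable, which makes it hard to compare across datasets with different scales. For this reason analysts often use MAPE, which gives the error in terms of percentages, so that different models can be considered for the task and compared on how they perform. MAPE is undefined when an actual value is zero. The formula for calculating MAPE can be written as:

$$MAPE = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{\text{Actual}_i - \text{Predicted}_i}{\text{Actual}_i} \right|$$

Where N : Total number of observations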
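Both error metrics are short enough to write by hand. The sketch below assumes the standard definitions of RMSE and MAPE; the function names and the three sample values are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, in the same units as y."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined if any actual value is zero."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

actual = [100, 200, 400]
predicted = [110, 180, 400]

print(f"RMSE = {rmse(actual, predicted):.3f}")   # errors of 10, -20 and 0 units
print(f"MAPE = {mape(actual, predicted):.3f}%")  # errors of 10%, 10% and 0%
```

In this example the second prediction has the largest absolute error but the same percentage error as the first, which is the kind of difference between the two metrics that matters when the values of Y span a wide range.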

Feature Selection

Feature selection is the automatic selection of attributes for your data that are most relevant to the predictive model you are working on. It seeks to reduce the number of attributes in the dataset by eliminating the features which are not required for the model construction. Feature selection does not totally eliminate an attribute which is considered for the model, rather it mutes that particular characteristic and works with the features which affects the model.

Feature selection method aids your mission to create an accurate predictive model. It helps you by choosing features that will give you as good or better accuracy whilst requiring less data. Feature selection methods can be used to identify and remove unnecessary, irrelevant and redundant attributes from the data that do not contribute to the accuracy of the model or may even decrease the accuracy of the model. Having fewer attributes is desirable because it reduces the complexity of the model, and a simpler model is easier to understand, explain and to work with.

Feature Selection Algorithms:

  • Filter Method: This method involves assigning scores to individual features and ranking them. The features that have very little to almost no impact are removed from consideration while constructing the model.
  • Wrapper Method: Wrapper method is quite similar to Filter method except the fact that it considers attributes in a group i.e. a number of attributes are taken and checked whether they are having an impact on the model and if not another combination is applied.
  • Embedded Method: Embedded methods are often an effective choice. They learn which features affect the model while the model is being constructed and take into consideration only those features. The most common type of embedded feature selection methods are regularization methods.
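As an illustration of the filter method, the sketch below scores each feature by the absolute value of its correlation with the target and ranks the features. The simulated data, the feature names and the 0.1 cut-off are all assumptions chosen for the example; correlation is only one of several possible scoring functions.

```python
import numpy as np

# Simulated data: one strong feature, one weak feature, one pure-noise feature
rng = np.random.default_rng(42)
n = 200
relevant = rng.normal(size=n)
weak = rng.normal(size=n)
noise = rng.normal(size=n)
y = 3.0 * relevant + 0.5 * weak + rng.normal(scale=0.5, size=n)

features = {"relevant": relevant, "weak": weak, "noise": noise}

# Filter method: score every feature on its own, then rank
scores = {name: abs(np.corrcoef(col, y)[0, 1]) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

threshold = 0.1  # arbitrary cut-off, chosen for illustration
selected = [name for name in ranked if scores[name] >= threshold]

for name in ranked:
    print(f"{name:10s} score = {scores[name]:.3f}")
print("selected:", selected)
```

Because each feature is scored independently of the others, a filter like this is fast but cannot detect features that are only useful in combination, which is the gap the wrapper method tries to close.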

Cost Function

The cost function helps us find the best possible values of the model parameters, which define the line of best fit for the data points. Because we want the predicted values to be as close as possible to the actual values, we turn the search for these parameters into a minimization problem: minimize the error between the predicted value and the actual value.

$$J = \frac{1}{n} \sum_{i=1}^{n} (\text{pred}_i - y_i)^2$$

Here, J is the cost function, predᵢ is the predicted value and yᵢ is the actual value for the i-th of n data points.

The function measures the difference between the predicted values and the actual values. We square the difference for each data point, sum the squares over all the data points, and divide by the total number of data points. This cost function J is also called the Mean Squared Error (MSE) function. We then look for the parameter values at which the MSE reaches its minimum.
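A minimal sketch of the MSE cost for a single-predictor model follows. The function name `mse_cost` and the four data points (which lie exactly on y = 1 + 2x) are assumptions made for illustration.

```python
import numpy as np

def mse_cost(b0, b1, x, y):
    """Mean squared error of the line y = b0 + b1 * x on the given data."""
    predictions = b0 + b1 * x
    return np.mean((predictions - y) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])   # generated from y = 1 + 2x

print(mse_cost(1.0, 2.0, x, y))  # 0.0: the true parameters give no error
print(mse_cost(0.0, 2.0, x, y))  # 1.0: every prediction is off by exactly 1
```

Evaluating the cost at different parameter values like this is what an optimizer does repeatedly while searching for the minimum.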

Gradient Descent

Gradient Descent is an optimization algorithm that helps machine learning models find a path to a minimum value using repeated steps. It is used to minimize a function, called the loss function (or cost function), which shows how much error the machine learning model produces compared to the actual results. Our aim should be to lower the cost function as much as possible, and gradient descent is one way of achieving this. When an equation is too complex to solve directly, gradient descent uses the partial derivative of the cost function with respect to each parameter to move, step by step, towards the optimal coefficient values. You may refer to the article on Gradient Descent for Machine Learning.

Simple Linear Regression

Optimization is a big part of machine learning and almost every machine learning algorithm has an optimization technique at its core for increased efficiency. Gradient Descent is such an optimization algorithm used to find values of coefficients of a function that minimizes the cost function. Gradient Descent is best applied when the solution cannot be obtained by analytical methods (linear algebra) and must be obtained by an optimization technique.
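The sketch below applies gradient descent to the MSE cost of a single-predictor model and compares the result with the closed-form least-squares fit from `np.polyfit`. The learning rate, the number of iterations and the simulated data are assumptions chosen so that this small example converges; they are not general recommendations.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.05, n_iters=5000):
    """Minimize the MSE of y = b0 + b1 * x by repeated gradient steps."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iters):
        error = (b0 + b1 * x) - y
        grad_b0 = (2 / n) * error.sum()        # partial derivative of MSE w.r.t. b0
        grad_b1 = (2 / n) * (error * x).sum()  # partial derivative of MSE w.r.t. b1
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

# Simulated data around the line y = 2 + 0.3x (illustrative assumption)
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=100)
y = 2 + 0.3 * x + rng.normal(0, 0.5, size=100)

b0_gd, b1_gd = gradient_descent(x, y)
b1_ols, b0_ols = np.polyfit(x, y, 1)   # analytical least-squares solution

print(f"gradient descent: b0 = {b0_gd:.4f}, b1 = {b1_gd:.4f}")
print(f"closed form:      b0 = {b0_ols:.4f}, b1 = {b1_ols:.4f}")
```

For linear regression the analytical solution is available, so gradient descent is not strictly needed here; the comparison simply shows that both routes arrive at the same coefficients. A learning rate that is too large would make the updates diverge instead.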

Residual Analysis: Simple linear regression models the relationship between the magnitude of one variable and that of a second—for example, as x increases, y also increases. Or as x increases, y decreases. Correlation is another way to measure how two variables are related. The models done by simple linear regression estimate or try to predict the actual result but most often they deviate from the actual result. Residual analysis is used to calculate by how much the estimated value has deviated from the actual result.

Null Hypothesis and p-value: During feature selection, hypothesis tests are used to test the validity of a claim that is made about a particular attribute of the model. The claim that’s on trial is called the null hypothesis; here it typically states that an attribute has no effect on the result of the model. A p-value helps to determine the significance of the results. The p-value is a number between 0 and 1 and is interpreted in the following way:

  • A small p-value (less than 0.05) indicates strong evidence against the null hypothesis, so the null hypothesis is rejected.
  • A large p-value (greater than 0.05) indicates weak evidence against the null hypothesis, so we fail to reject the null hypothesis.
  • A p-value very close to the cut-off (0.05) is considered to be marginal (could go either way). In this case, the p-value should be provided to the readers so that they can draw their own conclusions.
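As a sketch of how such a p-value is obtained in practice, the example below uses `scipy.stats.linregress`, which tests the null hypothesis that the slope is zero. SciPy is an assumption here (the article itself uses statsmodels later), and the data are simulated so that one attribute truly affects y and the other does not.

```python
import numpy as np
from scipy import stats

# Simulated data: y depends on x but not on `unrelated`
rng = np.random.default_rng(0)
x = rng.normal(size=100)
unrelated = rng.normal(size=100)
y = 2 + 0.5 * x + rng.normal(scale=0.5, size=100)

# Null hypothesis in each fit: the slope of the regression line is zero
fit_x = stats.linregress(x, y)
fit_unrelated = stats.linregress(unrelated, y)

print(f"x:         slope = {fit_x.slope:.3f}, p-value = {fit_x.pvalue:.3g}")
print(f"unrelated: slope = {fit_unrelated.slope:.3f}, p-value = {fit_unrelated.pvalue:.3g}")
```

The p-value for `x` is far below 0.05, so the null hypothesis of no effect is rejected. For an attribute with no real effect the p-value will usually be large, but it can still fall below 0.05 by chance about 5% of the time.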

Ordinary Least Square

Ordinary Least Squares (OLS), also known as ordinary least squares regression or least squared errors regression, is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function so as to minimize the sum of the squares of the differences between the observed values of the dependent variable and the values predicted by the linear function.

There are two types of relationships that may occur: linear and curvilinear. A linear relationship is a straight line that is drawn through the central tendency of the points; whereas a curvilinear relationship is a curved line. Association between the variables are depicted by using a scatter plot. The relationship could be positive or negative, and result variation also differs in strength.

The advantage of using Ordinary Least Squares regression is that it is easy to interpret and maps directly onto the linear algebra routines built into modern computing libraries. It can be applied to problems with many independent variables and scales efficiently to thousands of data points. In Linear Regression, OLS is used to estimate the unknown parameters by creating a model which minimizes the sum of the squared errors between the observed data and the predicted values.

Let us simulate some data and look at how the predicted values (Yₑ) differ from the actual value (Y):

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

# Generate 'random' data
np.random.seed(0)
X = 2.5 * np.random.randn(100) + 1.5         # Array of 100 values with mean = 1.5, stddev = 2.5
res = 0.5 * np.random.randn(100)         # Generate 100 residual terms
y = 2 + 0.3 * X + res                    # Actual values of Y

# Create pandas dataframe to store our X and y values
df = pd.DataFrame(
    {'X': X,
      'y': y}
)

# Show the first five rows of our dataframe
df.head()

          X         y
0  5.910131  4.714615
1  2.500393  2.076238
2  3.946845  2.548811
3  7.102233  4.615368
4  6.168895  3.264107

To estimate y using the OLS method, we need to calculate xmean and ymean, the covariance of X and y (xycov), and the variance of X (xvar) before we can determine the values for alpha and beta.

# Calculate the mean of X and y
xmean = np.mean(X)
ymean = np.mean(y)

# Calculate the terms needed for the numerator and denominator of beta
df['xycov'] = (df['X'] - xmean) * (df['y'] - ymean)
df['xvar'] = (df['X'] - xmean)**2

# Calculate beta and alpha
beta = df['xycov'].sum() / df['xvar'].sum()
alpha = ymean - (beta * xmean)
print(f'alpha = {alpha}')
print(f'beta = {beta}')

alpha = 2.0031670124623426
beta = 0.3229396867092763

Now that we have an estimate for alpha and beta, we can write our model as Yₑ = 2.003 + 0.323 X, and make predictions:

ypred = alpha + beta * X

Let’s plot our prediction ypred against the actual values of y, to get a better visual understanding of our model.

# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(X, ypred) # regression line
plt.plot(X, y, 'ro')   # scatter plot showing actual data
plt.title('Actual vs Predicted')
plt.xlabel('X')
plt.ylabel('y')

plt.show()

[Figure: scatter plot of the actual values of y against X, with the fitted regression line in blue]

The blue line in the above graph is our line of best fit, Yₑ = 2.003 + 0.323 X.  If you observe the graph carefully, you will notice that there is a linear relationship between X and Y. Using this model, we can predict Y from any values of X. For example, for X = 8,

Yₑ = 2.003 + 0.323 (8) = 4.587

Regularization

Regularization is a technique that shrinks the coefficient estimates towards zero. This discourages the model from fitting patterns that don’t actually represent the true properties of the data, but have appeared by random chance. Earlier we saw that to estimate the regression coefficients β in the least squares method, we must minimize the Residual Sum of Squares (RSS). Let the RSS equation in this case be:

$$RSS = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2$$

The general linear regression model can be expressed using a condensed formula:

$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

Here, β = [β₀, β₁, …, βₚ]

Minimizing the RSS adjusts the coefficients β based on the training data. If there is noise in the training data, the estimated coefficients won’t generalize well to future data. This is where regularization comes in and shrinks or regularizes these learned estimates towards zero.

Ridge regression

Ridge regression is very similar to least squares, except that the Ridge coefficients are estimated by minimizing a different quantity. In particular, the Ridge regression coefficients β are the values that minimize the following quantity:

$$RSS + \lambda \sum_{j=1}^{p} \beta_j^2$$

Here, λ ≥ 0 is the tuning parameter that decides how much we want to penalize the flexibility of the model. λ controls the relative impact of the two components: RSS and the penalty term. If λ = 0, Ridge regression produces the same result as the least squares method. If λ → ∞, all estimated coefficients tend to zero. Ridge regression produces different estimates for different values of λ. The optimal choice of λ is crucial and is usually made with cross-validation. The penalty term is the squared L2 norm of the coefficient vector, which is why ridge regression is also known as L2 regularization.

The predictions of the Ordinary Least Squares method do not depend on the scale of the inputs: if an input variable is multiplied by a constant, the corresponding coefficient is divided by the same constant, so the product of the coefficient and the input variable remains the same. The same is not true for ridge regression, so we need to bring the input variables to the same scale before we fit the model. To standardize the variables, we subtract their means and divide by their standard deviations.
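Unlike the lasso, ridge regression has a closed-form solution, which makes the shrinkage easy to demonstrate. The sketch below standardizes the inputs, centers y so that the intercept is not penalized, and solves for the coefficients at three values of λ. The function name, the simulated data and the λ values are assumptions chosen for illustration.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution on standardized X and centered y."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize every input column
    yc = y - y.mean()                           # centering y removes the intercept
    p = Xs.shape[1]
    # Solve (Xs^T Xs + lam * I) beta = Xs^T yc
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)

# Simulated inputs on very different scales (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])
y = X @ np.array([2.0, 0.5, 0.01]) + rng.normal(size=100)

lambdas = [0.0, 10.0, 1000.0]
norms = []
for lam in lambdas:
    beta = ridge_coefficients(X, y, lam)
    norms.append(np.linalg.norm(beta))
    print(f"lambda = {lam:7.1f}  coefficients = {np.round(beta, 3)}")
```

At λ = 0 the result is the ordinary least squares fit on the standardized inputs; as λ grows, every coefficient is pulled towards zero but none becomes exactly zero.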

Lasso Regression

Least Absolute Shrinkage and Selection Operator (LASSO) regression also shrinks the coefficients by adding a penalty to the sum of squares of the residuals, but the lasso penalty has a slightly different effect. The lasso penalty is the sum of the absolute values of the coefficient vector, which corresponds to its L1 norm. Hence, the lasso estimate is defined by:

$$RSS + \lambda \sum_{j=1}^{p} |\beta_j|$$

Similar to ridge regression, the input variables need to be standardized. The lasso penalty makes the solution nonlinear, and unlike ridge regression there is no closed-form expression for the coefficients. Instead, the lasso solution is a quadratic programming problem, and efficient algorithms are available that compute the entire path of coefficients for different values of λ at the same computational cost as for ridge regression.

The lasso penalty has the effect of gradually reducing some coefficients to exactly zero as the regularization increases. For this reason, the lasso can be used as a form of continuous feature subset selection.
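One simple way to compute a lasso solution is cyclic coordinate descent with soft-thresholding. The sketch below is a bare-bones version for illustration, not the path algorithm referred to above; it minimizes RSS + λ Σ|βⱼ| on standardized inputs. The function names, the fixed number of sweeps, the simulated data and λ = 100 are all assumptions.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho towards zero by t, returning 0 if |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize RSS + lam * sum(|beta_j|) on standardized X and centered y."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    beta = np.zeros(Xs.shape[1])
    for _ in range(n_iters):
        for j in range(len(beta)):
            # Residual with feature j's current contribution added back in
            partial = yc - Xs @ beta + Xs[:, j] * beta[j]
            rho = Xs[:, j] @ partial
            beta[j] = soft_threshold(rho, lam / 2) / (Xs[:, j] @ Xs[:, j])
    return beta

# Simulated data: only the first two of four features affect y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

beta_ols = lasso_coordinate_descent(X, y, lam=0.0)      # no penalty: least squares
beta_lasso = lasso_coordinate_descent(X, y, lam=100.0)  # moderate penalty

print("lambda = 0:  ", np.round(beta_ols, 3))
print("lambda = 100:", np.round(beta_lasso, 3))
```

With the penalty switched on, the coefficients of the two irrelevant features are set to exactly zero while the two relevant ones are shrunk but kept, which is the feature selection behaviour described above.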

Linear Regression with multiple variables

Linear regression with multiple variables is also known as "multivariate linear regression". We now introduce notation for equations where we can have any number of input variables.

$x_j^{(i)}$ = value of feature j in the ith training example

$x^{(i)}$ = the input (features) of the ith training example

m = the number of training examples

n = the number of features

The multivariable form of the hypothesis function accommodating these multiple features is as follows:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n$$

In order to develop intuition about this function, we can think about θ0 as the basic price of a house, θ1 as the price per square meter, θ2 as the price per floor, etc. x1 will be the number of square meters in the house, x2 the number of floors, etc.

Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

$$h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \dots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$$

This is a vectorization of our hypothesis function for one training example; see the lessons on vectorization to learn more.

Remark: Note that for convenience we assume $x_0^{(i)} = 1$ for i ∈ 1,…,m. This allows us to do matrix operations with θ and x, because it makes the two vectors θ and $x^{(i)}$ match each other element-wise (that is, have the same number of elements: n+1).
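The vectorized hypothesis can be sketched with NumPy using the house-price intuition from above. The parameter values and the two example houses are hypothetical numbers chosen for illustration.

```python
import numpy as np

# Hypothetical parameters: base price, price per square meter, price per floor
theta = np.array([50_000.0, 1_200.0, 8_000.0])

# Two hypothetical houses: [square meters, number of floors]
features = np.array([[100.0, 1.0],
                     [150.0, 2.0]])

# Prepend x0 = 1 to every example so that theta and x both have n + 1 elements
X = np.column_stack([np.ones(len(features)), features])

# h_theta(x) = theta^T x for one example; X @ theta evaluates it for all examples
predictions = X @ theta
print(predictions)  # [178000. 246000.]
```

The first prediction is 50,000 + 1,200 × 100 + 8,000 × 1 = 178,000, which matches the term-by-term form of the hypothesis function.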

Multiple Linear Regression

How is it different?

In simple linear regression we use a single independent variable to predict the value of a dependent variable whereas in multiple linear regression two or more independent variables are used to predict the value of a dependent variable. The difference between the two is the number of independent variables. In both cases there is only a single dependent variable.

Multicollinearity

Multicollinearity is a state of very high intercorrelations or inter-associations among the independent variables. It is therefore a type of disturbance in the data, and if it is present, the statistical inferences made about the data may not be reliable. VIF (Variance Inflation Factor) is used to identify multicollinearity. A common rule of thumb is to exclude a variable from the model if its VIF value is greater than 4, although other cut-offs such as 5 or 10 are also used.
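The VIF of a variable can be computed by regressing it on all the other independent variables and taking 1 / (1 − R²) of that regression. The sketch below does this with plain NumPy; the function name `vif` and the simulated columns are assumptions for illustration (statsmodels also provides a VIF helper, which is not used here).

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(target)), others])  # intercept + other columns
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1 / (1 - r2))
    return np.array(factors)

# Simulated columns: c is nearly a copy of a, b is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)

vif_values = vif(np.column_stack([a, b, c]))
print(np.round(vif_values, 1))
```

The two nearly identical columns get very large VIF values, while the independent column stays close to 1, so one of `a` or `c` would be a candidate for removal.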

There are certain reasons why multicollinearity occurs:

  • It is caused by an inaccurate use of dummy variables.
  • It is caused by the inclusion of a variable which is computed from other variables in the data set.
  • Multicollinearity can also result from the repetition of the same kind of variable.
  • Generally occurs when the variables are highly correlated to each other.

Multicollinearity can result in several problems. These problems are as follows:

  • The partial regression coefficient due to multicollinearity may not be estimated precisely. The standard errors are likely to be high.
  • Multicollinearity results in a change in the signs as well as in the magnitudes of the partial regression coefficients from one sample to another sample.
  • Multicollinearity makes it difficult to assess the relative importance of the independent variables in explaining the variation in the dependent variable.

Iterative Models

Models should be tested and upgraded again and again for better performance. Multiple iterations allow the model to learn from its previous results and take them into consideration while performing the task again.

Making predictions with Linear Regression

Linear Regression can be used to predict the value of an unknown variable using a known variable by the help of a straight line (also called the regression line). The prediction can only be made if it is found that there is a significant correlation between the known and the unknown variable through both a correlation coefficient and a scatterplot.

The general procedure for using regression to make good predictions is the following:

  • Research the subject-area so that the model can be built based on the results produced by similar models. This research helps with the subsequent steps.
  • Collect data for appropriate variables which have some correlation with the model.
  • Specify and assess the regression model.
  • Run repeated tests so that the model has more data to work with.

To test if the model is good enough observe whether:

  • The scatter plot forms a linear pattern.
  • The correlation coefficient r, has a value above 0.5 or below -0.5. A positive value indicates a positive relationship and a negative value represents a negative relationship.

If the correlation coefficient shows a strong relationship between variables but the scatter plot is not linear, the results can be misleading. Examples on how to use linear regression have been shown earlier.

Data preparation for Linear Regression

Step 1: Linear Assumption

The first step for data preparation is checking for the variables which have some sort of linear correlation between the dependent and the independent variables.

Step 2: Remove Noise

This step reduces the number of attributes in the dataset by eliminating the features that contribute little or nothing to the construction of the model.

Step 3: Remove Collinearity

Collinearity tells us the strength of the relationship between independent variables. If two or more variables are highly collinear, it does not make sense to keep all of them while evaluating the model, so we keep only one of them.
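
One simple way to spot collinear predictors is to look at their pairwise correlations. The sketch below uses synthetic features; the names x1, x2, x3 and the 0.9 cut-off are illustrative assumptions, not values from this article.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)                          # unrelated to x1 and x2

features = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3"]

# Correlation matrix between the columns (one column per feature)
corr = np.corrcoef(features, rowvar=False)

threshold = 0.9  # illustrative cut-off; choose one that suits your data
collinear_pairs = [
    (names[i], names[j], corr[i, j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]

for a, b, r in collinear_pairs:
    print(f"{a} and {b} are highly collinear (r = {r:.3f}); consider keeping only one")
```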

Step 4: Gaussian Distributions

The linear regression model will produce more reliable results if the input and output variables have a Gaussian distribution. The central limit theorem states that the mean of a sample drawn from a population is approximately normal, or Gaussian, with the same mean as the underlying population and a variance equal to the population variance divided by the sample size. The approximation improves as the sample size gets larger.
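
The statement about the sample mean can be checked with a small simulation. This is only a sketch: the exponential population, the sample size of 50 and the number of repetitions are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
sample_size = 50
n_samples = 20000

# Exponential population with scale 1: mean 1, variance 1, clearly non-Gaussian
samples = rng.exponential(scale=1.0, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

mean_of_means = sample_means.mean()
var_of_means = sample_means.var()
expected_var = 1.0 / sample_size  # population variance divided by sample size

print(f"mean of sample means:     {mean_of_means:.4f} (population mean is 1)")
print(f"variance of sample means: {var_of_means:.4f} (expected about {expected_var:.4f})")
```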

Step 5: Rescale Inputs

A linear regression model will produce more reliable predictions if the input variables are rescaled using standardization or normalization.
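
Both rescaling options can be written directly in NumPy. The sketch below uses the TV and Radio columns of the five sample rows shown later in this article; the variable names are illustrative.

```python
import numpy as np

# Columns: TV, Radio (first five rows of the advertising data)
X = np.array([[230.1, 37.8],
              [44.5, 39.3],
              [17.2, 45.9],
              [151.5, 41.3],
              [180.8, 10.8]])

# Standardization: each column gets zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalization (min-max scaling): each column is mapped to the range [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_std)
print(X_norm)
```

In a real project the scaling statistics should be computed on the training data only and then reused for any new data.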

Linear Regression with statsmodels

We have already discussed the OLS method. Now we will see how to use it with the statsmodels library. For this we will use the popular advertising dataset. Here, we will look only at the TV variable and explore whether spending on TV advertising can predict the number of sales for the product. Let’s start by importing the CSV file as a pandas DataFrame using read_csv():

# Import pandas, then load and display the first five rows of the advertising dataset
import pandas as pd

advert = pd.read_csv('advertising.csv')
advert.head()

      TV  Radio  Newspaper  Sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3   12.0
3  151.5   41.3       58.5   16.5
4  180.8   10.8       58.4   17.9

Now we will use the statsmodels ols function to initialize a simple linear regression model. It takes a formula of the form y ~ X, where X is the predictor variable (TV advertising costs) and y is the output variable (Sales). Then we fit the model by calling the OLS object’s fit() method.

import statsmodels.formula.api as smf

# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('Sales ~ TV', data=advert)
model = model.fit()
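
For a single predictor, the least-squares estimates have a well-known closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the two means. The sketch below applies these formulas to the five rows displayed above only, so the numbers will differ from a fit on the full dataset; the variable names are illustrative.

```python
import numpy as np

# First five rows of the advertising data shown above
tv = np.array([230.1, 44.5, 17.2, 151.5, 180.8])
sales = np.array([22.1, 10.4, 12.0, 16.5, 17.9])

# Closed-form ordinary least squares for one predictor
slope = np.sum((tv - tv.mean()) * (sales - sales.mean())) / np.sum((tv - tv.mean()) ** 2)
intercept = sales.mean() - slope * tv.mean()

# Fitted values on the regression line
sales_hat = intercept + slope * tv

print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
```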

Once we have fit the simple regression model, we can predict sales from the fitted equation using the .predict() method. We can also visualise the regression model by plotting sales_pred against the TV advertising costs to show the line of best fit.

import matplotlib.pyplot as plt

# Predict values
sales_pred = model.predict()

# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(advert['TV'], advert['Sales'], 'o')       # scatter plot showing actual data
plt.plot(advert['TV'], sales_pred, 'r', linewidth=2)   # regression line
plt.xlabel('TV Advertising Costs')
plt.ylabel('Sales')
plt.title('TV vs Sales')

plt.show()

Figure: Linear Regression with statsmodels (TV advertising costs vs Sales, with the fitted regression line)

The above graph shows a positive linear relationship between TV advertising costs and Sales. In other words, spending more on TV advertising predicts a higher number of sales.

Linear Regression with scikit-learn

Let us learn to implement linear regression models using sklearn. For this model as well, we will continue to use the advertising dataset but now we will use two predictor variables to create a multiple linear regression model. 

Yₑ = α + β₁X₁ + β₂X₂ + … + βₚXₚ, where p is the number of predictors.

In our example, we will be predicting Sales using the variables TV and Radio i.e. our model can be written as:

Sales = α + β₁*TV + β₂*Radio

from sklearn.linear_model import LinearRegression

# Build linear regression model using TV and Radio as predictors
# Split data into predictors X and output Y
predictors = ['TV', 'Radio']
X = advert[predictors]
y = advert['Sales']

# Initialise and fit model
lm = LinearRegression()
model = lm.fit(X, y)
print(f'alpha = {model.intercept_}')   # prints: alpha = 4.630879464097768
print(f'betas = {model.coef_}')        # prints: betas = [0.05444896 0.10717457]

# Predict Sales for the training data
model.predict(X)


Now that we have fit a multiple linear regression model to our data, we can predict sales from any combination of TV and Radio advertising costs. For example, suppose we want to know how many sales we would make if we invested $600 in TV advertising and $300 in Radio advertising. We can find out with:

new_X = [[600, 300]]
print(model.predict(new_X))
# Output: [69.4526273]

The output is 69.45, which means that if we invest $600 in TV and $300 in Radio advertising, we can expect to sell approximately 69 units.
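
The same number can be reproduced by hand from the model equation Sales = α + β₁*TV + β₂*Radio, using the intercept and coefficients printed earlier. The variable names below are illustrative.

```python
# Intercept and coefficients as printed by the fitted model above
alpha = 4.630879464097768
betas = [0.05444896, 0.10717457]

tv_spend, radio_spend = 600, 300

# Sales = alpha + beta_1 * TV + beta_2 * Radio
predicted_sales = alpha + betas[0] * tv_spend + betas[1] * radio_spend

print(round(predicted_sales, 2))  # 69.45
```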

Summary

Let us sum up what we have covered in this article:

  • How to understand a regression problem
  • What is linear regression and how it works
  • The Ordinary Least Squares method and Regularization
  • Implementing Linear Regression in Python using the statsmodels and scikit-learn libraries

We have discussed a couple of ways to implement linear regression and build efficient models for certain business problems. If you are inspired by the opportunities provided by machine learning, enrol in our Data Science and Machine Learning Courses for more lucrative career options in this landscape.


Priyankur Sarkar

Data Science Enthusiast

Priyankur Sarkar loves to play with data, draw insightful results from it, and then turn those insights into business growth. He is an electronics engineer with versatile experience as an individual contributor and as a team leader, and has actively worked towards building Machine Learning capabilities for organizations.


Suggested Blogs

Essential Skills to Become a Data Scientist

The demand for Data Science professionals is now at an all-time high. There are companies in virtually every industry looking to extract the most value from the heaps of information generated on a daily basis.With the trend for Data Science catching up like never before, organizations are making complete use of their internal data assets to further examine the integration of hundreds of third-party data sources. What is crucial here is the role of the data scientists.Not very long back, the teams playing the key role of working on the data always found their places in the back rooms of multifold IT organizations. The teams though sitting on the backseat would help in steering the various corporate systems with the required data that acted as the fuel to keep the activities running. The critical database tasks performed by the teams responsible allowed corporate executives to report on operations activities and deliver financial results.When you take up a career in Data Science, your previous experience or skills do not matter. As a matter of fact, you would need a whole new range of skills to pursue a career in Data Science. Below are the skills required to become a top dog in Data Science.What should Data Scientists knowData scientists are expected to have knowledge and expertise in the following domains:The areas arch over dozens of languages, frameworks, and technologies that data scientists need to learn. Data scientists should always have the curiosity to amass more knowledge in their domain so that they stay relevant in this dynamic field.The world of Data Science demands certain important attributes and skills, according to IT leaders, industry analysts, data scientists, and others.How to become a Data Scientist?A majority of Data scientists already have a Master’s degree. If Master’s degree does not quench their thirst for more degrees, some even go on to acquire PhD degrees. Mind you, there are exceptions too. 
It isn’t mandatory that you should be an expert in a particular subject to become a Data Scientist. You could become one even with a qualification in Computer Science, Physical Sciences, Natural Sciences, Statistics or even Social Sciences. However, a degree in Mathematics and Statistics is always an added benefit for enhanced understanding of the concepts.Qualifying with a degree is not the end of the requirements. Brush up your skills by taking online lessons in a special skill set of your choice — get certified on how to use Hadoop, Big Data or R. You can also choose to enroll yourself for a Postgraduate degree in the field of Data Science, Mathematics or any other related field.Remember, learning does not end with earning a degree or certification. You need to practice what you learned — blog and share your knowledge, build an app and explore other avenues and applications of data.The Data Scientists of the modern world have a major role to play in businesses across the globe. They have the ability to extract useful insights from vast amounts of raw data using sophisticated techniques. The business acumen of the Data Scientists help a big deal in predicting what lies ahead for enterprises. The models that the Data Scientists create also bring out measures to mitigate potential threats if any.Take up organizational challenges with ABCDE skillsetAs a Data Scientist, you may have to face challenges while working on projects and finding solutions to problems.A = AnalyticsIf you are a Data Scientist, you are expected not just to study the data and identify the right tools and techniques; you need to have your answers ready to all the questions that come across while you are strategizing on working on a solution with or without a business model.B = Business AcumenOrganizations vouch for candidates with strong business acumen. As a Data Scientist, you are expected to showcase your skills in a way that will make the organization stand one step ahead of the competition. 
Undertaking a project and working on it is not the end of the path scaled by you. You need to understand and be able to make others understand how your business models influence business outcomes and how the outcomes will prove beneficial to the organization.C = CodingAnd a Data Scientist is expected to be adept at coding too. You may encounter technical issues where you need to sit and work on codes. If you know how to code, it will make you further versatile in confidently assisting your team.D = DomainThe world does not expect Data Scientists to be perfect with knowledge of all domains. However, it is always assumed that a Data Scientist has know-how of various industrial operations. Reading helps as a plus point. You can gain knowledge in various domains by reading the resources online.E = ExplainTo be a successful Data Scientist, you should be able to explain the problem you are faced with to figure out a solution to the problem and share it with the relevant stakeholders. You need to create a difference in the way you explain without leaving any communication gaps.The Important Skills for a Data ScientistLet us now understand the important skills to become an expert Data Scientist – all the skills that go in, to become one. The skills are as follows:Critical thinkingCodingMathML, DL, AICommunication1. Critical thinkingData scientists need to keep their brains racing with critical thinking. They should be able to apply the objective analysis of facts when faced with a complex problem. Upon reaching a logical analysis, a data scientist should formulate opinions or render judgments.Data scientists are counted upon for their understanding of complex business problems and the risks involved with decision-making. Before they plunge into the process of analysis and decision-making, data scientists are required to come up with a 'model' or 'abstract' on what is critical to coming up with the solution to a problem. 
Data scientists should be able to determine the factors that are extraneous and can be ignored while churning out a solution to a complex business problem.According to Jeffry Nimeroff, CIO at Zeta Global, which provides a cloud-based marketing platform – A data scientist needs to have experience but also have the ability to suspend belief...Before arriving at a solution, it is very important for a Data Scientist to be very clear on what is being expected and if the expected solution can be arrived at. It is only with experience that your intuition works stronger. Experience brings in benefits.If you are a novice and a problem is posed in front of you; all that the one who put the problem in front of you would get is a wide-eyed expression, perhaps. Instead, if you have hands-on experience of working with complex problems no matter what, you will step back, look behind at your experience, draw some inference from multiple points of view and try assessing the problem that is put forth.In simple steps, critical thinking involves the following steps:a. Describe the problem posed in front of you.b. Analyse the arguments involved – The IFs and BUTs.c. Evaluate the significance of the decisions being made and the successes or failures thereafter.2. CodingHandling a complex task might at times call for the execution of a chain of programming tasks. So, if you are a data scientist, you should know how to go about writing code. It does not stop at just writing the code; the code should be executable and should be crucial in helping you find a solution to a complex business problem.In the present scenario, Data Scientists are more inclined towards learning and becoming an expert with Python as the language of choice. There is a substantial crowd following R as well. 
Scala, Clojure, Java and Octave are a few other languages that find prominence too.Consider the following aspects to be a successful Data Scientist that can dab with programming skills –a) You need to deal with humongous volumes of data.b) Working with real-time data should be like a cakewalk for you.c) You need to hop around cloud computing and work your way with statistical models like the ones shown below:Different Statistical ModelsRegressionOptimizationClusteringDecision treesRandom forestsData scientists are expected to understand and have the ability to code in a bundle of languages – Python, C++ or Java.Gaining the knack to code helps Data Scientists; however, this is not the end requirement. A Data Scientist can always be surrounded by people who code.3. MathIf you have never liked Mathematics as a subject or are not proficient in Mathematics, Data Science is probably not the right career choice for you.You might own an organization or you might even be representing it; the fact is while you engage with your clients, you might have to look into many disparate issues. To deal with the issues that lay in front of you, you will be required to develop complex financial or operational models. To finally be able to build a worthy model, you will end up pulling chunks from large volumes of data. This is where Mathematics helps you.If you have the expertise in Mathematics, building statistical models is easier. Statistical models further help in developing or switching over to key business strategies. With skills in both Mathematics and Statistics, you can get moving in the world of Data Science. Spell the mantra of Mathematics and Statistics onto your lamp of Data Science, lo and behold you can be the genie giving way to the best solutions to the most complex problems.4. 
Machine learning, Deep Learning, AIData Science overlaps with the fields of Machine Learning, Deep Learning and AI.There is an increase in the way we work with computers, we now have enhanced connectivity; a large amount of data is being collected and industries make use of this data and are moving extremely fast.AI and deep learning may not show up in the requirements of job postings; yet, if you have AI and deep learning skills, you end up eating the big pie.A data scientist needs to be hawk-eyed and alert to the changes in the curve while research is in progress to come up with the best methodology to a problem. Coming up with a model might not be the end. A Data Scientist must be clear as to when to apply which practice to solve a problem without making it more complex.Data scientists need to understand the depth of problems before finding solutions. A data scientist need not go elsewhere to study the problems; all that is there in the data fetched is what is needed to bring out the best solution.A data scientist should be aware of the computational costs involved in building an environment and the following system boundary conditions:a. Interpretabilityb. Latencyc. BandwidthStudying a customer can act as a major plus point for both a data scientist and an orgaStudying nization… This helps in understanding what technology to apply.No matter how generations advance with the use of automated tools and open source is readily available, statistical skills are considered the much-needed add-ons for a data scientist.Understanding statistics is not an easy job; a data scientist needs to be competent to comprehend the assumptions made by the various tools and software.Experts have put forth a few important requisites for data scientists to make the best use of their models:Data scientists need to be handy with proper data interpretation techniques and ought to understand –a. the various functional interfaces to the machine learning algorithmsb. 
the statistics within the methodsIf you are a data scientist, try dabbing your profile with colours of computer science skills. You must be proficient in working with the keyboard and have a sound knowledge of fundamentals in software engineering.5. CommunicationCommunication and technology show a cycle of operations wherein, there is an integration between people, applications, systems, and data. Data science does not stand separate in this. Working with Data Science is no different. As a Data Scientist, you should be able to communicate with various stakeholders. Data plays a key attribute in the wheel of communication.Communication in Data Science ropes in the ‘storytelling’ ability. This helps you translate a solution you have arrived at into action or intervention that you have put in the pipeline. As a Data Scientist, you should be adept at knitting with the data you have extracted and communicated it clearly to your stakeholders.What does a data scientist communicate to the stakeholders?The benefits of dataThe technology and the computational costs involved in the process of extracting and making use of the dataThe challenges posed in the form of data quality, privacy, and confidentialityA Data Scientist also needs to keep an eye on the wide horizons for better prospects. The organization can be shown a map highlighting other areas of interest that can prove beneficial.If you are a Data Scientist with different feathers in your cap, one being that of a good communicator, you should be able to change a complex form of technical information to a simple and compact form before you present it to the various stakeholders. The information should highlight the challenges, the details of the data, the criteria for success and the anticipated results.If you want to excel in the field of Data Science, you must have an inquisitive bent of mind. The more you ask questions, the more information you gather, the easier it is to come up with paramount business models.6. 
Data architectureLet us draw some inference from the construction of a building and the role of an architect. Architects have the most knowledge of how the different blocks of buildings can go together and how the different pillars for a block make a strong support system. Like how architects manage and coordinate the entire construction process, so do the Data Scientists while building business models.A Data Scientist needs to understand all that happens to the data from the inception level to when it becomes a model and further until a decision is made based on the model.Not understanding the data architecture can have a tremendous impact on the assumptions made in the process and the decisions arrived at. If a Data Scientist is not familiar with the data architecture, it may lead to the organization taking wrong decisions leading to unexpected and unfavourable results.A slight change within the architecture might lead to situations getting worse for all the involved stakeholders.7. Risk analysis, process improvement, systems engineeringA Data Scientist with sharp business acumen should have the ability to analyse business risks, suggest improvements if any and facilitate further changes in various business processes. As a Data Scientist, you should understand how systems engineering works.If you want to be a Data Scientist and have sharp risk analysis, process improvement and systems engineering skills, you can set yourself for a smooth sail in this vast sea of Data Science.And, rememberYou will no more be a Data Scientist if you stop following scientific theories… After all, Data Science in itself is a major breakthrough in the field of Science.It is always recommended to analyse all the risks that may confront a business before embarking on a journey of model development. This helps in mitigating risks that an organization may have to encounter later. 
For a smooth business flow, a Data Scientist should also have the nature to probe into the strategies of the various stakeholders and the problems encountered by customers.A Data Scientist should be able to get the picture of the prevailing risks or the various systems that can have a whopping impact on the data or if a model can lead to positive fruition in the form of customer satisfaction.8. Problem-solving and strong business acumenData scientists are not very different when compared to the commoners. We can say this on the lines of problem-solving. The problem solving traits are inherent in every human being. What makes a data scientist stand apart is very good problem-solving skills. We come across complex problems even in everyday situations. How we differ in solving problems is in the perspectives that we apply. Understanding and analyzing before moving on to actually solving the problems by pulling out all the tools in practice is what Data Scientists are good at.The approach that a Data Scientist takes to solve a problem reaps more success than failure. With their approach, they bring critical thinking to the forefront.  Finding a Data Scientist with skill sets at variance is a problem faced by most of the employers.Technical Skills for a Data ScientistWhen the employers are on a hunt to trap the best, they look out for specialization in languages, libraries, and expertise in tech tools. If a candidate comes in with experience, it helps in boosting the profile.Let us see some very important technical skills:PythonRSQLHadoop/Apache SparkJava/SASTableauLet us briefly understand how these languages are in demand.PythonPython is one of the most in-demand languages. This has gained immense popularity as an open-source language. It is widely used both by beginners and experts. Data Scientists need to have Python as one of the primary languages in their kit.RR is altogether a new programming language for statisticians. 
Anyone with a mathematical bent of mind can learn it. Nevertheless, if you do not appreciate the nuances of Mathematics then it’s difficult to understand R. This never means that you cannot learn it, but without having that mathematical creativity, you cannot harness the power of it.SQLStructured Query Language or SQL is also highly in demand. The language helps in interacting with relational databases. Though it is not of much prominence yet, with a know-how in SQL you can gain a stand in the job market.Hadoop & SparkBoth Hadoop and Spark are open source tools from Apache for big data.Apache Hadoop is an open source software platform. Apache Hadoop helps when you have large data sets on computer clusters built from commodity hardware and you find it difficult to store and process the data sets.Apache Spark is a lightning-fast cluster computing and data processing engine designed for fast computation. It comes with a bunch of development APIs. It supports data workers with efficient execution of streaming, machine learning or SQL workloads.Java & SASWe also have Java and SAS joining the league of languages. These are in-demand languages by large players. Employers offer whopping packages to candidates with expertise in Java and SAS.TableauTableau joins the list as an analytics platform and visualization tool. The tool is powerful and user-friendly. The public version of the tool is available for free. If you wish to keep your data private, you have to consider the costs involved too.Easy tips for a Data ScientistLet us see the in-demand skill set for a Data Scientist in brief.a. A Data Scientist should have the acumen to handle data processing and go about setting models that will help various business processes.b. A Data Scientist should understand the depth of a business problem and the structure of the data that will be used in the process of solving it.c. 
A Data Scientist should always be ready with an explanation on how the created business models work; even the minute details count.A majority of the crowd out there is good at Maths, Statistics, Engineering or other related subjects. However, when interviewed, they may not show the required traits and when recruited may fail to shine in their performance levels. Sometimes the recruitment process to hire a Data Scientist gets so tedious that employers end up searching with lanterns even in broad daylight. Further, the graphical representation below shows some smart tips for smart Data Scientists.Smart tips for a Data ScientistWhat employers seek the most from Data Scientists?Let us now throw some light into what employers seek the most from Data Scientists:a. A strong sense of analysisb. Machine learning is at the core of what is sought from Data Scientists.c. A Data Scientist should infer and refer to data that has been in practice and will be in practice.d. Data Scientists are expected to be adept at Machine Learning and create models predicting the performance on the basis of demand.e. And, a big NOD to a combo skill set of statistics, Computer Science and Mathematics.Following screenshot shows the requirements of a topnotch employer from a Data Scientist. The requirements were posted on a jobs’ listing website.Let us do a sneak peek into the same job-listing website and see the skills in demand for a Data Scientist.ExampleRecommendations for a Data ScientistWhat are some general recommendations for Data Scientists in the present scenario? Let us walk you through a few.Exhibit your demonstration skills with data analysis and aim to become learned at Machine Learning.Focus on your communication skills. You would have a tough time in your career if you cannot show what you have and cannot communicate what you know. Experts have recommended reading Made to Stick for far-reaching impact of the ideas that you generate.Gain proficiency in deep learning. 
You must be familiar with the usage, interest, and popularity of deep learning framework.If you are wearing the hat of a Python expert, you must also have the know-how of common python data science libraries – numpy, pandas, matplotlib, and scikit-learn.ConclusionData Science is all about contributing more data to the technologically advanced world. Make your online presence a worthy one; learn while you earn.Start by browsing through online portals. If you are a professional, make your mark on LinkedIn. Securing a job through LinkedIn is now easier than scouring through job sites.Demonstrate all the skills that you are good at on the social portals you are associated with. Suppose you write an article on LinkedIn, do not refrain from sharing the link to the article on your Facebook account.Most important of all – when faced with a complex situation, understand why and what led to the problem. A deeper understanding of a problem will help you come up with the best model. The more you empathize with a situation, the more will be your success count. And in no time, you can become that extraordinary whiz in Data Science.Wishing you immense success if you happen to choose or have already chosen Data Science as the path for your career.All the best for your career endeavour!
Rated 4.5/5 based on 1 customer reviews
9136
Essential Skills to Become a Data Scientist

The demand for Data Science professionals is now a... Read More

Boosting and AdaBoost in Machine Learning

Ensemble learning is a strategy in which a group of models are used to find a solution to a challenging problem, by using a strategy and combining diverse machine learning models into one single predictive model.In general, ensemble methods are mainly used for improving the overall performance accuracy of a model and combine several different models, also known as the base learners, to predict the results, instead of using a single model.In one of the articles related to ensemble learning, we have already discussed about the popular ensemble method, Bootstrap Aggregation. Bagging tries to implement similar learners on small sample populations and then takes a mean of all the predictions. It combines Bootstrapping and Aggregation to form one ensemble model. It basically reduces the variance error and helps to avoid overfitting. In this article we will look into the limitations of bagging and how a boosting algorithm can be used to overcome those limitations. We will also learn about various types of boosting algorithms and implement one of them in Python. Let’s get started.What are the limitations of Bagging?Let us recall the concept of bagging and consider a binary classification problem. We are either classifying an observation as 0 or as 1.In bagging, T bootstrap samples are selected, a classifier is fitted on each of these samples, and the models are trained in parallel. In a Random Forest, decision trees are trained in parallel. Then the results of all classifiers are averaged into a bagging classifier:Formula for a Bagging ClassifierLet us consider 3 classifiers and the result for the classification can either be right or wrong. If we plot the results of the 3 classifiers, there are regions in which the classifiers will be wrong. These regions are represented in red in the figure below.Example case in which Bagging works wellThe above example works pretty well as when one classifier is wrong, the two others are correct. 
By voting classifier, you can achieve a better accuracy. However, there are cases where Bagging does not work properly, when all classifiers are mistaken to be in the same region.Due to this reason, the intuition behind the discovery of Boosting was the following :instead of training parallel models, one should train models sequentiallyeach model should focus on where the performance of the previous classifier was poorWith this intuition, Boosting algorithm was introduced. Let us understand what Boosting is all about.What is Boosting?Boosting is an ensemble modeling technique which attempts to build a strong classifier from the number of weak classifiers. It is done by building a model using weak models in series. First, a model is built from the training data. Then the second model is built which tries to correct the errors present in the first model. This procedure is continued and models are added until either the complete training data set is predicted correctly or the maximum number of models are added.Boosting being a sequential process, each subsequent model attempts to correct the errors of the previous model. It is focused on reducing the bias unlike bagging. It makes the boosting algorithms prone to overfitting. To avoid overfitting, parameter tuning plays an important role in boosting algorithms, which will be discussed in the later part of this article. Some examples of boosting are XGBoost, GBM, ADABOOST etc..How can boosting identify weak learners?To find weak learners, we apply base learning (ML) algorithms with a different distribution. As each time base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process. 
After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.

How do we choose a different distribution for each round?

Step 1: The base learner takes the whole distribution and assigns equal weight or attention to each observation.
Step 2: If the first base learning algorithm makes prediction errors, we pay more attention to the observations with prediction errors. Then, we apply the next base learning algorithm.
Step 3: Repeat Step 2 until the limit on the number of base learners is reached or a higher accuracy is achieved.

Finally, boosting combines the outputs of the weak learners into a strong learner, which improves the prediction power of the model. Boosting gives more focus to examples that are misclassified or have higher errors under the preceding weak rules.

How would you classify an email as SPAM or not?

Our initial approach would be to identify ‘SPAM’ and ‘NOT SPAM’ emails using the following criteria:
If the email has only one image file (promotional image), it is SPAM.
If the email has only link(s), it is SPAM.
If the email body contains sentences like “You won a prize money of $ xxxxxx”, it is SPAM.
If the email is from our official domain “www.knowledgehut.com”, it is NOT SPAM.
If the email is from a known source, it is NOT SPAM.

Individually, these rules are not powerful enough to classify an email as ‘SPAM’ or ‘NOT SPAM’. Therefore, these rules are called weak learners.

To convert weak learners into a strong learner, we combine the prediction of each weak learner using methods like:
taking the average or weighted average;
taking the prediction with the most votes.

Example: Above, we have defined 5 weak learners. Suppose that 3 of these 5 vote ‘SPAM’ and 2 vote ‘NOT SPAM’. In this case, we classify the email as SPAM because ‘SPAM’ has the higher number of votes (3).

Boosting helps in training a series of low-performing algorithms, called weak learners, simply by adjusting the error metric over time.
Weak learners are considered to be those algorithms whose error rate is slightly under 50%, that is, only slightly better than random guessing on a binary problem.

Weighted errors

Let us consider data points on a 2D plot. Some of the data points will be well classified, others won’t. By default, the weight attributed to each error when computing the error rate is 1/n, where n is the number of data points to classify.

Now suppose we apply different weights to the errors. We give more weight to the data points that are not well classified. An illustration of the weighting process is shown below:

[Figure: Example of weighting process]

In the end, we want to build a strong classifier that may look like the figure below:

[Figure: Strong Classifier]

Tree stumps

You might wonder how many classifiers one should implement to ensure the ensemble works well, and how each classifier is chosen at each step.

A tree stump is a 1-level decision tree. At each step, we need to find the best stump, i.e. the best data split, which minimizes the overall error. You can see a stump as a test, in which the assumption is that everything that lies on one side belongs to class 1, and everything that lies on the other side belongs to class 0.

Many such combinations are possible for a tree stump. Let us look at an example to understand how many combinations we face.

[Figure: 3 data points to split]

There are 12 possible combinations. Let us check how.

[Figure: 12 Stumps]

There are 12 possible “tests” we could make. The “2” on the side of each separating line simply represents the fact that all points on one side could belong to class 0, or to class 1. Therefore, there are 2 tests embedded in each line.

At each iteration t, we choose h_t, the weak classifier that splits the data best, i.e. the one that reduces the overall error rate the most.
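The stump selection just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used by any library: the helper name `best_stump`, the toy data and the `polarity` convention (the class predicted on the "less than or equal" side of the threshold) are all assumptions made for this example. It tries every feature, every threshold and both orientations, and keeps the split with the smallest weighted error.

```python
import numpy as np

def best_stump(X, y, w):
    """Hypothetical helper: exhaustive search for the best decision stump.

    X: (n, d) feature array, y: labels in {0, 1},
    w: non-negative sample weights that sum to 1.
    Returns (feature index, threshold, polarity, weighted error).
    """
    n_samples, n_features = X.shape
    best = (None, None, None, np.inf)
    for j in range(n_features):
        for thr in np.unique(X[:, j]):
            # Each separating line embeds two tests, one per orientation
            for polarity in (0, 1):
                # polarity is the class predicted where X[:, j] <= thr
                pred = np.where(X[:, j] <= thr, polarity, 1 - polarity)
                # Weighted error: total weight of the misclassified points
                err = np.sum(w[pred != y])
                if err < best[3]:
                    best = (j, thr, polarity, err)
    return best

# Toy data (assumed for illustration): one feature, perfectly separable
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
w = np.full(4, 0.25)  # equal weights, i.e. 1/n each

feature, threshold, polarity, error = best_stump(X, y, w)
print(feature, threshold, polarity, error)
```

With equal weights this is the ordinary error rate. In boosting, `w` changes from round to round, so the same search can return a different stump at each iteration.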
Recall that this error rate is the weighted error rate introduced above, not the plain one.

Finding the best split

The best split is found by identifying, at each iteration t, the best weak classifier h_t, generally a decision tree with 1 node and 2 leaves (a stump). Let us consider the example of a credit defaulter, i.e. whether a person who borrowed money will repay it or not.

[Figure: Identifying the best split]

In this case, the best split at time t is to stump on the payment history, since the weighted error resulting from this split is the minimum.

Note that decision tree classifiers like these can in practice be deeper than a simple stump. The depth is then a hyper-parameter.

Combining classifiers

In the next step we combine the classifiers into a sign classifier: depending on which side of the frontier a point lies, it is classified as 0 or 1. It can be achieved by:

[Figure: Combining classifiers]

You can improve the classifier by adding a weight to each classifier, to avoid giving the same importance to the different classifiers.

AdaBoost

[Figure: Pseudo-code]

The key elements to keep in mind are:
Z is a constant whose role is to normalize the weights so that they add up to 1;
α_t is a weight that we apply to each classifier.

This algorithm is called AdaBoost or Adaptive Boosting. It is one of the most important algorithms among all boosting methods.

Computation

Boosting algorithms are generally fast to train, even though we consider every possible stump and compute exponentials recursively.

If we choose α_t and Z properly, the weights that are supposed to change at each step simplify to:

[Figure: Weights after choice of α and Z]

Types of Boosting Algorithms

The underlying engine used for boosting algorithms can be anything. It can be a decision stump, a margin-maximizing classification algorithm, etc.
There are many boosting algorithms which use other types of engines, such as:
AdaBoost (Adaptive Boosting)
Gradient Tree Boosting
XGBoost

In this article, we will focus on AdaBoost and Gradient Boosting, followed by their respective Python code, and say a little about XGBoost.

Where are Boosted algorithms required?

Boosted algorithms are mainly used when there is plenty of data to make a prediction and high predictive power is expected. Boosting is used to reduce bias, and also variance, in supervised learning. It combines multiple weak predictors to build a strong predictor.

The underlying engine used for boosting algorithms can be anything. For instance, AdaBoost is boosting done on decision stumps. There are many other boosting algorithms which use other types of engine, such as:
GentleBoost
Gradient Boosting
LPBoost
BrownBoost

Adaptive Boosting

Adaptive Boosting, most commonly known as AdaBoost, is a Boosting algorithm in which each new predictor corrects its predecessor. It pays more attention to the training instances that the previous model underfitted. Thus, every new predictor focuses more on the difficult cases than on the others.

It fits a sequence of weak learners on differently weighted training data. It starts by predicting the original data set, giving equal weight to each observation. If the first learner predicts an observation incorrectly, that observation is given a higher weight. Being an iterative process, it continues to add learners until a limit is reached in the number of models or in accuracy.

Mostly, AdaBoost uses decision stumps. However, we can use any machine learning algorithm as the base learner if it accepts weights on the training data set. We can use AdaBoost for both classification and regression problems.

Let us consider the example of the image mentioned above.
In order to build an AdaBoost classifier, consider that a first base classifier, a Decision Tree, is trained to make predictions on our training data. Following the AdaBoost methodology, the weight of the misclassified training instances is then increased. The second classifier is trained using the updated weights, and the procedure is repeated over and over again.

At the end of every model prediction we end up boosting the weights of the misclassified instances so that the next model does a better job on them, and so on.

This sequential learning technique might sound similar to Gradient Descent, except that instead of tweaking a single predictor’s parameters to minimize the cost function, AdaBoost adds predictors to the ensemble, gradually making it better.

One disadvantage of this algorithm is that training cannot be parallelized, since each predictor can only be trained after the previous one has been trained and evaluated.

Below are the steps for performing the AdaBoost algorithm:
Initially, all observations are given equal weights.
A model is built on a subset of data.
Using this model, predictions are made on the whole dataset.
Errors are calculated by comparing the predictions and actual values.
While creating the next model, higher weights are given to the data points which were predicted incorrectly.
Weights can be determined using the error value. For instance, the higher the error, the more weight is assigned to the observation.
This process is repeated until the error function does not change, or the maximum limit of the number of estimators is reached.

Hyperparameters

base_estimator: specifies the base estimator type, i.e. the algorithm to be used as base learner. Recent scikit-learn releases name this parameter estimator.
n_estimators: defines the number of base estimators. The default is 50 in scikit-learn's AdaBoostClassifier, and you can increase it to try to obtain better performance.
learning_rate: shrinks the contribution of each base estimator, playing a role similar to the step size in gradient descent.
max_depth: maximum depth of the individual estimator. It is set on the base estimator (for example a decision tree), not on AdaBoost itself.
n_jobs: indicates how many processors the system is allowed to use, where a value of -1 means there is no limit. AdaBoost trains sequentially and does not take this parameter in scikit-learn, but helpers such as cross_val_score do.
random_state: makes the model’s output replicable. The model will always produce the same results when you give it a fixed value as well as the same parameters and training data.

Now, let us take a quick look at how to use AdaBoost in Python using a simple example on handwritten digit recognition.

import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_digits

Let us load the data:

dataset = load_digits()
X = dataset['data']
y = dataset['target']

X contains arrays of length 64 which are simply flattened 8x8 images. The aim of this dataset is to recognize handwritten digits. Let’s take a look at a given handwritten digit:

plt.imshow(X[4].reshape(8,8))

If we stick to a Decision Tree Classifier of depth 1 (a stump), here’s how to implement an AdaBoost classifier:

reg_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1))
scores_ada = cross_val_score(reg_ada, X, y, cv=6)
scores_ada.mean()

0.2636257855582272

This should yield an accuracy of around 26%, which can be improved substantially. The exact value can vary with the scikit-learn version and random seed. One of the key parameters is the depth of the sequential decision tree classifiers.
How does accuracy improve with the depth of the decision trees?

score = []
for depth in [1, 2, 10]:
    reg_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=depth))
    scores_ada = cross_val_score(reg_ada, X, y, cv=6)
    score.append(scores_ada.mean())
score

[0.2636257855582272, 0.5902852679072207, 0.9527524912410157]

The maximal score is reached for a depth of 10 in this simple example, with an accuracy of 95.3%.

Gradient Boosting

This is another very popular Boosting algorithm which works similarly to what we’ve seen for AdaBoost. Gradient Boosting works by sequentially adding predictors to the ensemble, each one correcting the errors made by its predecessors.

The difference lies in what it does with the underfitted values of its predecessor. Contrary to AdaBoost, which tweaks the instance weights at every iteration, this method tries to fit the new predictor to the residual errors made by the previous predictor.

To understand Gradient Boosting, it is important to understand Gradient Descent first.

Below are the steps for performing the Gradient Boosting algorithm:
A model is built on a subset of data.
Using this model, predictions are made on the whole dataset.
Errors are calculated by comparing the predictions and actual values.
A new model is created using the calculated errors as the target variable. Our objective is to find the best split to minimize the error.
The predictions made by this new model are combined with the predictions of the previous one.
New errors are calculated using this predicted value and the actual value.
This process is repeated until the error function does not change, or the maximum limit of the number of estimators is reached.

Hyperparameters

n_estimators: controls the number of weak learners.
learning_rate: controls the contribution of weak learners in the final combination. There is a trade-off between learning_rate and n_estimators.
min_samples_split: minimum number of observations required in a node for it to be considered for splitting. It is used to control overfitting.
min_samples_leaf: minimum number of samples required in a terminal or leaf node. Lower values should be chosen for imbalanced class problems, since the regions in which the minority class is in the majority will be very small.
min_weight_fraction_leaf: similar to the previous one, but defined as a fraction of the total number of observations instead of an integer.
max_depth: maximum depth of a tree. Used to control overfitting.
max_leaf_nodes: maximum number of terminal leaves in a tree. The original scikit-learn documentation stated that max_depth is ignored when this is defined; newer versions may apply both limits, so check the documentation for your version.
max_features: number of features to consider while searching for the best split.

You can also tune the loss function for better performance.

Implementation in Python

You can find Gradient Boosting in the Scikit-Learn library.

# for regression
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(n_estimators=3, learning_rate=1)
model.fit(X, y)  # X: feature matrix, y: continuous target

# for classification
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(X, y)  # X: feature matrix, y: class labels

XGBoost

XGBoost, or Extreme Gradient Boosting, is an advanced implementation of Gradient Boosting. This algorithm has high predictive power and is often claimed to be around ten times faster than other gradient boosting implementations.
Moreover, it includes a variety of regularization techniques which reduce overfitting and improve overall performance.

Advantages

It implements regularization, which helps in reducing overfitting (standard Gradient Boosting does not);
It implements parallel processing, which makes it much faster than Gradient Boosting;
It allows users to define custom optimization objectives and evaluation criteria, adding a whole new dimension to the model;
It has an in-built routine to handle missing values;
It makes splits up to the max_depth specified and then starts pruning the tree backwards, removing splits beyond which there is no positive gain;
It allows a user to run a cross-validation at each iteration of the boosting process, so it is easy to get the optimum number of boosting iterations in a single run.

Boosting algorithms represent a different machine learning perspective: turning a weak model into a stronger one to fix its weaknesses. I hope this article helped you understand how boosting works.

We have covered most of the topics related to algorithms in our series of machine learning blogs. If you are inspired by the opportunities provided by machine learning, enroll in our Data Science and Machine Learning Courses for more lucrative career options in this landscape.
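As a closing illustration, the Gradient Boosting steps listed earlier in this article (fit a model, compute the errors, fit the next model to those errors, combine) can be written out by hand for a regression problem. The sketch below is a simplified illustration under stated assumptions, not the way scikit-learn or XGBoost implement it: the synthetic sine data, the tree depth of 2, the learning rate of 0.1 and the 100 rounds are all arbitrary choices made for this example, and squared error is assumed as the loss so that the errors are plain residuals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumed for illustration): a noisy sine curve
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1   # contribution of each weak learner
n_estimators = 100    # number of boosting rounds

# Start from a constant prediction: the mean of the target
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_estimators):
    # Errors of the current ensemble become the next model's target
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    # Combine the new model's output with the previous predictions
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"MSE before boosting: {initial_mse:.4f}, after: {final_mse:.4f}")
```

The training error shrinks with each round here because every tree is fitted to what the ensemble still gets wrong. On unseen data, too many rounds or too large a learning rate can overfit, which is where the trade-off between learning_rate and n_estimators comes in.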

Bagging and Random Forest in Machine Learning

In today’s world, innovations happen on a daily basis, rendering all the previous versions of that product, service or skill-set outdated and obsolete. In such a dynamic and chaotic space, how can we make an informed decision without getting carried away by plain hype? To make the right decisions, we must follow a set of processes; investigate the current scenario, chart down your expectations, collect reviews from others, explore your options, select the best solution after weighing the pros and cons, make a decision and take the requisite action. For example, if you are looking to purchase a computer, will you simply walk up to the store and pick any laptop or notebook? It’s highly unlikely that you would do so. You would probably search on Amazon, browse a few web portals where people have posted their reviews and compare different models, checking for their features, specifications and prices. You will also probably ask your friends and colleagues for their opinion. In short, you would not directly jump to a conclusion, but will instead make a decision considering the opinions and reviews of other people as well. Ensemble models in machine learning also operate on a similar manner. They combine the decisions from multiple models to improve the overall performance. The objective of this article is to introduce the concept of ensemble learning and understand algorithms like bagging and random forest which use a similar technique. What is Ensemble Learning? Ensemble methods aim at improving the predictive performance of a given statistical learning or model fitting technique. The general principle of ensemble methods is to construct a linear combination of some model fitting method, instead of using a single fit of the method. An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. 
Ensemble methods combine several base models, such as decision tree classifiers, to produce better predictive performance than a single decision tree classifier. The main principle behind the ensemble model is that a group of weak learners come together to form a strong learner, thus increasing the accuracy of the model.

When we try to predict the target variable using any machine learning technique, the main causes of difference between actual and predicted values are noise, variance, and bias. Ensembles help to reduce these factors (except noise, which is irreducible error). The noise-related error is mainly due to noise in the training data and can't be removed. However, the errors due to bias and variance can be reduced.

The total error can be expressed as follows:

Total Error = Bias^2 + Variance + Irreducible Error

A measure such as mean square error (MSE) captures all of these errors for a continuous target variable and can be represented as follows:

MSE = E[(Y - fˆ(x))^2]

where E stands for the expected value, Y represents the actual target values and fˆ(x) is the predicted value of the target variable. It can be broken down into its components, bias, variance and noise, as shown in the following formula:

E[(Y - fˆ(x))^2] = (Bias[fˆ(x)])^2 + Var[fˆ(x)] + σ^2

where σ^2 is the variance of the noise, i.e. the irreducible error.

Using techniques like Bagging and Boosting helps to decrease the variance and increase the robustness of the model. Combinations of multiple classifiers decrease variance, especially in the case of unstable classifiers, and may produce a more reliable classification than a single classifier.

Ensemble Algorithm

The goal of ensemble algorithms is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.

There are two families of ensemble methods which are usually distinguished:

Averaging methods. The driving principle is to build several estimators independently and then to average their predictions.
On average, the combined estimator is usually better than any single base estimator because its variance is reduced. Examples: Bagging methods, Forests of randomized trees.

Boosting methods. Base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble. Examples: AdaBoost, Gradient Tree Boosting.

Advantages of Ensemble Algorithm

Ensemble is a proven method for improving the accuracy of the model and works in most cases.
Ensemble makes the model more robust and stable, thus ensuring decent performance on the test cases in most scenarios.
You can use an ensemble to capture linear and simple as well as nonlinear and complex relationships in the data. This can be done by using two different models and forming an ensemble of the two.

Disadvantages of Ensemble Algorithm

Ensemble reduces model interpretability and makes it very difficult to draw crucial business insights at the end.
It is time-consuming and thus might not be the best idea for real-time applications.
The selection of models for creating an ensemble is an art which is really hard to master.

Basic Ensemble Techniques

Max Voting: Max-voting is one of the simplest ways of combining predictions from multiple machine learning algorithms. Each base model makes a prediction and votes for each sample. The class with the highest number of votes becomes the final predicted class. It is mainly used for classification problems.

Averaging: Averaging can be used while estimating the probabilities in classification tasks, but it is usually used for regression problems. Predictions are extracted from multiple models and the average of the predictions is used to make the final prediction.

Weighted Average: Like averaging, weighted averaging is also used for regression tasks. Alternatively, it can be used while estimating probabilities in classification problems.
Base learners are assigned different weights, which represent the importance of each model in the prediction.

Ensemble Methods

Ensemble methods became popular as a relatively simple device to improve the predictive performance of a base procedure. There are different reasons for this: the bagging procedure turns out to be a variance reduction scheme, at least for some base procedures. On the other hand, boosting methods primarily reduce the (model) bias of the base procedure. This already indicates that bagging and boosting are very different ensemble methods. From the perspective of prediction, random forests are about as good as boosting, and often better than bagging.

Bootstrap Aggregation, or Bagging, trains similar learners on small sample populations and then takes a mean of all the predictions.
It combines Bootstrapping and Aggregation to form one ensemble model.
It reduces the variance error and helps to avoid overfitting.

Bagging algorithms include:
Bagging meta-estimator
Random forest

Boosting refers to a family of algorithms which convert weak learners into strong learners. Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. Boosting is focused on reducing the bias, which makes boosting algorithms prone to overfitting. To avoid overfitting, parameter tuning plays an important role in boosting algorithms. Some examples of boosting are mentioned below:
AdaBoost
GBM
XGBoost
LightGBM
CatBoost

Why use ensemble models?

Ensemble models help in improving algorithm accuracy as well as the robustness of a model. Both Bagging and Boosting should be known by data scientists and machine learning engineers, especially those who are planning to attend data science or machine learning interviews.

Ensemble learning can use hundreds to thousands of models of the same algorithm, which then work hand in hand to find the correct classification.
You may also consider the fable of the blind men and the elephant to understand ensemble learning, where each blind man found a feature of the elephant and they all thought it was something different. However, if they would work together and discussed among themselves, they might have figured out what it is. Using techniques like bagging and boosting leads to increased robustness of statistical models and decreased variance. Now the question becomes, between these different “B” words. Which is better? Which is better, Bagging or Boosting? There is no perfectly correct answer to that. It depends on the data, the simulation and the circumstances. Bagging and Boosting decrease the variance of your single estimate as they combine several estimates from different models. So the result may be a model with higher stability. If the problem is that the single model gets a very low performance, Bagging will rarely get a better bias. However, Boosting could generate a combined model with lower errors as it optimizes the advantages and reduces pitfalls of the single model. By contrast, if the difficulty of the single model is overfitting, then Bagging is the best option. Boosting for its part doesn’t help to avoid over-fitting; in fact, this technique is faced with this problem itself. For this reason, Bagging is effective more often than Boosting. In this article we will discuss about Bagging, we will cover Boosting in the next post. But first, let us look into the very important concept of bootstrapping. Bootstrap Sampling Sampling is the process of selecting a subset of observations from the population with the purpose of estimating some parameters about the whole population. Resampling methods, on the other hand, are used to improve the estimates of the population parameters. In machine learning, the bootstrap method refers to random sampling with replacement. This sample is referred to as a resample. 
This allows the model or algorithm to get a better understanding of the various biases, variances and features that exist in the resample. Taking a sample of the data allows the resample to contain different characteristics than the data might have contained as a whole. This is demonstrated in figure 1, where each sample population has different pieces and none are identical. This in turn affects the overall mean, standard deviation and other descriptive metrics of a data set, and so it can help develop more robust models.

Bootstrapping is also great for small data sets that can have a tendency to overfit. In fact, we recommended this to one company who was concerned because their data sets were far from “Big Data”. Bootstrapping can be a solution in this case because algorithms that utilize bootstrapping can be more robust and handle new data sets, depending on the methodology chosen (boosting or bagging).

The reason for using the bootstrap method is that it can test the stability of a solution. Using multiple sample data sets and then testing multiple models can increase robustness. Perhaps one sample data set has a larger mean than another, or a different standard deviation. This might break a model that was overfit and not tested using data sets with different variations.

One of the many reasons bootstrapping has become very common is the increase in computing power, which allows many more permutations to be done with different resamples than previously. Bootstrapping is used in both Bagging and Boosting.

Let us assume we have a sample of ‘n’ values (x) and we’d like to get an estimate of the mean of the sample:

mean(x) = 1/n * sum(x)

Consider a sample of 100 values (x). We can calculate the mean directly from the sample as:

mean(x) = 1/100 * sum(x)

We know that our sample is small and that the mean has an error in it.
We can improve the estimate of our mean using the bootstrap procedure: Create many (e.g. 1000) random sub-samples of the data set with replacement (meaning we can select the same value multiple times). Calculate the mean of each sub-sample Calculate the average of all of our collected means and use that as our estimated mean for the data Example: Suppose we used 3 re-samples and got the mean values 2.3, 4.5 and 3.3. Taking the average of these we could take the estimated mean of the data to be 3.367. This process can be used to estimate other quantities like the standard deviation and even quantities used in machine learning algorithms, like learned coefficients. While using Python, we do not have to implement the bootstrap method manually. The scikit-learn library provides an implementation that creates a single bootstrap sample of a dataset. The resample() scikit-learn function can be used for sampling. It takes as arguments the data array, whether or not to sample with replacement, the size of the sample, and the seed for the pseudorandom number generator used prior to the sampling. For example, let us create a bootstrap that creates a sample with replacement with 4 observations and uses a value of 1 for the pseudorandom number generator. boot = resample(data, replace=True, n_samples=4, random_state=1)As the bootstrap API does not allow to easily gather the out-of-bag observations that could be used as a test set to evaluate a fit model, in the univariate case we can gather the out-of-bag observations using a simple Python list comprehension. 
# out of bag observations
oob = [x for x in data if x not in boot]

Let us look at a small example and execute it.

# scikit-learn bootstrap
from sklearn.utils import resample
# data sample
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
# prepare bootstrap sample
boot = resample(data, replace=True, n_samples=4, random_state=1)
print('Bootstrap Sample: %s' % boot)
# out of bag observations
oob = [x for x in data if x not in boot]
print('OOB Sample: %s' % oob)

The output will include the observations in the bootstrap sample and those observations in the out-of-bag sample.

Bootstrap Sample: [0.6, 0.4, 0.5, 0.1]
OOB Sample: [0.2, 0.3]

Bagging

Bootstrap Aggregation, also known as Bagging, is a powerful ensemble method that was proposed by Leo Breiman in 1994 to prevent overfitting. The concept behind bagging is to combine the predictions of several base learners to create a more accurate output. Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees.

Suppose there are N observations and M features.
A sample of observations is selected randomly with replacement (Bootstrapping).
A subset of features is selected to create a model with the sample of observations and the subset of features.
The feature from the subset which gives the best split on the training data is selected.
This is repeated to create many models, and every model is trained in parallel.
The prediction is given based on the aggregation of predictions from all the models.

This approach can be used with machine learning algorithms that have a high variance, such as decision trees. A separate model is trained on each bootstrap sample of data and the average output of those models is used to make predictions. This technique is called bootstrap aggregation, or bagging for short.
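The bagging procedure described above can be sketched by hand with the same resample() function used for the bootstrap example. This is a minimal sketch under stated assumptions rather than a reference implementation: the iris dataset, the 25 trees, the 75/25 split and the majority vote are choices made for this illustration, and for simplicity only the observations are resampled, not the features.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Example data (assumed for illustration): iris, with a held-out test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

n_models = 25
models = []
for i in range(n_models):
    # Bootstrapping: draw a sample of the training rows with replacement
    X_boot, y_boot = resample(
        X_train, y_train, replace=True, n_samples=len(X_train), random_state=i
    )
    # Train one unpruned (high-variance) tree per bootstrap sample
    models.append(DecisionTreeClassifier(random_state=i).fit(X_boot, y_boot))

# Aggregation: every tree votes with equal weight, the majority class wins
all_preds = np.array([m.predict(X_test) for m in models])  # (n_models, n_test)
bagged_pred = np.array(
    [np.bincount(votes, minlength=3).argmax() for votes in all_preds.T]
)

accuracy = np.mean(bagged_pred == y_test)
print(f"Bagged accuracy on the test set: {accuracy:.3f}")
```

In practice you would more likely use scikit-learn's BaggingClassifier, which wraps the same idea. For a regression problem, the vote would be replaced by the average of the individual predictions.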
Variance means that an algorithm’s performance is sensitive to the training data, with high variance suggesting that the more the training data is changed, the more the performance of the algorithm will vary. The performance of high-variance machine learning algorithms like unpruned decision trees can be improved by training many trees and taking the average of their predictions. Results are often better than a single decision tree.

What Bagging does is help reduce variance from models that might be very accurate, but only on the data they were trained on. This is also known as overfitting.

Overfitting is when a function fits the data too well. Typically this is because the fitted equation is complicated enough to take into account every data point and outlier.

Bagging gets around this by creating its own variance amongst the data, sampling with replacement while it tests multiple hypotheses (models). In turn, this reduces the noise by utilizing multiple samples that would most likely be made up of data with various attributes (median, average, etc).

Once each model has developed a hypothesis, the models use voting for classification or averaging for regression. This is where the “Aggregating” in “Bootstrap Aggregating” comes into play. Each hypothesis has the same weight as all the others. When we later discuss boosting, this is one of the places the two methodologies differ.

Essentially, all these models run at the same time, and vote on which hypothesis is the most accurate. This helps to decrease variance, i.e. reduce the overfit.

Advantages

Bagging takes advantage of ensemble learning, wherein multiple weak learners outperform a single strong learner.
It helps reduce variance and thus helps us avoid overfitting.

Disadvantages

There is loss of interpretability of the model.
There can possibly be a problem of high bias if not modeled properly.
While bagging gives us more accuracy, it is computationally expensive and may not be desirable depending on the use case.

There are many bagging algorithms, of which perhaps the most prominent is Random Forest.

Decision Trees

Decision trees are simple but intuitive models. Using a top-down approach, a root node creates binary splits until a particular criterion is fulfilled. This binary splitting of nodes results in a predicted value on the basis of the interior nodes that lead to the terminal (final) nodes. For a classification problem, a decision tree will output a predicted target class for each terminal node produced. We have covered the decision tree algorithm in detail for both classification and regression in another article.

Limitations to Decision Trees

Decision trees tend to have high variance when they utilize different training and test sets of the same data, since they tend to overfit the training data. This leads to poor performance when new and unseen data is added, which limits the usage of decision trees in predictive modeling. However, using ensemble methods, decision trees can serve as the foundation of models that produce powerful results.

Bootstrap Aggregating Trees

We have already discussed bootstrap aggregating (or bagging). Using it, we can create an ensemble (forest) of trees where multiple training sets are generated with replacement, meaning that a data instance can appear more than once in a training set. Once the training sets are created, a CART model can be trained on each subsample.

Features of Bagged Trees

Bagging reduces variance by averaging the ensemble's results. The resulting model uses the entire feature space when considering node splits. Bagged trees are allowed to grow without pruning, so each tree is deep and has high variance but low bias; averaging across the ensemble then reduces the variance, which can help improve predictive power.

Limitations to Bagging Trees

The main limitation of bagging trees is that it uses the entire feature space when creating splits in the trees.
If some variables within the feature space are strong predictors, most trees will split on them, and there is a risk of having a forest of correlated trees, which limits how much averaging can reduce the variance.

Why a Forest is better than One Tree?

The main objective of a machine learning model is to generalize properly to new and unseen data. When we have a flexible model, overfitting can take place. A flexible model is said to have high variance because the learned parameters (such as the structure of the decision tree) will vary with the training data. On the other hand, an inflexible model is said to have high bias, as it makes assumptions about the training data. An inflexible model may not have the capacity to fit even the training data. In both cases, high variance and high bias, the model is not able to generalize properly to new and unseen data. You can go through the article on one of the foundational concepts in machine learning, the bias-variance tradeoff, which will help you understand the balance between a model so flexible that it memorizes the training data and a model so inflexible that it cannot learn the training data.

The main reason why a decision tree is prone to overfitting when we do not limit the maximum depth is that it has unlimited flexibility: it keeps growing until there is one leaf node for every single observation. Instead of limiting the depth of the tree, which reduces variance but increases bias, we can combine many decision trees into a single ensemble model known as the random forest.

What is Random Forest algorithm?

Random forest is like a bootstrapping algorithm combined with the decision tree (CART) model. Suppose we have 1000 observations in the complete population, with 10 variables. Random forest will try to build multiple CART models with different samples and different initial variables. It will take a random sample of 100 observations and then choose 5 initial variables randomly to build a CART model.
It will go on repeating the process, say about 10 times, and then make a final prediction for each of the observations. The final prediction is a function of each prediction; it can simply be the mean of each prediction.

The random forest is a model made up of many decision trees. Rather than simply averaging the predictions of trees (which we could call a “forest”), this model uses two key concepts that give it the name random:

- Random sampling of training data points when building trees
- Random subsets of features considered when splitting nodes

How the Random Forest Algorithm Works

The basic steps involved in performing the random forest algorithm are mentioned below:

1. Pick N random records from the dataset.
2. Build a decision tree based on these N records.
3. Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
4. In the case of a regression problem, for a new record, each tree in the forest predicts a value for Y (output). The final value can be calculated by taking the average of all the values predicted by all the trees in the forest. In the case of a classification problem, each tree in the forest predicts the category to which the new record belongs. Finally, the new record is assigned to the category that wins the majority vote.

Using Random Forest for Regression

Here we have a problem where we have to predict the gas consumption (in millions of gallons) in 48 US states based on petrol tax (in cents), per capita income (in dollars), paved highways (in miles) and the proportion of the population with a driving license. We will use the random forest algorithm via the Scikit-Learn Python library to solve this regression problem. First we import the necessary libraries and our dataset.
    import pandas as pd
    import numpy as np
    dataset = pd.read_csv('/content/petrol_consumption.csv')
    dataset.head()

       Petrol_tax  Average_income  paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
    0         9.0            3571            1976                         0.525                 541
    1         9.0            4092            1250                         0.572                 524
    2         9.0            3865            1586                         0.580                 561
    3         7.5            4870            2351                         0.529                 414
    4         8.0            4399             431                         0.544                 410

You will notice that the values in our dataset are not very well scaled. Let us scale them before training the algorithm.

Preparing Data For Training

We will perform two tasks in order to prepare the data. Firstly, we will divide the data into 'attributes' and 'label' sets. The result will then be divided into training and test sets.

    X = dataset.iloc[:, 0:4].values
    y = dataset.iloc[:, 4].values

Now let us divide the data into training and testing sets:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Feature Scaling

The dataset is not yet scaled: the Average_income field has values in the range of thousands, while Petrol_tax has values in the range of tens. It will be better if we scale our data. We will use Scikit-Learn's StandardScaler class to do so.

    # Feature Scaling
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)

Training the Algorithm

Now that we have scaled our dataset, let us train the random forest algorithm to solve this regression problem.

    from sklearn.ensemble import RandomForestRegressor
    regressor = RandomForestRegressor(n_estimators=20, random_state=0)
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)

The RandomForestRegressor class is used to solve regression problems via random forest. Its most important parameter is n_estimators, which defines the number of trees in the random forest. Here we start with n_estimators=20 and check the performance of the algorithm.
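To make the role of n_estimators more concrete, the following sketch inspects a fitted forest. It is an illustration under assumptions: the data is synthetic (make_regression stands in for the petrol dataset), and the names X_demo, y_demo, forest and per_tree are hypothetical. It checks that the forest holds one fitted tree per estimator and that the forest's prediction is the mean of the individual tree predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption, not the petrol dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=4, noise=5.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# estimators_ holds the fitted decision trees, one per estimator
n_trees = len(forest.estimators_)

# Collect each tree's predictions, then average across the trees
per_tree = np.array([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_demo)

print('Number of trees:', n_trees)
print('Forest equals mean of trees:',
      np.allclose(manual_average, forest_prediction))
```

The spread of the rows in per_tree also gives a rough feel for how much the individual trees disagree before averaging.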
You can find details for all of the parameters of RandomForestRegressor in the Scikit-Learn documentation.

Evaluating the Algorithm

Let us evaluate the performance of the algorithm. For regression problems, the metrics used to evaluate an algorithm are mean absolute error, mean squared error, and root mean squared error.

    from sklearn import metrics
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

    Mean Absolute Error: 51.76500000000001
    Mean Squared Error: 4216.166749999999
    Root Mean Squared Error: 64.93201637097064

With 20 trees, the root mean squared error is 64.93, which is greater than 10 percent of the average petrol consumption, i.e. 576.77. This may indicate, among other things, that we have not used enough estimators (trees).

Let us now change the number of estimators to 200. The results are as follows:

    Mean Absolute Error: 48.33899999999999
    Mean Squared Error: 3494.2330150000003
    Root Mean Squared Error: 59.112037818028234

Plotting the root mean squared error (RMSE) against the number of estimators shows that the error values decrease as the number of estimators increases. You may consider 200 a good number for n_estimators, as the rate of decrease in error diminishes beyond that point. You may try playing around with other parameters to figure out a better result.

Using Random Forest for Classification

Now let us consider a classification problem: predicting whether a bank currency note is authentic or not based on four attributes, i.e. the variance of the wavelet-transformed image, skewness, entropy, and kurtosis of the image. We will use the Random Forest Classifier to solve this binary classification problem. Let's get started.
    import pandas as pd
    import numpy as np
    dataset = pd.read_csv('/content/bill_authentication.csv')
    dataset.head()

       Variance  Skewness  Kurtosis   Entropy  Class
    0   3.62160    8.6661   -2.8073  -0.44699      0
    1   4.54590    8.1674   -2.4586  -1.46210      0
    2   3.86600   -2.6383    1.9242   0.10645      0
    3   3.45660    9.5228   -4.0112  -3.59440      0
    4   0.32924   -4.4552    4.5718  -0.98880      0

Similar to the data we used previously for the regression problem, this data is not scaled. Let us prepare the data for training.

Preparing Data For Training

The following code divides the data into attributes and labels:

    X = dataset.iloc[:, 0:4].values
    y = dataset.iloc[:, 4].values

The following code divides the data into training and testing sets:

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Feature Scaling

We will do the same thing as we did for the previous problem.

    # Feature Scaling
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)

Training the Algorithm

Now that we have scaled our dataset, let us train the random forest algorithm to solve this classification problem.

    from sklearn.ensemble import RandomForestClassifier
    classifier = RandomForestClassifier(n_estimators=20, random_state=0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)

For classification, we have used the RandomForestClassifier class of the sklearn.ensemble library. It takes n_estimators as a parameter, which defines the number of trees in our random forest. Similar to the regression problem, we have started with 20 trees here. You can find details for all of the parameters of RandomForestClassifier in the Scikit-Learn documentation.
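The following sketch shows how the trees of a fitted classifier are combined. It is an illustration under assumptions: the data is synthetic (make_classification stands in for the banknote dataset), and the names X_demo, y_demo, forest and tree_probas are hypothetical. Note that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities, a soft version of the majority vote described earlier, and then picks the class with the highest mean probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the banknote dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    random_state=0)

forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Class probabilities predicted by each individual tree
tree_probas = np.array(
    [tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average the probabilities across trees, then take the most probable class
mean_proba = tree_probas.mean(axis=0)
manual_labels = forest.classes_[np.argmax(mean_proba, axis=1)]

print('Probabilities match:',
      np.allclose(mean_proba, forest.predict_proba(X_demo)))
print('Labels match:',
      np.array_equal(manual_labels, forest.predict(X_demo)))
```

With fully grown trees most leaves are pure, so each tree's probabilities are close to 0 or 1 and the averaged result behaves much like a plain majority vote.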
Evaluating the Algorithm

For evaluating classification problems, the metrics used are accuracy, confusion matrix, precision, recall, and F1 values.

    from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print(accuracy_score(y_test, y_pred))

The output will look something like this:

    [[155   2]
     [  1 117]]
                  precision  recall  f1-score  support
    0                  0.99    0.99      0.99      157
    1                  0.98    0.99      0.99      118
    accuracy                             0.99      275
    macro avg          0.99    0.99      0.99      275
    weighted avg       0.99    0.99      0.99      275
    0.9890909090909091

The accuracy achieved by our random forest classifier with 20 trees is 98.9%. Let us change the number of trees to 200.

    from sklearn.ensemble import RandomForestClassifier
    classifier = RandomForestClassifier(n_estimators=200, random_state=0)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)

Output:

    [[155   2]
     [  1 117]]
                  precision  recall  f1-score  support
    0                  0.99    0.99      0.99      157
    1                  0.98    0.99      0.99      118
    accuracy                             0.99      275
    macro avg          0.99    0.99      0.99      275
    weighted avg       0.99    0.99      0.99      275
    0.9890909090909091

Unlike the regression problem, changing the number of estimators for this problem did not make any difference in the results. An accuracy of 98.9% is pretty good, and in this case we have seen that there is not much improvement when the number of trees is increased. You may try playing around with other parameters of the RandomForestClassifier class and see if you can improve on our results.

Advantages and Disadvantages of using Random Forest

As with any algorithm, there are advantages and disadvantages to using it. Let us look into the pros and cons of using Random Forest for classification and regression.

Advantages

The random forest algorithm is less affected by the quirks of any single training sample, as there are multiple trees and each tree is trained on a subset of the data.

The random forest algorithm is very stable. Introducing new data into the dataset does not affect it much, as the new data may impact one tree but is unlikely to impact all the trees.
The random forest algorithm works well when you have both categorical and numerical features. It can also perform well when the dataset has missing values.

Disadvantages

A major disadvantage of random forests lies in their complexity. Because a large number of decision trees are joined together, more computational resources are required. Due to this complexity, training time is also longer compared to other algorithms.

Summary

In this article we have covered what ensemble learning is and discussed basic ensemble techniques. We also looked into bootstrap sampling, which involves iteratively resampling a dataset with replacement and allows the model or algorithm to get a better understanding of its various features. Then we moved on to bagging, followed by random forest. We also implemented random forest in Python for both regression and classification, and came to the conclusion that increasing the number of trees or estimators does not always make a difference in a classification problem. However, in regression there is an impact.

We have covered most of the topics related to algorithms in our series of machine learning blogs. If you are inspired by the opportunities provided by machine learning, enrol in our Data Science and Machine Learning Courses for more lucrative career options in this landscape.