What is Linear Regression in Machine Learning

Machine Learning, being a subset of Artificial Intelligence (AI), has been playing a dominant role in our daily lives. Data science engineers and developers working in various domains are widely using machine learning algorithms to make their tasks simpler and life easier. For example, certain machine learning algorithms enable Google Maps to find the fastest route to our destinations, allow Tesla to make driverless cars, help Amazon to generate almost 35% of their annual income, AccuWeather to get the weather forecast of 3.5 million locations weeks in advance, Facebook to automatically detect faces and suggest tags and so on.

In statistics and machine learning, linear regression is one of the most popular and well-understood algorithms. Most data science enthusiasts and machine learning fanatics begin their journey with linear regression algorithms. In this article, we will look into how the linear regression algorithm works and how it can be used efficiently in your machine learning projects to build better models.

Linear Regression is a machine learning algorithm where the result is predicted using known parameters which are correlated with the output. It is used to predict values within a continuous range rather than trying to classify them into categories. The known parameters are used to fit a straight line with a constant slope, which is then used to predict the unknown value, i.e. the result.

What is a Regression Problem?

The majority of machine learning algorithms fall under the supervised learning category. Supervised learning is the process where an algorithm is used to predict a result based on previously entered values and the results generated from them. Suppose we have an input variable ‘x’ and an output variable ‘y’ where y is a function of x (y = f(x)). Supervised learning learns from the recorded values of the input variable ‘x’ and the resulting variable ‘y’ so that it can later predict a highly accurate output ‘y’ from a new value of ‘x’. A regression problem is one where the resulting variable contains a real or continuous value. It tries to draw the line of best fit from the data gathered from a number of points.

For example, which of these is a regression problem?

  • How much gas will I spend if I drive for 100 miles?
  • What is the nationality of a person?
  • What is the age of a person?
  • Which is the closest planet to the Sun?

Predicting the amount of gas to be spent and the age of a person are regression problems. Predicting nationality is categorical and the closest planet to the Sun is discrete.

What is Linear Regression?

Let’s say we have a dataset which contains information about the relationship between ‘number of hours studied’ and ‘marks obtained’. A number of students have been observed and their hours of study along with their grades are recorded. This will be our training data. Our goal is to design a model that can predict the marks if the number of hours studied is provided. Using the training data, a regression line is obtained which gives minimum error. This linear equation is then applied to new data. That is, if we give the number of hours studied by a student as an input, our model should be able to predict their mark with minimum error.

Hypothesis of Linear Regression

The linear regression model can be represented by the following equation:

Y = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

where,

Y is the predicted value

θ₀ is the bias term.

θ₁, …, θₙ are the model parameters

x₁, x₂, …, xₙ are the feature values.

The above hypothesis can also be represented by

Y = θᵀx

Where, θ is the model’s parameter vector including the bias term θ₀; x is the feature vector with x₀ =1

Y (pred) = b0 + b1*x

The values b0 and b1 must be chosen so that the error is minimum. If sum of squared error is taken as a metric to evaluate the model, then the goal is to obtain a line that best reduces the error.

Sum of squared errors = Σ (Y_actual – Y_pred)²

If we don’t square the error, then the positive and negative points will cancel each other out.

For a model with one predictor,

b1 = Σ (x – Mean[x]) * (y – Mean[y]) / Σ (x – Mean[x])²

b0 = Mean[y] – b1 * Mean[x]

Exploring ‘b1’

If b1 > 0, then x (predictor) and y(target) have a positive relationship. That is an increase in x will increase y.

If b1 < 0, then x (predictor) and y(target) have a negative relationship. That is an increase in x will decrease y.

Exploring ‘b0’

If the observed data does not include x=0, then a prediction made with only b0 is meaningless. For example, suppose we have a dataset that relates height (x) and weight (y). Taking x=0 (that is, height as 0) makes the equation reduce to the b0 value alone, which is completely meaningless because in reality height and weight can never be zero. This happens when the model is interpreted beyond the range of the data it was fit on.

If the observed data does include x=0, then ‘b0’ is the average of the predicted values when x=0. But setting all the predictor variables to zero is often impossible in practice.

The b0 term guarantees that the residuals have mean zero. If there is no ‘b0’ term, the regression line is forced to pass through the origin, and both the regression coefficient and the predictions will be biased.

How does Linear Regression work?

Let’s look at a scenario where linear regression might be useful: losing weight. Let us consider that there’s a connection between how many calories you take in and how much you weigh; regression analysis can help you understand that connection. Regression analysis will provide you with a relation which can be visualized into a graph in order to make predictions about your data. For example, if you’ve been putting on weight over the last few years, it can predict how much you’ll weigh in the next ten years if you continue to consume the same amount of calories and burn them at the same rate.

The goal of regression analysis is to create a trend line based on the data you have gathered. This then allows you to determine whether other factors apart from the amount of calories consumed affect your weight, such as the number of hours you sleep, work pressure, level of stress, type of exercises you do etc. Before taking these factors into account, we need to examine them and determine whether there is a correlation between them. Linear Regression can then be used to draw a trend line which can then be used to confirm or deny the relationship between attributes. If the test is done over a long time duration, extensive data can be collected and the result can be evaluated more accurately. By the end of this article we will build a model which looks like the picture below, i.e. determine a line which best fits the data.

[Figure: data points scattered around the line of best fit]

How do we determine the best fit line?

The best fit line is considered to be the line for which the error between the predicted values and the observed values is minimum. It is also called the regression line, and the errors are also known as residuals. The figure below shows the residuals; they can be visualized as the vertical lines from the observed data values to the regression line.

[Figure: residuals shown as vertical lines from the observed values to the regression line]

When to use Linear Regression?

Linear Regression’s power lies in its simplicity, which means that it can be used to solve problems across various fields. First, the observed data needs to be collected and plotted to check whether it falls roughly along a line. If the differences between the predicted values and the observed results are consistently small, we can use linear regression for the problem.

Assumptions in linear regression

If you are planning to use linear regression for your problem then there are some assumptions you need to consider:

  • The relation between the dependent and independent variables should be almost linear.
  • The data should be homoscedastic, meaning the variance of the residuals should be roughly constant across the range of the independent variable.
  • The results obtained from one observation should not be influenced by the results obtained from the previous observation (the observations should be independent).
  • The residuals should be normally distributed. This assumption means that the probability density function of the residual values is normally distributed at each value of the independent variable.

You can determine whether your data meets these conditions by plotting it and then doing a bit of digging into its structure.
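
To make these checks concrete, here is a minimal sketch (on synthetic data, so the variable names and numbers are purely illustrative) that fits a line and then plots the residuals against the fitted values and as a histogram; a roughly constant spread suggests homoscedasticity, and a roughly bell-shaped histogram suggests normally distributed residuals.

import numpy as np
from matplotlib import pyplot as plt

# Synthetic data with an approximately linear relationship
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 3 + 1.5 * x + rng.normal(0, 1.0, 200)

# Fit a simple linear model and compute the residuals
b1, b0 = np.polyfit(x, y, deg=1)          # slope, intercept
residuals = y - (b0 + b1 * x)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(b0 + b1 * x, residuals, s=10)
axes[0].axhline(0, color='red')
axes[0].set_title('Residuals vs fitted values')    # look for constant spread and no pattern
axes[1].hist(residuals, bins=20)
axes[1].set_title('Distribution of residuals')     # look for a roughly normal shape
plt.show()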

Few properties of Regression Line

Here are a few features a regression line has:

  • The regression line passes through the mean of the independent variable (x) as well as the mean of the dependent variable (y).
  • The regression line minimizes the sum of the “Square of Residuals”. That’s why the method of Linear Regression is known as “Ordinary Least Squares (OLS)”. We will discuss Ordinary Least Squares in more detail later on.
  • B1 explains the change in Y for a one-unit change in x. In other words, increasing ‘x’ by one unit changes the predicted value of Y by B1.

Finding a Linear Regression line

Let’s say we want to predict ‘y’ from ‘x’ given in the following table and assume they are correlated as “y=B0+B1∗x”

x | y | Predicted 'y'
1 | 2 | B0 + B1∗1
2 | 1 | B0 + B1∗2
3 | 3 | B0 + B1∗3
4 | 6 | B0 + B1∗4
5 | 9 | B0 + B1∗5
6 | 11 | B0 + B1∗6
7 | 13 | B0 + B1∗7
8 | 15 | B0 + B1∗8
9 | 17 | B0 + B1∗9
10 | 20 | B0 + B1∗10

where,

Std. Dev. of x: 3.02765
Std. Dev. of y: 6.617317
Mean of x: 5.5
Mean of y: 9.7
Correlation between x & y: 0.989938

If the Residual Sum of Squares (RSS) is differentiated with respect to B0 & B1 and the results are equated to zero, we get the following equations:

B1 = Correlation * (Std. Dev. of y/ Std. Dev. of x)

B0 = Mean(Y) – B1 * Mean(X)

Putting the values from the table above into these equations,

B1 = 0.989938 * (6.617317 / 3.02765) ≈ 2.16

B0 = 9.7 – 2.16 * 5.5 ≈ -2.2

Hence, the least squares regression equation becomes –

Y = -2.2 + 2.16*x
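
As a quick check of the arithmetic, the following short NumPy snippet (using only the x and y values from the table above) recomputes B1 and B0 from the equivalent covariance-over-variance form:

import numpy as np

x = np.arange(1, 11)
y = np.array([2, 1, 3, 6, 9, 11, 13, 15, 17, 20])

# B1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()

print(f'B1 = {b1:.2f}')   # approximately 2.16
print(f'B0 = {b0:.2f}')   # approximately -2.20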

x | Y - Actual | Y - Predicted
1 | 2 | -0.04
2 | 1 | 2.13
3 | 3 | 4.29
4 | 6 | 6.45
5 | 9 | 8.62
6 | 11 | 10.78
7 | 13 | 12.95
8 | 15 | 15.11
9 | 17 | 17.27
10 | 20 | 19.44

As there are only 10 data points, the results are not highly accurate, but the correlation between the predicted and the actual values has turned out to be very high; the two lines move almost together. Here is the graph for visualizing our predicted values:

[Figure: actual vs predicted values plotted against x]

Model Performance

After the model is built, if we see that the difference between the predicted and the actual values is not much, it is considered to be a good model and can be used to make future predictions. How much difference counts as “not much” depends entirely on the task you want to perform and on how much variation in the data can be tolerated. Here are a few metrics we can use to calculate the error in the model:

R-Squared (R²)


Total Sum of Squares (TSS): The sum of squares is a measure of how a data set varies around a central number (like the mean). The Total Sum of Squares tells you how much variation there is in the dependent variable.

TSS = Σ (Y – Mean[Y])²

Residual Sum of Squares (RSS): The residual sum of squares tells you how much of the dependent variable’s variation your model did not explain. It is the sum of the squared differences between the actual Y and the predicted Y.

RSS = Σ (Y – Y_pred)²

(TSS – RSS) measures the amount of variability in the response that is explained by performing the regression, so R² = (TSS – RSS) / TSS = 1 – RSS / TSS.

Properties of R²

  • R² always ranges between 0 and 1.
  • R² of 0 means that there is no correlation between the dependent and the independent variable.
  • R² of 1 means the dependent variable can be predicted from the independent variable without any error.
  • An R² between 0 and 1 indicates the extent to which the dependent variable is predictable. An R² of 0.20 means that 20% of the variance in Y is predictable from X; an R² of 0.40 means that 40% is predictable; and so on.

Root Mean Square Error (RMSE)

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). The formula for calculating RMSE is:

RMSE = √( Σ (Y_actual – Y_pred)² / N )

Where N : Total number of observations

When standardized observations are used as RMSE inputs, there is a direct relationship with the correlation coefficient. For example, if the correlation coefficient is 1, the RMSE will be 0, because all of the points lie on the regression line (and therefore there are no errors).

Mean Absolute Percentage Error (MAPE)

RMSE has certain limitations, so analysts often prefer MAPE, which expresses the error as a percentage and makes it easier to compare how different candidate models perform on the same task. The formula for calculating MAPE can be written as:

MAPE = (1/N) Σ | (Y_actual – Y_pred) / Y_actual | × 100

Where N : Total number of observations
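
As a small illustration (reusing the x and y values from the earlier table and the fitted line Y = -2.2 + 2.16*x), the snippet below shows how TSS, RSS, R², RMSE and MAPE can be computed by hand with NumPy:

import numpy as np

x = np.arange(1, 11)
y = np.array([2, 1, 3, 6, 9, 11, 13, 15, 17, 20], dtype=float)
y_pred = -2.2 + 2.16 * x                     # predictions from the fitted line

tss = np.sum((y - y.mean())**2)              # Total Sum of Squares
rss = np.sum((y - y_pred)**2)                # Residual Sum of Squares
r2 = (tss - rss) / tss                       # R-Squared
rmse = np.sqrt(np.mean((y - y_pred)**2))     # Root Mean Square Error
mape = np.mean(np.abs((y - y_pred) / y)) * 100   # in percent; assumes no actual value is zero

print(f'R2 = {r2:.3f}, RMSE = {rmse:.3f}, MAPE = {mape:.1f}%')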

Feature Selection

Feature selection is the automatic selection of the attributes in your data that are most relevant to the predictive model you are working on. It seeks to reduce the number of attributes in the dataset by eliminating the features which are not required for model construction. Feature selection does not totally eliminate an attribute which is considered for the model; rather, it mutes that particular characteristic and works with the features which affect the model.

Feature selection method aids your mission to create an accurate predictive model. It helps you by choosing features that will give you as good or better accuracy whilst requiring less data. Feature selection methods can be used to identify and remove unnecessary, irrelevant and redundant attributes from the data that do not contribute to the accuracy of the model or may even decrease the accuracy of the model. Having fewer attributes is desirable because it reduces the complexity of the model, and a simpler model is easier to understand, explain and to work with.

Feature Selection Algorithms (a short scikit-learn sketch of all three follows the list):

  • Filter Method: This method involves assigning scores to individual features and ranking them. The features that have very little to almost no impact are removed from consideration while constructing the model.
  • Wrapper Method: The wrapper method is quite similar to the filter method, except that it considers attributes in groups: a set of attributes is taken and checked for its impact on the model, and if the impact is unsatisfactory another combination is tried.
  • Embedded Method: Embedded method is the best and most accurate of all the algorithms. It learns the features that affect the model while the model is being constructed and takes into consideration only those features. The most common type of embedded feature selection methods are regularization methods.
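
To make the three families concrete, here is a minimal scikit-learn sketch on synthetic data (the dataset, the number of features to keep and the alpha value are illustrative assumptions, not part of the original example):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression, Lasso

# Synthetic data: 10 features, only 4 of which are actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

# Filter method: score each feature individually and keep the best k
filter_selector = SelectKBest(score_func=f_regression, k=4).fit(X, y)
print('Filter keeps features:', np.where(filter_selector.get_support())[0])

# Wrapper method: repeatedly fit a model and drop the weakest features
wrapper_selector = RFE(LinearRegression(), n_features_to_select=4).fit(X, y)
print('Wrapper keeps features:', np.where(wrapper_selector.get_support())[0])

# Embedded method: L1 regularization drives irrelevant coefficients to zero
lasso = Lasso(alpha=1.0).fit(X, y)
print('Embedded keeps features:', np.where(lasso.coef_ != 0)[0])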

Cost Function

The cost function helps us figure out the best possible coefficients, which can then be used to draw the line of best fit for the data points. Since we want to reduce the error of the resulting value, we turn the fitting process into one that minimizes the error between the predicted value and the actual value.

J = (1/n) Σ (Y_pred – Y_actual)²

Here, J is the cost function.

The function is written in this form to measure the error between the predicted values and the actual values. We take the squared difference at each data point, sum these squares over all the data points, and divide by the total number of data points. This cost function J is also called the Mean Squared Error (MSE) function. Using this MSE function we are going to choose parameter values such that the MSE settles at its minimum, reducing the cost function.

Gradient Descent

Gradient Descent is an optimization algorithm that helps machine learning models find their way to a minimum value using repeated steps. Gradient descent is used to minimize a function so that it gives the lowest output of that function. This function is called the Loss Function. The loss function shows us how much error is produced by the machine learning model compared to actual results. Our aim should be to lower the cost function as much as possible. One way of achieving a low cost function is by the process of gradient descent. When the equations are too complex to solve analytically, taking the partial derivative of the cost function with respect to each parameter gives the direction in which to adjust that coefficient towards its optimal value. You may refer to the article on Gradient Descent for Machine Learning.
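
Below is a small sketch of gradient descent for simple linear regression, reusing the x and y values from the earlier table (the learning rate and iteration count are illustrative choices). It minimizes the MSE cost by repeatedly stepping along the negative partial derivatives with respect to b0 and b1:

import numpy as np

x = np.arange(1, 11, dtype=float)
y = np.array([2, 1, 3, 6, 9, 11, 13, 15, 17, 20], dtype=float)

b0, b1 = 0.0, 0.0            # start from an arbitrary point
lr = 0.01                    # learning rate
for _ in range(10000):
    y_pred = b0 + b1 * x
    error = y_pred - y
    # Partial derivatives of the MSE cost J = mean((y_pred - y)^2)
    grad_b0 = 2 * error.mean()
    grad_b1 = 2 * (error * x).mean()
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

print(f'b0 = {b0:.2f}, b1 = {b1:.2f}')   # converges towards the least squares solution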

Simple Linear Regression

Optimization is a big part of machine learning, and almost every machine learning algorithm has an optimization technique at its core for increased efficiency. Gradient Descent is one such optimization algorithm, used to find the values of the coefficients of a function that minimize the cost function. Gradient Descent is best applied when the solution cannot be obtained by analytical methods (linear algebra) and must be obtained by an optimization technique.

Residual Analysis: Simple linear regression models the relationship between the magnitude of one variable and that of a second; for example, as x increases, y also increases, or as x increases, y decreases. Correlation is another way to measure how two variables are related. The models built by simple linear regression estimate or try to predict the actual result, but most often they deviate from it. Residual analysis is used to calculate by how much the estimated value has deviated from the actual result.

Null Hypothesis and p-value: During feature selection, the null hypothesis is used to find which attributes will not affect the result of the model. Hypothesis tests are used to test the validity of a claim that is made about a particular attribute of the model. This claim that’s on trial, in essence, is called the null hypothesis. A p-value helps to determine the significance of the results (a short example follows the list below). A p-value is a number between 0 and 1 and is interpreted in the following way:

  • A small p-value (less than 0.05) indicates strong evidence against the null hypothesis, so the null hypothesis is rejected.
  • A large p-value (greater than 0.05) indicates weak evidence against the null hypothesis, so the null hypothesis is retained.
  • A p-value very close to the cut-off (around 0.05) is considered marginal (it could go either way). In this case, the p-value should be reported so that readers can draw their own conclusions.
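
As a small, hedged illustration of reading p-values in practice (the data is synthetic and the variable names are made up for this example), statsmodels reports a p-value for each coefficient when an OLS model is fitted:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
useful = rng.normal(size=n)                  # truly related to y
noise_feature = rng.normal(size=n)           # unrelated to y
y = 4 + 2.5 * useful + rng.normal(size=n)

X = sm.add_constant(np.column_stack([useful, noise_feature]))
result = sm.OLS(y, X).fit()
print(result.pvalues)   # expect a tiny p-value for 'useful' and, typically, a large one for 'noise_feature'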

Ordinary Least Squares

Ordinary Least Squares (OLS), also known as ordinary least squares regression or least squared errors regression, is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function with the goal of minimizing the sum of the squared differences between the observed values of the dependent variable and the values predicted by the linear function, i.e. it tries to capture the relationship between them.

There are two types of relationships that may occur: linear and curvilinear. A linear relationship is a straight line drawn through the central tendency of the points, whereas a curvilinear relationship is a curved line. The association between the variables is depicted with a scatter plot. The relationship could be positive or negative, and its strength can also vary.

The advantage of using Ordinary Least Squares regression is that it is easy to interpret and is highly compatible with the efficient linear algebra routines built into modern computers. It can be applied to problems with many independent variables and scales efficiently to thousands of data points. In Linear Regression, OLS is used to estimate the unknown parameters by creating a model which minimizes the sum of the squared errors between the observed data and the predicted values.

Let us simulate some data and look at how the predicted values (Yₑ) differ from the actual value (Y):

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

# Generate 'random' data
np.random.seed(0)
X = 2.5 * np.random.randn(100) + 1.5         # Array of 100 values with mean = 1.5, stddev = 2.5
res = 0.5 * np.random.randn(100)         # Generate 100 residual terms
y = 2 + 0.3 * X + res                    # Actual values of Y

# Create pandas dataframe to store our X and y values
df = pd.DataFrame(
    {'X': X,
      'y': y}
)

# Show the first five rows of our dataframe
df.head()

       X         y
0  5.910131  4.714615
1  2.500393  2.076238
2  3.946845  2.548811
3  7.102233  4.615368
4  6.168895  3.264107

To estimate y using the OLS method, we need to calculate xmean and ymean, the covariance of X and y (xycov), and the variance of X (xvar) before we can determine the values for alpha and beta.

# Calculate the mean of X and y
xmean = np.mean(X)
ymean = np.mean(y)

# Calculate the terms needed for the numerator and denominator of beta
df['xycov'] = (df['X'] - xmean) * (df['y'] - ymean)
df['xvar'] = (df['X'] - xmean)**2

# Calculate beta and alpha
beta = df['xycov'].sum() / df['xvar'].sum()
alpha = ymean - (beta * xmean)
print(f'alpha = {alpha}')
print(f'beta = {beta}')
alpha = 2.0031670124623426
beta = 0.3229396867092763

Now that we have an estimate for alpha and beta, we can write our model as Yₑ = 2.003 + 0.323 X, and make predictions:

ypred = alpha + beta * X

Let’s plot our prediction ypred against the actual values of y, to get a better visual understanding of our model.

# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(X, ypred) # regression line
plt.plot(X, y, 'ro')   # scatter plot showing actual data
plt.title('Actual vs Predicted')
plt.xlabel('X')
plt.ylabel('y')

plt.show()


The blue line in the above graph is our line of best fit, Yₑ = 2.003 + 0.323 X.  If you observe the graph carefully, you will notice that there is a linear relationship between X and Y. Using this model, we can predict Y from any values of X. For example, for X = 8,

Yₑ = 2.003 + 0.323 (8) = 4.587

Regularization

Regularization is a technique used in regression to shrink the coefficient estimates towards zero. This helps to discount patterns that don’t actually represent the true properties of the model but have appeared by random chance, by limiting how strongly points that deviate from the line of best fit by a large extent can influence the estimates. Earlier we saw that to estimate the regression coefficients β with the least squares method, we must minimize the Residual Sum of Squares (RSS). Let the RSS equation in this case be:

RSS = Σᵢ ( yᵢ – β₀ – Σⱼ βⱼ xᵢⱼ )²

The general linear regression model can be expressed using a condensed formula:

Y = Xβ + ε, where each row of X holds the feature values of one observation with a leading 1 for the intercept term

Here, β = [β₀, β₁, …, βₚ]

Minimizing the RSS fits the coefficients β to the training data. If the coefficients adapt too closely to the peculiarities of the training data, they won't generalize well to future data. This is where regularization comes in: it shrinks, or regularizes, these learned estimates towards zero.

Ridge regression

Ridge regression is very similar to least squares, except that the Ridge coefficients are estimated by minimizing a different quantity. In particular, the Ridge regression coefficients β are the values that minimize the following quantity:

Σᵢ (yᵢ − β₀ − β₁xᵢ₁ − ⋯ − βₚxᵢₚ)² + λ Σⱼ βⱼ² = RSS + λ Σⱼ βⱼ²

Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of the model. λ controls the relative impact of the two components: the RSS and the penalty term. If λ = 0, ridge regression produces the same estimates as the least squares method. As λ → ∞, all estimated coefficients tend to zero. Ridge regression produces different estimates for different values of λ, so the choice of λ is crucial and should be made with cross-validation. Because the penalty is based on the L2 norm of the coefficient vector, ridge regression is also known as L2 regularization.

The coefficients produced by the Ordinary Least Squares method are scale equivariant: if an input variable is multiplied by a constant, its coefficient is simply divided by the same constant, so the product of the coefficient and the input variable stays the same. The same is not true for ridge regression, so we need to bring the input variables to the same scale before fitting. To standardize the variables, we subtract their means and divide by their standard deviations.
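As a rough sketch, ridge regression can be fitted with scikit-learn (used again later in this article); here the simulated X and y from the earlier example stand in for real data, the single predictor is standardized first, and note that scikit-learn calls the tuning parameter alpha rather than λ:

from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler

# Standardize the predictor, then fit ridge regression for a chosen penalty strength
X_std = StandardScaler().fit_transform(X.reshape(-1, 1))
ridge = Ridge(alpha=1.0)              # alpha here plays the role of the tuning parameter λ
ridge.fit(X_std, y)
print(ridge.intercept_, ridge.coef_)

# In practice λ is chosen by cross-validation, e.g. with RidgeCV
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X_std, y)
print(ridge_cv.alpha_)                # the value of λ selected by cross-validation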

Lasso Regression

Least Absolute Shrinkage and Selection Operator (LASSO) regression also shrinks the coefficients by adding a penalty to the sum of squares of the residuals, but the lasso penalty has a slightly different effect. The lasso penalty is the sum of the absolute values of the coefficient vector, which corresponds to its L1 norm. Hence, the lasso estimate is defined by:

Σᵢ (yᵢ − β₀ − β₁xᵢ₁ − ⋯ − βₚxᵢₚ)² + λ Σⱼ |βⱼ| = RSS + λ Σⱼ |βⱼ|

As with ridge regression, the input variables need to be standardized. The lasso penalty makes the solution nonlinear, and there is no closed-form expression for the coefficients as there is in ridge regression. Instead, the lasso solution is a quadratic programming problem, but efficient algorithms exist that compute the entire path of coefficients for different values of λ at roughly the same computational cost as ridge regression.

The lasso penalty has the effect of gradually shrinking some coefficients exactly to zero as the amount of regularization increases. For this reason, the lasso can also be used to perform feature selection, retaining only a subset of the variables.
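To see this shrinkage-to-zero behaviour, here is a small sketch on made-up synthetic data in which only the first two of five features actually drive the target; with a suitable penalty, the lasso pushes the remaining coefficients to (or very close to) zero:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 features, but only the first two influence the target
rng = np.random.RandomState(42)
X_syn = rng.randn(200, 5)
y_syn = 3 * X_syn[:, 0] - 2 * X_syn[:, 1] + 0.5 * rng.randn(200)

lasso = Lasso(alpha=0.1)                                   # alpha plays the role of λ
lasso.fit(StandardScaler().fit_transform(X_syn), y_syn)
print(lasso.coef_)   # coefficients of the irrelevant features are shrunk to (or near) zero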

Linear Regression with multiple variables

Linear regression with multiple variables is also known as "multivariate linear regression". We now introduce notation for equations where we can have any number of input variables.

xⱼ⁽ⁱ⁾ = value of feature j in the ith training example

x⁽ⁱ⁾ = the input (features) of the ith training example

m = the number of training examples

n = the number of features

The multivariable form of the hypothesis function accommodating these multiple features is as follows:

hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ + ⋯ + θₙxₙ

In order to develop intuition about this function, we can think about θ0 as the basic price of a house, θ1 as the price per square meter, θ2 as the price per floor, etc. x1 will be the number of square meters in the house, x2 the number of floors, etc.

Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

hθ(x) = θᵀx, where θ = [θ₀, θ₁, …, θₙ] and x = [x₀, x₁, …, xₙ]ᵀ

This is a vectorization of our hypothesis function for one training example.

Remark: Note that for convenience we assume x₀⁽ⁱ⁾ = 1 for i ∈ 1,…,m. This allows us to do matrix operations with θ and x, making the two vectors θ and x⁽ⁱ⁾ match element-wise (that is, both have n+1 elements).
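As a quick numerical check of the vectorized form, here is a tiny sketch with made-up values for θ and a single training example x (with x₀ = 1 prepended):

import numpy as np

theta = np.array([2.0, 0.3, 0.5])     # [θ0, θ1, θ2]
x = np.array([1.0, 8.0, 3.0])         # [x0 = 1, x1, x2]

h = theta @ x                         # hθ(x) = θᵀx
print(h)                              # 2.0 + 0.3*8 + 0.5*3 = 5.9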

Multiple Linear Regression

How is it different?

In simple linear regression we use a single independent variable to predict the value of a dependent variable whereas in multiple linear regression two or more independent variables are used to predict the value of a dependent variable. The difference between the two is the number of independent variables. In both cases there is only a single dependent variable.

Multicollinearity

Multicollinearity describes the strength of the relationships among the independent variables: it is a state of very high intercorrelation or inter-association between them. It is a type of disturbance in the data, and if it is present, the statistical inferences made about the data may not be reliable. The Variance Inflation Factor (VIF) is used to identify multicollinearity; as a common rule of thumb, if a variable's VIF is greater than 4, we exclude that variable from the model.
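As a minimal sketch of computing the VIF with statsmodels (X_df here is a hypothetical pandas dataframe containing only the predictor columns):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Add a constant column because the VIF calculation assumes the design matrix has an intercept
X_design = add_constant(X_df)
vif = pd.Series(
    [variance_inflation_factor(X_design.values, i) for i in range(X_design.shape[1])],
    index=X_design.columns
)
print(vif)   # predictors with a VIF above the chosen threshold (e.g. 4) are candidates for removal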

There are certain reasons why multicollinearity occurs:

  • It is caused by an inaccurate use of dummy variables.
  • It is caused by the inclusion of a variable which is computed from other variables in the data set.
  • Multicollinearity can also result from the repetition of the same kind of variable.
  • It generally occurs when the variables are highly correlated with each other.

Multicollinearity can result in several problems. These problems are as follows:

  • The partial regression coefficient due to multicollinearity may not be estimated precisely. The standard errors are likely to be high.
  • Multicollinearity results in a change in the signs as well as in the magnitudes of the partial regression coefficients from one sample to another sample.
  • Multicollinearity makes it tedious to assess the relative importance of the independent variables in explaining the variation in the dependent variable.

Iterative Models

Models should be tested and improved repeatedly for better performance. Multiple iterations allow the model to learn from its previous results and take them into consideration while performing the task again.

Making predictions with Linear Regression

Linear regression can be used to predict the value of an unknown variable from a known variable with the help of a straight line (also called the regression line). The prediction should only be made if a significant correlation between the known and the unknown variable has been established, using both the correlation coefficient and a scatter plot.

The general procedure for using regression to make good predictions is the following:

  • Research the subject-area so that the model can be built based on the results produced by similar models. This research helps with the subsequent steps.
  • Collect data for appropriate variables which have some correlation with the model.
  • Specify and assess the regression model.
  • Run repeated tests so that the model has more data to work with.

To test if the model is good enough observe whether:

  • The scatter plot forms a linear pattern.
  • The correlation coefficient r has a value above 0.5 or below -0.5. A positive value indicates a positive relationship and a negative value indicates a negative relationship.

If the correlation coefficient shows a strong relationship between the variables but the scatter plot is not linear, the results can be misleading. Examples of how to use linear regression were shown earlier.
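As a rough sketch of this check, reusing the simulated X and y from the earlier example as a stand-in for real data:

# Compute the correlation coefficient and eyeball the scatter plot
r = np.corrcoef(X, y)[0, 1]
print(f'r = {r:.2f}')                 # above 0.5 (or below -0.5) suggests a usable linear relationship

plt.scatter(X, y)
plt.title(f'Scatter plot of X vs y (r = {r:.2f})')
plt.show()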

Data preparation for Linear Regression

Step 1: Linear Assumption

The first step of data preparation is checking which independent variables have a roughly linear relationship with the dependent variable, and keeping those.

Step 2: Remove Noise

This is the process of reducing the number of attributes in the dataset by eliminating features that contribute little or nothing to the construction of the model.

Step 3: Remove Collinearity

Collinearity measures the strength of the relationship between independent variables. If two or more variables are highly collinear, it does not make sense to keep all of them in the model, so we retain only one of them.

Step 4: Gaussian Distributions

The linear regression model will produce more reliable results if the input and output variables have a Gaussian distribution. The central limit theorem states that the sample mean from a large population is approximately normal, or Gaussian, with mean equal to that of the underlying population and variance equal to the population variance divided by the sample size. The approximation improves as the sample size gets larger.

Step 5: Rescale Inputs

The linear regression model will produce more reliable predictions if the input variables are rescaled using standardization or normalization.
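A minimal sketch of both options with scikit-learn (X_raw is a hypothetical feature matrix whose columns sit on very different scales):

from sklearn.preprocessing import StandardScaler, MinMaxScaler

X_standardized = StandardScaler().fit_transform(X_raw)   # each column rescaled to mean 0, stddev 1
X_normalized = MinMaxScaler().fit_transform(X_raw)        # each column rescaled to the [0, 1] range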

Linear Regression with statsmodels

We have already discussed the OLS method; now let's see how to use it via the statsmodels library. For this we will use the popular advertising dataset. Here we will look only at the TV variable and explore whether spending on TV advertising can predict the number of sales for the product. Let's start by importing the CSV file as a pandas dataframe using read_csv():

# Import and display first five rows of advertising dataset
advert = pd.read_csv('advertising.csv')
advert.head()

      TV  Radio  Newspaper  Sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3   12.0
3  151.5   41.3       58.5   16.5
4  180.8   10.8       58.4   17.9

Now we will use statsmodels' formula API (smf.ols) to initialize a simple linear regression model. It takes a formula of the form y ~ X, where X is the predictor variable (TV advertising costs) and y is the output variable (Sales). Then we fit the model by calling the OLS object's fit() method.

import statsmodels.formula.api as smf

# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('Sales ~ TV', data=advert)
model = model.fit()

Once we have fitted the simple regression model, we can use the .predict method to obtain predicted sales values from the equation we just derived, and visualise the regression by plotting sales_pred against the TV advertising costs to see the line of best fit.

# Predict values
sales_pred = model.predict()

# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(advert['TV'], advert['Sales'], 'o')       # scatter plot showing actual data
plt.plot(advert['TV'], sales_pred, 'r', linewidth=2)   # regression line
plt.xlabel('TV Advertising Costs')
plt.ylabel('Sales')
plt.title('TV vs Sales')

plt.show()

In the above graph you can see that there is a positive linear relationship between TV advertising costs and Sales: spending more on TV advertising is associated with a higher number of sales.
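The fitted statsmodels results object also exposes the estimated coefficients and a full regression summary, which is a quick way to sanity-check the model we just plotted:

# Inspect the fitted model: intercept, TV coefficient, R-squared, p-values, etc.
print(model.params)
print(model.summary())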

Linear Regression with scikit-learn

Let us learn to implement linear regression models using sklearn. For this model as well, we will continue to use the advertising dataset but now we will use two predictor variables to create a multiple linear regression model. 

Yₑ = α + β₁X₁ + β₂X₂ + … + βₚXₚ, where p is the number of predictors.

In our example, we will be predicting Sales using the variables TV and Radio i.e. our model can be written as:

Sales = α + β₁*TV + β₂*Radio

from sklearn.linear_model import LinearRegression

# Build linear regression model using TV and Radio as predictors
# Split data into predictors X and output Y
predictors = ['TV', 'Radio']
X = advert[predictors]
y = advert['Sales']

# Initialise and fit model
lm = LinearRegression()
model = lm.fit(X, y)
print(f'alpha = {model.intercept_}')
print(f'betas = {model.coef_}')
alpha = 4.630879464097768
betas = [0.05444896 0.10717457]
model.predict(X)   # predicted sales for the rows of the training data

Now that we have fitted a multiple linear regression model to our data, we can predict sales from any combination of TV and Radio advertising costs. For example, suppose you want to know how many sales we would make if we invested $600 in TV advertising and $300 in Radio advertising. You can find out simply by:

new_X = [[600, 300]]
print(model.predict(new_X))
[69.4526273]

We get an output of 69.45, which means that if we invest $600 on TV and $300 on Radio advertising, we can expect to sell approximately 69 units.
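As a quick, rough check of how well this multiple regression model fits the training data, scikit-learn's score method returns the R² value:

# R-squared of the fitted model on the training data
print(model.score(X, y))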

Summary

Let us sum up what we have covered in this article so far —

  • How to understand a regression problem
  • What linear regression is and how it works
  • The Ordinary Least Squares method and regularization
  • Implementing linear regression in Python using the statsmodels and sklearn libraries

We have discussed a couple of ways to implement linear regression and build efficient models for certain business problems. If you are inspired by the opportunities provided by machine learning, enrol in our Data Science and Machine Learning Courses for more lucrative career options in this landscape.

Priyankur Sarkar

Data Science Enthusiast

Priyankur Sarkar loves to play with data, draw insightful results out of it, and turn those insights into business growth. He is an electronics engineer with versatile experience as both an individual contributor and a team lead, and has actively worked towards building machine learning capabilities for organizations.

Join the Discussion

Your email address will not be published. Required fields are marked *

Suggested Blogs

What are Decision Trees in Machine Learning (Classification And Regression)

Introduction to Machine Learning and its typesMachine Learning is an interdisciplinary field of study and is a sub-domain of Artificial Intelligence. It gives computers the ability to learn and infer from a huge amount of homogeneous data, without having to be programmed explicitly.Types of Machine Learning: Machine Learning can broadly be classified into three types:Supervised Learning: If the available dataset has predefined features and labels, on which the machine learning models are trained, then the type of learning is known as Supervised Machine Learning. Supervised Machine Learning Models can broadly be classified into two sub-parts: Classification and Regression. These have been discussed further in detail.Unsupervised Learning: If the available dataset has predefined features but lacks labels, then the Machine Learning algorithms perform operations on this data to assign labels to it or to reduce the dimensionality of the data. There are several types of Unsupervised Learning Models, the most common of them being: Principal Component Analysis (PCA) and Clustering.Reinforcement Learning: Reinforcement Learning is a more advanced type of learning, where, the model learns from “Experience”. Here, features and labels are not clearly defined. The model is just given a “Situation” and is rewarded or penalized based on the “Outcome”. The model thus learns to optimize the “Situation” to maximize the Rewards and hence improves the “Outcome” with “Experience”.ClassificationClassification is the process of determination/prediction of the category to which a data-point may belong to. It is the process by which a Supervised Learning Algorithm learns to draw inference from the features of a given dataset and predict which class or group or category does the particular data point belongs to.Example of Classification: Let’s assume that we are given a few images of handwritten digits (0-9). The problem statement is to “teach” the machine to classify correctly which image corresponds to which digit. A small sample of the dataset is given below:The machine has to be thus trained, such that, when given an input of any such hand-written digit, it has to correctly classify the digits and mention which digit the image represents. This is classed classification of hand-written digits.Looking at another example which is not image-based, we have 2D data (x1 and x2) which is plotted in the form of a graph shown below.The red and green dots represent two different classes or categories of data. The main goal of the classifier is that given one such “dot” of unknown class, based on its “features”, the algorithm should be able to correctly classify if that dot belongs to the red or green class. This is also shown by the line going through the middle, which correctly classifies the majority of the dots.Applications of Classification: Listed below are some of the real-world applications of classification Algorithms.Face Recognition: Face recognition finds its applications in our smartphones and any other place with Biometric security. Face Recognition is nothing but face detection followed by classification. The classification algorithm determines if the face in the image matches with the registered user or not.Medical Image Classification: Given the data of patients, a model that is well trained is often used to classify if the patient has a malignant tumor (cancer), heart ailments, fractures, etc.RegressionRegression is also a type of supervised learning. 
Unlike classification, it does not predict the class of the given data. Instead, it predicts the corresponding values of a given dataset based on the “features” it encounters.Example of Regression: For this, we will look at a dataset consisting of California Housing Prices. The contents of this dataset are shown below.Here, there are several columns. Each of the columns shows the “features” based on which the machine learning algorithm predicts the housing price (shown by yellow highlight). The primary goal of the regression algorithm is that, given the features of a given house, it should be able to correctly estimate the price of the house. This is called a regression problem. It is similar to curve fitting and is often confused with the same.Applications of Regression: Listed below are some of the real-world applications of regression Algorithms.Stock Market Prediction: Regression algorithms are used to predict the future price of stocks based on certain past features like time of the day or festival time, etc. Stock Market Prediction also falls under a subdomain of study called Time Series Analysis.Object Detection Algorithms: Object Detection is the process of detection of the location of a given object in an image or video. This process returns the coordinates of the pixel values stating the location of the object in the image. These coordinates are determined by using regression algorithms alongside classification.ClassificationRegressionAssign specific classes to the data based on its features.Predict values based on the features of the dataset.Prediction is discrete or categorical in nature.Prediction is continuous in nature.Introduction to the building blocks of Decision TreesIn order to get started with Decision Trees, it is important to understand the basic building blocks of decision trees. Hence, we start building the concepts slowly with some basic theory.1. EntropyDefinition: It is a commonly used concept in Information Theory and is a measure of “purity” of an arbitrary collection of information.Mathematical Equation:Here, given a collection S, containing positive and negative examples, the Entropy of S is given by the above equation, where, p represents the probability of occurrence of that example in the given data.In a more generalized form, Entropy is given by the following equation:Example: As an example, a sample S is taken, which contains 14 data samples and includes 9 positive and 5 negative samples. The same is denoted by the mathematical notion: [9+, 5­­–].Thus, Entropy of the given sample can be calculated as follows:2. Information GainDefinition: With the knowledge of Entropy, the amount of relevant information that is gained form a given random sample size can be calculated and is known as Information Gain.Mathematical Equation:Here, the Gain (S, A) is the Information Gain of an attribute A relative to a sample S. The Values(A) is a set of all possible values for attribute A.Example: As an example, let’s assume S is a collection of 14 training-examples. Here, in this example, we will consider the Attribute to be Wind and the values of that corresponding attribute will be Weak and Strong. In addition to the previous example information, we will assume that out of the previously mentioned 9 positives and 5 negative samples, 6 positive and 2 negative samples have the value of the attribute Wind=Weak, and the remaining have Wind=Strong. 
Thus, under such a circumstance, the information gained by the attribute Wind is shown below.Decision TreeIntroduction: Since we have the basic building blocks out of the way, let’s try to understand what exactly is a Decision Tree. As the name suggests, it is a Tree which is developed based on certain decisions taken by the algorithm in accordance with the given data that it has been trained on.In simple words, a Decision Tree uses the features in the given data to perform Supervised Learning and develop a tree-like structure (data structure) whose branches are developed in such a way that given the feature-set, the decision tree can predict the expected output relatively accurately.Example:  Let us look at the structure of a decision tree. For this, we will take up an example dataset called the “PlayTennis” dataset. A sample of the dataset is shown below.In summary, the target of the model is to predict if the weather conditions are suitable to play tennis or not, as guided by the dataset shown above.As it can be seen in the dataset, it contains certain information (features) for each day. In this, we have the feature-attributes: Outlook, Temperature, Humidity and Wind and the target-attribute PlayTennis. Each of these attributes can take up certain values, for example, the attribute Outlook has the values Sunny, Rain and Overcast.With a clear idea of the dataset, jumping a bit forward, let us look at the structure of the learned Decision Tree as developed from the above dataset.As shown above, it can clearly be seen that, given certain values for each of the attributes, the learned decision tree is capable of giving a clear answer as to whether the weather is suitable for Tennis or not.Algorithm: With the overall intuition of decision trees, let us look at the formal Algorithm:ID3(Samples, Target_attribute, Attributes):Create a root node for the TreeIf all the Samples are positive, Return a single-node tree Root, with label = +If all the Samples are negative, Return a single-node tree Root, with label = –If Attribute is empty, Return the single-node tree Root with the label = Most common value of the Target_attribute among the Samples.Otherwise:A ← the attribute from Attributes that best classifies the SamplesThe decision attribute for Root ← AFor each possible value of A:Add a new tree branch below Root, corresponding to the test A = viLet the Samplesvi be the subset of Samples that have value vi for AIf Samplesvi is empty:Then below the new branch add a leaf node with the label = most common value of Target_attribute in the samplesElse below the new branch add the subtree:ID3(Samplesvi, Target_attribute, Attributes – {A})EndReturn RootConnecting the dots: Since the overall idea of decision trees have been explained, let’s try to figure out how Entropy and Information Gain fits into this entire process.Entropy (E) is used to calculate Information Gain, which is used to identify which attribute of a given dataset provides the highest amount of information. The attribute which provides the highest amount of information for the given dataset is considered to have more contribution towards the outcome of the classifier and hence is given the higher priority in the tree.For Example, considering the PlayTennis Example, if we calculate the Information Gain for two corresponding Attributes: Humidity and Wind, we would find that Humidity plays a more important role in deciding whether to play tennis or not. Hence, in this case, Humidity is considered as a better classifier. 
The detailed calculation is shown in the figure below:Applications of Decision TreeWith the basic idea out of the way, let’s look at where decision trees can be used:Select a flight to travel: Decision trees are very good at classification and hence can be used to select which flight would yield the best “bang-for-the-buck”. There are a lot of parameters to consider, such as if the flight is connecting or non-stop, or how reliable is the service record of the given airliner, etc.Selecting alternative products: Often in companies, it is important to determine which product will be more profitable at launch. Given the sales attributes such as market conditions, competition, price, availability of raw materials, demand, etc. a Decision Tree classifier can be used to accurately determine which of the products would maximize the profits.Sentiment Analysis: Sentiment Analysis is the determination of the overall opinion of a given piece of text and is especially used to determine if the writer’s comment towards a given product/service is positive, neutral or negative. Decision trees are very versatile classifiers and are used for sentiment analysis in many Natural Language Processing (NLP) applications.Energy Consumption: It is very important for electricity supply boards to correctly predict the amount of energy consumption in the near future for a particular region. This is to make sure that un-used power can be diverted towards an area with a higher demand to keep a regular and uninterrupted supply of power throughout the grid. Decision Trees are often used to determine which region is expected to require more or less power in the up-coming time-frame.Fault Diagnosis: In the Engineering domain, one of the widely used applications of decision trees is the determination of faults. In the case of load-bearing rotatory machines, it is important to determine which of the component(s) have failed and which ones can directly or indirectly be affected by the failure. This is determined by a set of measurements that are taken. Unfortunately, there are numerous measurements to take and among them, there are some measurements which are not relevant to the detection of the fault. A Decision Tree classifier can be used to quickly determine which of these measurements are relevant in the determination of the fault.Advantages of Decision TreeListed below are some of the advantages of Decision Trees:Comprehensive: Another significant advantage of a decision tree is that it forces the algorithm to take into consideration all the possible outcomes of a decision and traces each path to a conclusion.Specific: The output of decision trees is very specific and reduces uncertainty in the prediction. Hence, they are considered as really good classifiers.Easy to use: Decision Trees are one of the simplest, yet most versatile algorithms in Machine Learning. It is based on simple math and no complex formulas. They are easy to visualize, understand and explain.Versatile: A lot of business problems can be solved using Decision Trees. They find their applications in the field of Engineering, Management, Medicine, etc. basically, any situation where data is available and a decision needs to be taken in uncertain conditions.Resistant to data abnormalities: Data is never perfect and there are always many abnormalities in the dataset. Some of the most common abnormalities are outliers, missing data and noise. 
While most Machine Learning algorithms fail with even a minor set of abnormalities, Decision Trees are very resilient and is able to handle a fair percentage of such abnormalities quite well without altering the results.Visualization of the decision taken: Often in Machine Learning models, data scientists struggle to reason as to why a certain model is giving a certain set of outputs. Unfortunately, for most of the algorithms, it is not possible to clearly determine and visualize the actual process of classification that leads to the final outcome. However, decision trees are very easy to visualize. Once the tree is trained, it can be visualized and the programmer can see exactly how and why the conclusion was reached. It is also easy to explain the outcome to a non-technical team with the “tree” type visualization. This is why many organizations prefer to use decision trees over other Machine Learning Algorithms.Limitations of Decision TreeListed below are some of the limitations of Decision Trees:Sensitivity to hyperparameter tuning: Decision Trees are very sensitive to hyperparameter tuning. Hyperparameters are those parameters which are in control of the programmer and can be tuned to get better performance out of a given model. Unfortunately, the output of a decision tree can vary drastically if the hyperparameters are inaccurately tuned.Overfitting: Decision trees are prone to overfitting. Overfitting is a concept where the model learns the data too well and hence performs well on training dataset but fails to perform on testing dataset. Decision trees are prone to overfitting if the breadth and depth of the tree is set to very high for a simpler dataset.Underfitting: Similar to overfitting, decision trees are also prone to underfitting. Underfitting is a concept where the model is too simple for it to learn the dataset effectively. Decision tree suffers from underfitting if the breadth and depth of the model or the number of nodes are set too low. This does not allow the model to fit the data properly and hence fails to learn.Code ExamplesWith the theory out of the way, let’s look at the practical implementation of decision tree classifiers and regressors.1. ClassificationIn order to conduct classification, a diabetes dataset from Kaggle has been used. It can be downloaded.The initial step for any data science application is data visualization. Hence, the dataset is shown below:The highlighted column is the target value that the model is expected to predict, given the parameters.Load the Libraries. We will be using pandas to load and manipulate data. Sklearn is used for applying Machine Learning models on the data.# Load libraries import pandas as pd from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier from sklearn.model_selection import train_test_split # Import train_test_split function from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation Load the data. 
Pandas is used to read the data from the CSV.col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label'] # load dataset pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)Feature Selection: The relevant features are selected for the classification.#split dataset in features and target variable feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree'] X = pima[feature_cols] # Features y = pima.label # Target variablesplitting the data: The dataset needs to be split into training and testing data. The training data is used to train the model, while the testing data is used to test the model’s performance on unseen data.# Split dataset into training set and test set X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% testBuilding the decision tree. These few lines initialize, train and predict on the dataset.# Create Decision Tree classifier object clf = DecisionTreeClassifier() # Train Decision Tree Classifier clf = clf.fit(X_train,y_train) #Predict the response for test dataset y_pred = clf.predict(X_test)The model’s accuracy is evaluated by using Sklearn’s metrics library. # Model Accuracy, how often is the classifier correct? print("Accuracy:",metrics.accuracy_score(y_test, y_pred)) Output: Accuracy: 0.6753246753246753This will generate the decision tree that is shown in the following image2. RegressionIn order to conduct classification, a diabetes dataset from Kaggle has been used.For this example, we will generate a Numpy Array which will simulate a scatter plot resembling a sine wave with a few randomly added noise elements. # Import the necessary modules and libraries import numpy as np from sklearn.tree import DecisionTreeRegressor import matplotlib.pyplot as plt # Create a random dataset rng = np.random.RandomState(1) X = np.sort(5 * rng.rand(80, 1), axis=0) y = np.sin(X).ravel() y[::5] += 3 * (0.5 - rng.rand(16))This time we create two regression models to experiment and see how overfitting looks like for a decision tree. Hence, we initialize the two Decision Tree Regression objects and train them on the given data.# Fit regression model regr_1 = DecisionTreeRegressor(max_depth=2) regr_2 = DecisionTreeRegressor(max_depth=5) regr_1.fit(X, y) regr_2.fit(X, y)After fitting the model, we predict on a custom test dataset and plot the results to see how it performed.# Predict X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis] y_1 = regr_1.predict(X_test) y_2 = regr_2.predict(X_test) # Plot the results plt.figure() plt.scatter(X, y, s=20, edgecolor="black",             c="darkorange", label="data") plt.plot(X_test, y_1, color="cornflowerblue",         label="max_depth=2", linewidth=2) plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2) plt.xlabel("data") plt.ylabel("target") plt.title("Decision Tree Regression") plt.legend() plt.show() The graph that is thus generated is shown below. Here we can clearly see that for a simple dataset  when we used max_depth=5 (green), the model started to overfit and learned the patterns of the noise along with the sine wave. Such kinds of models do not perform well. Meanwhile, for max_depth=3 (blue), it has fitted the dataset in a better way when compared to the other one.ConclusionIn this article, we tried to build an intuition, by starting from the basics of the theory behind the working of a decision tree classifier. 
However, covering every aspect of detail is beyond the scope of this article. Hence, it is suggested to go through this book to dive deeper into the specifics. Further, moving on, the code snippets introduces the “Hello World” of how to use both, real-world data and artificially generated data to train a Decision Tree model and predict using the same. This will allow any novice to get an overall balanced theoretical and practical idea about the workings of Classification and Regression Trees and their implementation.
Rated 4.5/5 based on 12 customer reviews
8117
What are Decision Trees in Machine Learning (Class...

Introduction to Machine Learning and its typesMach... Read More

Top Data Science Trends in 2020

Industry experts are of the view that 2020 will be a huge year for data science and AI. The expected growth rate for AI market will be at 118.6 billion by 2025. The focus areas in the overall AI market will include everything from natural language processing to robotic process automation. Since the beginning of digital era, data has been growing at the speed of light! There will only be a surge in this growth. New data will not only generate more innovative use cases but also spearhead a revolution of innovation.About 77% of the devices today have AI incorporated into them. Smart devices, Netflix recommendations, Amazon’s Alexa Google Home have transformed the way we live in the digital age.  The renowned AI-powered virtual nurses “Molly” and “Angel”, have taken healthcare to new heights and robots have already been performing various surgical procedures.Dynamic technologies like data science and AI have some intriguing data science trends to watch out for, in 2020. Check out the top 6 data science trends in 2020 any data science enthusiast should know:1. Advent of Deep LearningSimply put, deep learning is a machine learning technique that trains computers to think and act like humans i.e., by example. Ever since, deep learning models have proven their efficacy by exceeding human limitations and performance. Deep learning models are usually trained using a large set of labelled data and multi-layered neural network architectures.What’s new for Deep Learning in 2020?In 2020, deep learning will be quite significant. Its capacity to foresee and understand human behaviour and how enterprises can utilize this knowledge to stay ahead of their competitors will come in handy.2. Spotlight on Augmented AnalyticsAlso hailed as the future of Business Intelligence, Augmented analytics employs machine learning/artificial intelligence (ML/AI) techniques to automate data preparation, insight discovery and sharing, data science and ML model development, management and deployment. This can be greatly beneficial for companies to improve their offerings and customer experience. The global augmented analytics market size is projected to reach $29,856 million by 2025. Its growth rate is expected to be at a CAGR of 28.4% from 2018 to 2025.What’s New for Augmented Analytics in 2020?This year, augmented analytics platforms will help enterprises leverage social component. The use of interactive dashboards and visualizations in augmented analytics will help stakeholders share important insights and create a crystal-clear narrative that echoes the company’s mission.3. Impact of IoT, ML and AI2020 will see the rise of AI/ML, 5G, cybersecurity and IoT. The rise in automation will create opportunities for new skills to be explored. Upskilling in emerging new technologies will make professionals competent in the dynamic tech space today. As per a survey by IDC, over 75% of organizations will invest in reskilling programs or their workforce to bridge the rising skill gap by 2025.What’s new for IoT, ML and AI in 2020?It has been estimated that over 24 billion devices will be connected to the Internet of Things this year. This means industries can create a world of difference by developing smart devices that make a difference to the way we live.4. Better Mobile Analytics StrategiesMobile analytics deals with measuring and analysing data created across various mobile platform sites and applications alone. It helps businesses keep track of the behaviour of their users on mobile sites and apps. 
This technology will aid in boosting the cross-channel marketing initiatives of an enterprise, while optimizing the mobile experience and growing user engagement and retention.What’s new for Mobile Analytics in 2020?With the ever-increasing number of mobile phone users globally, there will be a heightened focus on mobile app marketing and app analytics. Currently, mobile advertising ranks first in digital advertising worldwide. This had made mobile analytics quintessential, as businesses today can track in-app traffic, potential security threats, as well as the levels of customer satisfaction.5. Enhanced Levels of CustomizationAccess to real-time data and customer behaviour has made it possible to cater to each customer’s specific needs. As customer expectations soar, companies would have to buckle up to deliver more personalized, relevant and superior customer experience. The use of data and AI will make it possible.What’s new for User Experience in 2020?Interpreting user experience can be done easily using data derived from conversions, pageviews, and other user actions. These insights will help user experience professionals make better decisions while providing the user exactly what he/she needs.6. Better Cybersecurity2019 brought to light the grim reality of data privacy and security breach. With over 24 billion devices estimated to be connected to the internet this year, stringent measures will be deployed by enterprises to protect data privacy and prevent security breach. Industry experts are of the view that a combination of cybersecurity and AI-enabled technology results will lead to effective attack surface coverage that is a 20x more effective attack than traditional methods.What’s new for Cybersecurity in 2020?Since data shows no signs of stopping in its growth, the number of threats will also keep looming on the horizon. In 2020, cybersecurity professionals would have to gear themselves to conjure new and improved ways to secure data. Therefore, AI-supported cybersecurity measures will be deployed to prevent malicious attacks from malware and ensure better cybersecurity.The Way Ahead in 2020Data Science will be one of the fastest-growing technologies in 2020. Its wide range of applications in various industries will pave way for more innovative trends like the ones above.
Rated 4.5/5 based on 0 customer reviews
5900
Top Data Science Trends in 2020

Industry experts are of the view that 2020 will be... Read More

Top 30 Machine Learning Skills required to get a Machine Learning Job

Machine learning has been making a silent revolution in our lives since the past decade. From capturing selfies with a blurry background and focused face capture to getting our queries answered by virtual assistants such as Siri and Alexa, we are increasingly depending on products and applications that implement machine learning at their core.In more basic terms, machine learning is one of the steps involved in artificial intelligence. Machines learn through machine learning. How exactly? Just like how humans learn – through training, experience, and feedback.Once machines learn through machine learning, they implement the knowledge so acquired for many purposes including, but not limited to, sorting, diagnosis, robotics, analysis and predictions in many fields.It is these implementations and applications that have made machine learning an in-demand skill in the field of programming and technology.Look at the stats that show a positive trend for machine learning projects and careers.Gartner’s report on artificial intelligence showed that as many as 2.3 million jobs in machine learning would be available across the globe by 2020.Another study from Indeed, the online job portal giant, revealed that machine learning engineers, data scientists and software engineers with these skills are topping the list of most in-demand professionals.High profile companies such as Univa, Microsoft, Apple, Google, and Amazon have invested millions of dollars on machine learning research and designing and are developing their future projects on it.With so much happening around machine learning, it is no surprise that any enthusiast who is keen on shaping their career in software programming and technology would prefer machine learning as a foundation to their career. This post is specifically aimed at guiding such enthusiasts and gives comprehensive information on skills that are needed to become a machine learning engineer, who is ready to dive into the real-time challenges.Machine Learning SkillsOrganizations are showing massive interest in using machine learning in their products, which would in turn bring plenty of opportunities for machine learning enthusiasts.When you ask machine learning engineers the question – “What do you do as a machine learning engineer?”, chances are high that individual answers would differ from one professional to another. This may sound a little puzzling, but yes, this is true!Hence, a beginner to machine learning needs to have a clear understanding that there are different roles that they can perform with machine learning skills. And accordingly the skill set that they should possess, would differ. This section will give clarity on machine learning skills that are needed to perform various machine learning roles.Broadly, three main roles come into the picture when you talk about machine learning skills:Data EngineerMachine Learning EngineerMachine Learning ScientistOne must understand that data science, machine learning and artificial intelligence are interlinked. The following quote explains this better:Data science produces insights. Machine learning produces predictions. Artificial intelligence produces actions.A machine learning engineer is someone who deals with huge volumes of data to train a machine and impart it with knowledge that it uses to perform a specified task. 
However, in practice, there may be a little more to add to this:Machine Learning RoleSkills RequiredRoles and ResponsibilitiesData EngineerPython, R, and DatabasesParallel and distributed Knowledge on quality and reliabilityVirtual machines and cloud environmentMapReduce and HadoopCleaning, manipulating and extracting the required data   Developing code for data analysis and manipulationPlays a major role in statistical analysis of dataMachine Learning EngineerConcepts of computer science and software engineeringData analysis and feature engineeringMetrics involved in MLML algorithm selection, and cross validationMath and StatisticsAnalyses and checks the suitability of an algorithm if it caters the needs of the current taskPlays main role in deciding and selecting machine learning libraries for given task.Machine Learning ScientistExpert knowledge in:Robotics and Machine LearningCognitive ScienceEngineeringMathematics and mathematical modelsDesigns new models and algorithms of machine learningResearches intensively on machine learning and publishes their research papers.Thus, gaining machine learning skills should be a task associated with clarity on the job role and of course the passion to learn them. As it is widely known, becoming a machine learning engineer is not a straightforward task like becoming a web developer or a tester.Irrespective of the role, a learner is expected to have solid knowledge on data science. Besides, many other subjects are intricately intertwined in learning machine learning and for a learner it requires a lot of patience and zeal to learn skills and build them up as they move ahead in their career.The following diagram shows the machine learning skills that are in demand year after year:AI - Artificial IntelligenceTensorFlowApache KafkaData ScienceAWS - Amazon Web Services                                                                                                                                                                                                                                                                                                                                Image SourceIn the coming sections, we would be discussing each of these skills in detail and how proficient you are expected to be in them.Technical skills required to become ML EngineerBecoming a machine learning engineer means preparing oneself to handle interesting and challenging tasks that would change the way humanity is experiencing things right now. It demands both technical and non-technical expertise. Firstly, let’s talk about the technical skills needed for a machine learning engineer. Here is a list of technical skills a machine learning engineer is expected to possess:Applied MathematicsNeural Network ArchitecturesPhysicsData Modeling and EvaluationAdvances Signal Processing TechniquesNatural Language ProcessingAudio and video ProcessingReinforcement LearningLet us delve into each skill in detail now:1.Applied MathematicsMathematics plays an important role in machine learning, and hence it is the first one on the list. If you wish to see yourself as a proven machine learning engineer, you ought to love math and be an expert in the following specializations of math.But first let us understand why a machine learning engineer would need math at all? There are many scenarios where a machine learning engineer should depend on math. 
For example:Choosing the right algorithm that suits the final needsUnderstanding and working with parameters and their settings.Deciding on validation strategiesApproximating the confidence intervals.How much proficiency in Math does a machine learning engineer need to have?It depends on the level at which a machine learning engineer works. The following diagram gives an idea about how important various concepts of math are for a machine learning enthusiast.A) Linear algebra: 15%B) Probability Theory and Statistics: 25%C) Multivariate Calculus: 15%D) Algorithms and Optimization: 15%F) Other concepts: 10%Data SourceA) Linear AlgebraThis concept plays a main role in machine learning. One has to be skilled in the following sub-topics of linear algebra:Principal Component Analysis (PCA), Singular Value Decomposition (SVD)Eigen decomposition of a matrixLU DecompositionQR Decomposition/FactorizationSymmetric MatricesOrthogonalization & OrthonormalizationMatrix OperationsProjectionsEigenvalues & EigenvectorsVector Spaces and NormsB) Probability Theory and StatisticsThe core aim of machine learning is to reduce the probability of error in the final output and decision making of the machine. Thus, it is no wonder that probability and statistics play a major role.The following topics are important in these subjects:CombinatoricsProbability Rules & AxiomsBayes’ TheoremRandom VariablesVariance and ExpectationConditional and Joint DistributionsStandard Distributions (Bernoulli, Binomial, Multinomial, Uniform and Gaussian)Moment Generating Functions, Maximum Likelihood Estimation (MLE)Prior and PosteriorMaximum a Posteriori Estimation (MAP)Sampling Methods.C) CalculusIn calculus, the following concepts have notable importance in machine learning:Integral CalculusPartial Derivatives,Vector-Values FunctionsDirectional GradientHessian, Jacobian, Laplacian and Lagrangian Distributions.D) Algorithms and OptimizationThe scalability and the efficiency of computation of a machine learning algorithm depends on the chosen algorithm and optimization technique adopted. The following areas are important from this perspective:Data structures (Binary Trees, Hashing, Heap, Stack etc)Dynamic ProgrammingRandomized & Sublinear AlgorithmGraphsGradient/Stochastic DescentsPrimal-Dual methodsE) Other ConceptsBesides, the ones mentioned above, other concepts of mathematics are also important for a learner of machine learning. They are given below:Real and Complex Analysis (Sets and Sequences, Topology, Metric Spaces, Single-Valued and Continuous Functions Limits, Cauchy Kernel, Fourier Transforms)Information Theory (Entropy, Information Gain)Function Spaces and Manifolds2.Neural Network ArchitecturesNeural networks are the predefined set of algorithms for implementing machine learning tasks. They offer a class of models and play a key role in machine learning.The following are the key reasons why a machine learning enthusiast needs to be skilled in neural networks:Neural networks let one understand how the human brain works and help to model and simulate an artificial one.Neural networks give a deeper insight of parallel computations and sequential computationsThe following are the areas of neural networks that are important for machine learning:Perceptrons Convolutional Neural Networks Recurrent Neural NetworkLong/Short Term Memory Network (LSTM)Hopfield Networks Boltzmann Machine NetworkDeep Belief NetworkDeep Auto-encoders3.PhysicsHaving an idea of physics definitely helps a machine learning engineer. 
It makes a difference in designing complex systems and is a skill that is a definite bonus for a machine learning enthusiast.4.Data Modeling and EvaluationA machine learning has to work with huge amounts of data and leverage them into predictive analytics. Data modeling and evaluation is important in working with such bulky volumes of data and estimating how good the final model is.For this purpose, the following concepts are worth learnable for a machine learning engineer:Classification AccuracyLogarithmic LossConfusion MatrixArea under CurveF1 ScoreMean Absolute ErrorMean Squared Error5.Advanced Signal Processing TechniquesThe crux of signal processing is to minimize noise and extract the best features of a given signal.For this purpose, it uses certain concepts such as:convex/greedy optimization theory and algorithmsspectral time-frequency analysis of signalsAlgorithms such as wavelets, shearlets, curvelets, contourlets, bandlets, etc.All these concepts find their application in machine learning as well.6. Natural language processingThe importance of natural language processing in artificial intelligence and machine learning is not to be forgotten. Various libraries and techniques of natural language processing used in machine learning are listed here:Gensim and NLTKWord2vecSentiment analysisSummarization7. Audio and Video ProcessingThis differs from natural language processing in the sense that we can apply audio and video processing on audio signals only. For achieving this, the following concepts are essential for a machine learning engineer:Fourier transformsMusic theoryTensorFlow8. Reinforcement LearningThough reinforcement learning plays a major role in learning and understanding deep learning and artificial intelligence, it is good for a beginner of machine learning to know the basic concepts of reinforcement learning.Programming skills required to become ML EngineerMachine learning, ultimately, is coding and feeding the code to the machines and getting them to do the tasks we intend them to do. As such, a machine learning engineer should have hands-on expertise in software programming and related concepts. Here is a list of programming skills a machine learning engineer is expected to have knowledge on:Computer Science Fundamentals and ProgrammingSoftware Engineering and System DesignMachine Learning Algorithms and LibrariesDistributed computingUnixLet us look into each of these programming skills in detail now:1.Computer Science Fundamentals and ProgrammingIt is important that a machine learning engineer apply the concepts of computer science and programming correctly as the situation demands. The following concepts play an important role in machine learning and are a must on the list of the skillsets a machine learning engineer needs to have:Data structures (stacks, queues, multi-dimensional arrays, trees, graphs)Algorithms (searching, sorting, optimization, dynamic programming)Computability and complexity (P vs. 
NP, NP-complete problems, big-O notation, approximate algorithms, etc.)Computer architecture (memory, cache, bandwidth, deadlocks, distributed processing, etc.)2.Software Engineering and System DesignWhatever a machine learning engineer does, ultimately it is a piece of software code – a beautiful conglomerate of many essential concepts and the one that is entirely different from coding in other software languages.Hence, it is quintessential that a machine learning engineer have solid knowledge of the following areas of software programming and system design:Scaling algorithms with the size of dataBasic best practices of software coding and design, such as requirement analysis, version control, and testing.Communicating with different modules and components of work using library calls, REST APIs and querying through databases.Best measures to avoid bottlenecks and designing the final product such that it is user-friendly.3. Machine Learning Algorithms and LibrariesA machine learning engineer may need to work with multiple packages, libraries, algorithms as a part of day-to-day tasks. It is important that a machine learning engineer is well-versed with the following aspects of machine learning algorithms and libraries:A thorough idea of various learning procedures including linear regression, gradient descent, genetic algorithms, bagging, boosting, and other model-specific methods.Sound knowledge in packages and APIs such as scikit-learn, Theano, Spark MLlib, H2O, TensorFlow, etc.Expertise in models such as decision trees, nearest neighbor, neural net, support vector machine and a knack to deciding which one fits the best.Deciding and choosing hyperparameters that affect learning model and the outcome.Comfortable to work with concepts such as gradient descent, convex optimization, quadratic programming, partial differential equations.Select an algorithm which yields the best performance from random forests, support vector machines (SVMs), and Naive Bayes Classifiers, etc.4   Distributed Computing Working as a machine learning engineer means working with huge sets of data, not just focused on one isolated system, but spread among a cluster of systems. For this purpose, it is important that a machine learning engineer knows the concepts of distributed computing.5. UnixMost clusters and servers that machine learning engineers need to work are variants of Linux(Unix). Though randomly they work on Windows and Mac, more than half of the time, they need to work on Unix systems only. Hence having sound knowledge on Unix and Linux is a key skill to become a machine learning engineer.Programming Languages for Machine LearningMachine learning engineers need to code to train machines. Several programming languages can be used to do this. The list of programming languages that a machine learning expert should essentially know are as under:C, C++ and JavaSpark and HadoopR ProgrammingApache KafkaPythonWeka PlatformMATLAB/OctaveIn this section, let us know in detail why each of these programming languages is important for a machine learning engineer:1.C, C++ and JavaThese languages give essentials of programming and teach many concepts in a simple manner that form a foundation stone for working on complex programming patterns of machine learning. 
4. Distributed Computing
Working as a machine learning engineer means working with huge sets of data that are not confined to one isolated system but spread across a cluster of systems. For this reason, it is important that a machine learning engineer knows the concepts of distributed computing.

5. Unix
Most of the clusters and servers that machine learning engineers work on are variants of Linux (Unix). Although they occasionally work on Windows and Mac, more than half of the time they work on Unix-like systems, so sound knowledge of Unix and Linux is a key skill for becoming a machine learning engineer.

Programming Languages for Machine Learning
Machine learning engineers need to code to train machines, and several programming languages can be used to do this. The programming languages and platforms a machine learning expert should know are:
C, C++ and Java
Spark and Hadoop
R Programming
Apache Kafka
Python
Weka Platform
MATLAB/Octave
In this section, let us look at why each of these is important for a machine learning engineer.

1. C, C++ and Java
These languages teach the essentials of programming in a simple manner and form a foundation for the more complex programming patterns of machine learning. Knowledge of C++ helps to improve the speed of a program, while Java is needed to work with Hadoop, Hive, and other tools that are essential for a machine learning engineer.

2. Spark and Hadoop
Hadoop skills are needed for working in a distributed computing environment. Spark, a more recent framework from the same big-data ecosystem, is gaining popularity among the machine learning tribe; it makes it possible to implement machine learning at large scale.

3. R Programming
R is a programming language built by statisticians specifically for statistical computing. Many of the mathematical computations in machine learning are based on statistics, so it is no wonder that a machine learning engineer needs sound knowledge of R programming.

4. Apache Kafka
Apache Kafka concepts such as Kafka Streams and KSQL play a major role in the pre-processing of data for machine learning. Sound knowledge of Apache Kafka also lets a machine learning engineer design solutions that are multi-cloud or hybrid-cloud based, and it helps track operational measures such as latency and model accuracy that feed back into machine learning systems.

5. Python
Of late, Python has become the de facto programming language for machine learning; some experts even quip that humans communicate with machines through Python.

Why is Python preferred for Machine Learning?
Python has several key features and benefits that make it the leading programming language for machine learning:
It is a general-purpose language that can do a lot more than crunch statistics.
It is beginner-friendly and easy to learn.
It has rich libraries and APIs that address the various needs of machine learning quite easily.
Its productivity is higher than that of its counterparts.
It integrates easily and keeps the workflow smooth from the design stage to the production stage.

Python Ecosystem
Several components of Python make it a preferred language for machine learning. These components are discussed below:
Jupyter Notebook
NumPy
Pandas
Scikit-learn
TensorFlow

1. Jupyter Notebook
Jupyter offers an excellent computational environment for Python-based data science applications. With the help of a Jupyter notebook, a machine learning engineer can illustrate the flow of a process step by step very clearly.

2. NumPy
NumPy, or Numerical Python, is the component of Python that supports the following machine learning operations smoothly (a short sketch follows):
Fourier transformation
Linear algebraic operations
Logical and numerical operations on arrays
NumPy has also been gaining attention because it makes an excellent substitute for MATLAB, as it works with Matplotlib and SciPy very smoothly.
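As a quick, illustrative sketch of the operations listed above, using small arrays invented for the example:

```python
# Toy examples of the NumPy operations mentioned above.
import numpy as np

a = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

# Linear algebraic operations: solve a @ x = b and compute eigenvalues
x = np.linalg.solve(a, b)
eigenvalues = np.linalg.eigvals(a)

# Fourier transformation of a small signal
signal = np.sin(np.linspace(0.0, 2.0 * np.pi, 8))
spectrum = np.fft.fft(signal)

# Logical and numerical operations on arrays
mask = b > 1.5            # element-wise comparison produces a boolean array
total = a.sum()           # numerical reduction over all elements

print(x, eigenvalues, spectrum.shape, mask, total)
```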
3. Pandas
Pandas is a Python library that offers rich features for loading, manipulating, analyzing, modeling, and preparing data. It is dedicated entirely to data analysis and manipulation.

4. Scikit-learn
Built on NumPy, SciPy, and Matplotlib, scikit-learn is an open-source Python library. It offers excellent features and functionality for the major tasks of machine learning, such as clustering, dimensionality reduction, model selection, regression, and classification.

5. TensorFlow
TensorFlow is another Python framework. It is used for deep learning, and knowledge of it and of its high-level Keras API helps a machine learning engineer move ahead confidently in their career (a small sketch follows).
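As a rough illustration of the workflow, here is a minimal tf.keras model trained on randomly generated data; the architecture, data, and hyperparameters are arbitrary and meant only to show how the pieces fit together.

```python
# Minimal sketch of defining, compiling, and fitting a tf.keras model
# on random data generated purely for illustration.
import numpy as np
import tensorflow as tf

X = np.random.rand(100, 4).astype("float32")       # 100 samples, 4 made-up features
y = (X.sum(axis=1) > 2.0).astype("float32")        # arbitrary binary target

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=16, verbose=0)

print(model.evaluate(X, y, verbose=0))              # [loss, accuracy] on the toy data
```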
6. Weka Platform
Machine learning is a non-linear process that involves many iterations, and tooling that supports this workflow matters. Weka, the Waikato Environment for Knowledge Analysis, is a platform designed specifically for applied machine learning. The tool is steadily gaining popularity and is thus a must on the list of skills for a machine learning engineer.

7. MATLAB/Octave
MATLAB (and its open-source counterpart Octave) is a numerical computing environment long used for simulating engineering models. Although it is not widely used in machine learning itself, sound knowledge of MATLAB makes it easier to pick up the Python libraries mentioned above.

Soft skills or behavioural skills required to become an ML engineer
Technical skills are relevant only when they are paired with good soft skills, and the machine learning profession is no exception to this rule. Here is a list of soft skills that a machine learning engineer should have:
Domain knowledge
Communication skills
Problem-solving skills
Rapid prototyping
Time management
Love of constant learning
Let us discuss how each of these skills makes a difference to a machine learning engineer.

1. Domain knowledge
Machine learning is a subject that demands to be applied well in the real world. Choosing the best algorithm for a machine learning problem in academia is far removed from what you do in practice, where many aspects of the business come into the picture. A solid understanding of the business and of the domain in which machine learning is applied is therefore of utmost importance for succeeding as a machine learning engineer.

2. Communication skills
As a machine learning engineer, you need to communicate with offshore teams, clients, and other business teams. Excellent communication skills are a must to build your reputation and confidence and to present your work to peers.

3. Problem-solving skills
Machine learning is all about solving real-world challenges. One must have good problem-solving skills, be able to weigh the pros and cons of a given problem, and apply the best possible methods to solve it.

4. Rapid prototyping
Quickly choosing the right learning method or algorithm is a sign of good prototyping skills. These skills are a great saviour in practice, as they have a huge impact on the budget and the time taken to complete a machine learning project successfully.

5. Time management
Training a machine is not a cakewalk; it takes a great deal of time and patience, and machine learning engineers are not always allotted ample time to complete their tasks. Time management is therefore an essential skill for dealing effectively with bottlenecks and deadlines.

6. Love of constant learning
Since its inception, machine learning has witnessed massive change, both in the way it is implemented and in its final form. As we have seen in the previous sections, the technical and programming skills needed for machine learning are constantly evolving. To prove oneself a successful machine learning expert, it is crucial to have the zeal to keep updating oneself, constantly.

Conclusion
The skills one requires to begin a journey in machine learning are exactly what we have discussed in this post. The future of machine learning is undoubtedly bright, with companies ready to offer generous remuneration irrespective of country and location.

"Machine learning and deep learning will create a new set of hot jobs in the next five years." – Dave Waters

All it takes to have an amazing career in machine learning is a strong will to hone one's skills and gain solid knowledge of them. All the best for an amazing career in machine learning!