
What is Linear Regression in Machine Learning

Machine Learning, being a subset of Artificial Intelligence (AI), plays a growing role in our daily lives. Data science engineers and developers working in various domains widely use machine learning algorithms to make their tasks simpler and life easier. For example, machine learning algorithms enable Google Maps to find the fastest route to our destinations, allow Tesla to build self-driving cars, help Amazon generate almost 35% of its annual revenue, let AccuWeather forecast the weather for 3.5 million locations weeks in advance, and allow Facebook to automatically detect faces and suggest tags, and so on.

In statistics and machine learning, linear regression is one of the most popular and well understood algorithms. Most data science enthusiasts and machine learning practitioners begin their journey with linear regression. In this article, we will look into how the linear regression algorithm works and how it can be used efficiently in your machine learning projects to build better models.

Linear regression is a machine learning algorithm in which the result is predicted from known variables that are correlated with the output. It is used to predict values within a continuous range rather than classifying them into categories. The known variables are used to fit a line with a constant slope, which is then used to predict the unknown result.

What is a Regression Problem?

The majority of machine learning algorithms fall under the supervised learning category. This is the process where an algorithm learns to predict a result based on previously entered values and the results generated from them. Suppose we have an input variable 'x' and an output variable 'y', where y is a function of x (y = f(x)). Supervised learning reads the values of the variable 'x' and the corresponding variable 'y' so that it can later predict 'y' accurately from a new value of 'x'. A regression problem is one where the resulting variable is a real or continuous value; the algorithm tries to draw the line of best fit through the data gathered from a number of points.

For example, which of these is a regression problem?

- How much gas will I spend if I drive for 100 miles?
- What is the nationality of a person?
- What is the age of a person?
- Which is the closest planet to the Sun?

Predicting the amount of gas to be spent and the age of a person are regression problems. Predicting nationality is categorical, and the closest planet to the Sun is discrete.

What is Linear Regression?

Let's say we have a dataset which contains information about the relationship between 'number of hours studied' and 'marks obtained'. A number of students have been observed, and their hours of study along with their grades have been recorded. This will be our training data. Our goal is to design a model that can predict the marks if the number of hours studied is provided. Using the training data, a regression line is obtained which gives minimum error. This linear equation is then applied to new data.
That is, if we give the number of hours studied by a student as an input, our model should be able to predict their marks with minimum error.

Hypothesis of Linear Regression

The linear regression model can be represented by the following equation:

Y = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

where,
- Y is the predicted value
- θ₀ is the bias term
- θ₁, …, θₙ are the model parameters
- x₁, x₂, …, xₙ are the feature values

The above hypothesis can also be represented in vector form as Y = θᵀx, where θ is the model's parameter vector including the bias term θ₀, and x is the feature vector with x₀ = 1.

For a model with one predictor, this reduces to:

Y(pred) = b0 + b1*x

The values b0 and b1 must be chosen so that the error is minimum. If the sum of squared errors is taken as the metric to evaluate the model, then the goal is to obtain a line that best reduces this error. If we did not square the errors, the positive and negative errors would cancel each other out.

Exploring 'b1'
- If b1 > 0, then x (predictor) and y (target) have a positive relationship: an increase in x will increase y.
- If b1 < 0, then x (predictor) and y (target) have a negative relationship: an increase in x will decrease y.

Exploring 'b0'
- If the data does not include x = 0, then a prediction made with only b0 is meaningless. For example, suppose we have a dataset that relates height (x) and weight (y). Taking x = 0 (that is, height as 0) leaves the equation with only the b0 value, which is meaningless because in reality height and weight can never be zero. This happens when the model is used beyond the scope of its data.
- If the data does include x = 0, then b0 is the average of all predicted values when x = 0. But setting all predictor variables to zero is often impossible.
- The b0 term guarantees that the residuals have mean zero. If there is no b0 term, the regression is forced to pass through the origin, and both the regression coefficient and the predictions will be biased.

How does Linear Regression work?

Let's look at a scenario where linear regression might be useful: losing weight. Let us assume there is a connection between how many calories you take in and how much you weigh; regression analysis can help you understand that connection. Regression analysis provides a relation which can be visualized as a graph in order to make predictions about your data. For example, if you have been putting on weight over the last few years, it can predict how much you will weigh in the next ten years if you continue to consume the same amount of calories and burn them at the same rate.

The goal of regression analysis is to create a trend line based on the data you have gathered. This then allows you to determine whether other factors apart from the amount of calories consumed affect your weight, such as the number of hours you sleep, work pressure, level of stress, the type of exercise you do and so on. Before taking these factors into account, we need to look at them and determine whether there is a correlation between them. Linear regression can then be used to draw a trend line which can confirm or deny the relationship between attributes. If the test is done over a long time duration, extensive data can be collected and the result can be evaluated more accurately. By the end of this article we will have built such a model, i.e. determined the line which best fits the data.

How do we determine the best fit line?

The best fit line is the line for which the error between the predicted values and the observed values is minimum. It is also called the regression line, and the errors are known as residuals. A residual can be visualized as the vertical distance from an observed data point to the regression line.
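To make the hypothesis concrete, here is a minimal Python sketch that evaluates Y = θᵀx with x₀ = 1 and computes the residuals for a single-predictor model. The data and parameter values are made up purely for illustration:

import numpy as np

# Hypothetical observations: hours studied (x) and marks obtained (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([12.0, 19.0, 33.0, 41.0, 48.0])

# Assumed parameters [b0, b1], chosen only for illustration
theta = np.array([2.0, 9.5])

# Add the constant feature x0 = 1 so each prediction is a single dot product
X = np.column_stack([np.ones_like(x), x])   # shape (5, 2)
y_pred = X @ theta                          # Y = theta^T x for every row

residuals = y - y_pred                      # vertical distances to the line
print(y_pred)
print(residuals, residuals.mean())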
When to use Linear Regression?

Linear regression's power lies in its simplicity, which means that it can be used to solve problems across various fields. First, the data collected from the observations needs to be plotted. If the points fall roughly along a line, i.e. the difference between the values predicted by a line and the observed values stays small, we can use linear regression for the problem.

Assumptions in linear regression

If you are planning to use linear regression for your problem, there are some assumptions you need to consider:
- The relation between the dependent and independent variables should be almost linear.
- The data is homoscedastic, meaning the variance of the residuals should be roughly constant.
- The results obtained from an observation should not be influenced by the results obtained from the previous observation.
- The residuals should be normally distributed. This assumption means that the probability density function of the residual values is normally distributed at each value of the independent variable.

You can determine whether your data meets these conditions by plotting it and then doing a bit of digging into its structure.

Few properties of the Regression Line

Here are a few properties a regression line has:
- The regression line passes through the mean of the independent variable (x) as well as the mean of the dependent variable (y).
- The regression line minimizes the sum of the squared residuals. That is why this method of linear regression is known as Ordinary Least Squares (OLS). We will discuss Ordinary Least Squares in more detail later on.
- B1 explains the change in Y for a one-unit change in x. In other words, if we increase the value of 'x', it will result in a change in the value of Y.

Finding a Linear Regression line

Let's say we want to predict 'y' from 'x' given in the following table, and assume they are related as y = B0 + B1*x.

x     y     Predicted 'y'
1     2     B0 + B1*1
2     1     B0 + B1*2
3     3     B0 + B1*3
4     6     B0 + B1*4
5     9     B0 + B1*5
6     11    B0 + B1*6
7     13    B0 + B1*7
8     15    B0 + B1*8
9     17    B0 + B1*9
10    20    B0 + B1*10

where,

Std. Dev. of x              3.02765
Std. Dev. of y              6.617317
Mean of x                   5.5
Mean of y                   9.7
Correlation between x & y   0.989938

If the Residual Sum of Squares (RSS) is differentiated with respect to B0 and B1 and the results are equated to zero, we get the following equations:

B1 = Correlation * (Std. Dev. of y / Std. Dev. of x)
B0 = Mean(y) – B1 * Mean(x)

Putting the values from the table above into these equations,

B1 = 0.989938 * (6.617317 / 3.02765) ≈ 2.16
B0 = 9.7 – 2.16 * 5.5 ≈ -2.2

Hence, the least squares regression equation becomes:

Y = -2.2 + 2.16*x

x     Y - Actual    Y - Predicted
1     2             -0.04
2     1             2.12
3     3             4.28
4     6             6.44
5     9             8.60
6     11            10.76
7     13            12.92
8     15            15.08
9     17            17.24
10    20            19.40

As there are only 10 data points, the results are not too accurate, but the correlation between the predicted and actual values turns out to be very high; the two series move almost together.
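As a quick check, the slope and intercept above can be reproduced in a few lines of Python; this sketch simply applies the two closed-form equations to the data in the table:

import numpy as np

# Data from the table above
x = np.arange(1, 11)
y = np.array([2, 1, 3, 6, 9, 11, 13, 15, 17, 20])

# B1 = correlation * (std of y / std of x), B0 = mean(y) - B1 * mean(x)
r = np.corrcoef(x, y)[0, 1]
b1 = r * (np.std(y, ddof=1) / np.std(x, ddof=1))
b0 = y.mean() - b1 * x.mean()

print(round(b1, 2), round(b0, 2))   # approximately 2.16 and -2.2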
Model Performance

Once the model is built, if the difference between the predicted and actual values is small, it is considered to be a good model and can be used to make future predictions. How small is "small enough" depends entirely on the task you want to perform and on how much variation in the data can be tolerated. Here are a few metrics we can use to quantify the error in the model.

R-Squared (R²)

Total Sum of Squares (TSS): The total sum of squares is a quantity that appears as part of a standard way of presenting the results of such an analysis. A sum of squares measures how a data set varies around a central number (such as the mean). The Total Sum of Squares tells us how much variation there is in the dependent variable:

TSS = Σ (Y – mean(Y))²

Residual Sum of Squares (RSS): The residual sum of squares tells you how much of the dependent variable's variation your model did not explain. It is the sum of the squared differences between the actual Y and the predicted Y:

RSS = Σ (Y – predicted(Y))²

(TSS – RSS) measures the amount of variability in the response that is explained by performing the regression, so R² = (TSS – RSS) / TSS = 1 – RSS/TSS.

Properties of R²
- R² always ranges between 0 and 1.
- An R² of 0 means that there is no correlation between the dependent and the independent variable.
- An R² of 1 means the dependent variable can be predicted from the independent variable without any error.
- An R² between 0 and 1 indicates the extent to which the dependent variable is predictable: an R² of 0.20 means that 20% of the variance in Y is predictable from X, an R² of 0.40 means that 40% is predictable, and so on.

Root Mean Square Error (RMSE)

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). The formula for calculating RMSE is:

RMSE = √( Σ (Y – predicted(Y))² / N )

where N is the total number of observations. When standardized observations are used as RMSE inputs, there is a direct relationship with the correlation coefficient. For example, if the correlation coefficient is 1, the RMSE will be 0, because all of the points lie on the regression line (and therefore there are no errors).

Mean Absolute Percentage Error (MAPE)

RMSE has certain limitations, so analysts often prefer MAPE, which expresses the error as a percentage, making it easier to compare how different models perform on the same task. The formula for calculating MAPE is:

MAPE = (100 / N) * Σ |Y – predicted(Y)| / |Y|

where N is the total number of observations.
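For reference, here is a small sketch of how these three metrics could be computed with NumPy for any pair of actual and predicted arrays; the example arrays are invented purely to exercise the formulas:

import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.5, 10.0, 12.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.3, 12.5])

tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares

r_squared = 1 - rss / tss
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
mape = 100 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

print(r_squared, rmse, mape)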
Feature Selection

Feature selection is the automatic selection of the attributes in your data that are most relevant to the predictive model you are working on. It seeks to reduce the number of attributes in the dataset by eliminating the features which are not required for the model construction. Feature selection does not totally eliminate an attribute which is considered for the model; rather, it mutes that particular characteristic and works with the features which affect the model.

Feature selection aids your mission to create an accurate predictive model. It helps you by choosing features that will give you as good or better accuracy whilst requiring less data. Feature selection methods can be used to identify and remove unnecessary, irrelevant and redundant attributes from the data that do not contribute to the accuracy of the model or may even decrease it. Having fewer attributes is desirable because it reduces the complexity of the model, and a simpler model is easier to understand, explain and work with.

Feature Selection Algorithms:
- Filter Method: This method involves assigning scores to individual features and ranking them. The features that have very little to almost no impact are removed from consideration while constructing the model.
- Wrapper Method: The wrapper method is quite similar to the filter method, except that it considers attributes in groups: a number of attributes are taken together and checked for whether they have an impact on the model, and if not, another combination is tried.
- Embedded Method: The embedded method is the best and most accurate of these approaches. It learns the features that affect the model while the model is being constructed and takes into consideration only those features. The most common type of embedded feature selection methods are regularization methods.

Cost Function

The cost function helps to figure out the best possible coefficient values, which can be used to draw the line of best fit for the data points. Since we want to reduce the error of the result, we turn the problem of finding the line into the problem of minimizing the error between the predicted values and the actual values:

J = (1/n) * Σ (predicted(Y) – Y)²

Here, J is the cost function. The function is written in this form to measure the error between the predicted values and the observed values: we take the sum of the squared differences over all data points and divide it by the total number of data points. This cost function J is also called the Mean Squared Error (MSE) function. Using this MSE function, we are going to choose parameter values such that the MSE settles at its minimum, reducing the cost function.

Gradient Descent

Gradient descent is an optimization algorithm that helps machine learning models find a path to a minimum value using repeated steps. Gradient descent is used to minimize a function so that it gives the lowest output of that function. This function is called the loss function. The loss function shows us how much error is produced by the machine learning model compared to the actual results. Our aim is to lower the cost function as much as possible, and one way of achieving a low cost function is through gradient descent. When the equations are too complex to solve directly, the partial derivative of the cost function with respect to each parameter tells us how to update that coefficient towards its optimal value. You may refer to the article on Gradient Descent for Machine Learning.
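To illustrate the idea, here is a minimal sketch of gradient descent for the single-predictor model Y(pred) = b0 + b1*x with the MSE cost above. The data, learning rate and number of iterations are arbitrary choices for the example, not values from this article:

import numpy as np

# Toy data roughly following y = 2 + 3x, for illustration only
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.9, 8.2, 10.9, 14.1])

b0, b1 = 0.0, 0.0    # start from zero coefficients
lr = 0.02            # learning rate (assumed)
n = len(x)

for _ in range(5000):
    y_pred = b0 + b1 * x
    error = y_pred - y
    # Partial derivatives of MSE J = (1/n) * sum(error^2)
    grad_b0 = (2 / n) * error.sum()
    grad_b1 = (2 / n) * (error * x).sum()
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

print(b0, b1)   # should approach the least squares estimates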
Simple Linear Regression

Optimization is a big part of machine learning, and almost every machine learning algorithm has an optimization technique at its core for increased efficiency. Gradient descent is one such optimization algorithm, used to find the values of a function's coefficients that minimize the cost function. Gradient descent is best applied when the solution cannot be obtained by analytical methods (linear algebra) and must be obtained by an optimization technique.

Residual Analysis: Simple linear regression models the relationship between the magnitude of one variable and that of a second, for example, as x increases, y also increases, or as x increases, y decreases. Correlation is another way to measure how two variables are related. The models produced by simple linear regression estimate or try to predict the actual result, but most often they deviate from it. Residual analysis is used to calculate by how much the estimated value has deviated from the actual result.

Null Hypothesis and p-value: During feature selection, the null hypothesis is used to find which attributes will not affect the result of the model. Hypothesis tests are used to test the validity of a claim that is made about a particular attribute of the model. This claim that is on trial, in essence, is called the null hypothesis. A p-value helps to determine the significance of the results. A p-value is a number between 0 and 1 and is interpreted in the following way:
- A small p-value (less than 0.05) indicates strong evidence against the null hypothesis, so the null hypothesis is rejected.
- A large p-value (greater than 0.05) indicates weak evidence against the null hypothesis, so the null hypothesis is retained.
- A p-value very close to the cut-off (around 0.05) is considered marginal (it could go either way). In this case, the p-value should be reported so that readers can draw their own conclusions.

Ordinary Least Squares

Ordinary Least Squares (OLS), also known as ordinary least squares regression or least squared errors regression, is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function so as to minimize the sum of the squared differences between the observed values of the dependent variable and the values predicted by the linear function. There are two types of relationship that may occur: linear and curvilinear. A linear relationship is a straight line drawn through the central tendency of the points, whereas a curvilinear relationship is a curved line. The association between the variables is depicted using a scatter plot. The relationship can be positive or negative, and it can also vary in strength.

The advantage of Ordinary Least Squares regression is that it is easy to interpret and relies on standard linear algebra routines that are built into modern computers, so it scales efficiently to problems with many independent variables and thousands of data points. In linear regression, OLS is used to estimate the unknown parameters by creating a model which minimizes the sum of the squared errors between the observed data and the predicted values.

Let us simulate some data and look at how the predicted values (Yₑ) differ from the actual values (Y):

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

# Generate 'random' data
np.random.seed(0)
X = 2.5 * np.random.randn(100) + 1.5   # Array of 100 values with mean = 1.5, stddev = 2.5
res = 0.5 * np.random.randn(100)       # Generate 100 residual terms
y = 2 + 0.3 * X + res                  # Actual values of Y

# Create pandas dataframe to store our X and y values
df = pd.DataFrame({'X': X, 'y': y})

# Show the first five rows of our dataframe
df.head()

          X         y
0  5.910131  4.714615
1  2.500393  2.076238
2  3.946845  2.548811
3  7.102233  4.615368
4  6.168895  3.264107
To estimate y using the OLS method, we need to calculate xmean and ymean, the covariance of X and y (xycov), and the variance of X (xvar) before we can determine the values for alpha and beta.

# Calculate the mean of X and y
xmean = np.mean(X)
ymean = np.mean(y)

# Calculate the terms needed for the numerator and denominator of beta
df['xycov'] = (df['X'] - xmean) * (df['y'] - ymean)
df['xvar'] = (df['X'] - xmean)**2

# Calculate beta and alpha
beta = df['xycov'].sum() / df['xvar'].sum()
alpha = ymean - (beta * xmean)
print(f'alpha = {alpha}')
print(f'beta = {beta}')

alpha = 2.0031670124623426
beta = 0.3229396867092763

Now that we have an estimate for alpha and beta, we can write our model as Yₑ = 2.003 + 0.323 X and make predictions:

ypred = alpha + beta * X

Let's plot our prediction ypred against the actual values of y, to get a better visual understanding of our model.

# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(X, ypred)     # regression line
plt.plot(X, y, 'ro')   # scatter plot showing actual data
plt.title('Actual vs Predicted')
plt.xlabel('X')
plt.ylabel('y')
plt.show()

The blue line in the resulting graph is our line of best fit, Yₑ = 2.003 + 0.323 X. If you observe the graph carefully, you will notice that there is a linear relationship between X and y. Using this model, we can predict Y for any value of X. For example, for X = 8,

Yₑ = 2.003 + 0.323 * 8 = 4.587

Regularization

Regularization is a technique used to shrink the coefficient estimates towards zero. This reduces the influence of patterns that do not reflect the true properties of the underlying relationship but have appeared in the training data by random chance. Earlier we saw that to estimate the regression coefficients β with the least squares method, we minimize the Residual Sum of Squares (RSS). In this case the RSS can be written as:

RSS = Σᵢ (yᵢ – β₀ – Σⱼ βⱼxᵢⱼ)²

The general linear regression model can be expressed using the condensed formula:

y = β₀ + β₁x₁ + … + βₚxₚ

Here, β = [β₀, β₁, …, βₚ].

Least squares adjusts the coefficients β based on the training data. If the fitted coefficients follow the training data too closely, the estimates will not generalize well to future data. This is where regularization comes in and shrinks, or regularizes, these learned estimates towards zero.

Ridge regression

Ridge regression is very similar to least squares, except that the ridge coefficients are estimated by minimizing a slightly different quantity. In particular, the ridge regression coefficients β are the values that minimize:

RSS + λ Σⱼ βⱼ²

Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of the model. λ controls the relative impact of the two components, the RSS and the penalty term. If λ = 0, ridge regression produces the same result as the least squares method; as λ → ∞, all estimated coefficients tend to zero. Ridge regression produces different estimates for different values of λ, so the choice of λ is crucial and should be made with cross-validation. Because the ridge penalty is the squared L2 norm of the coefficient vector, this approach is also known as L2 regularization.

The coefficients generated by the Ordinary Least Squares method are scale equivariant: if an input variable is multiplied by a constant, the corresponding coefficient is divided by the same constant, so the product of the coefficient and the input variable remains the same. The same is not true for ridge regression, so we need to bring the variables to the same scale before fitting. To standardize the variables, we subtract their means and divide by their standard deviations.

Lasso Regression

Least Absolute Shrinkage and Selection Operator (LASSO) regression also shrinks the coefficients by adding a penalty to the sum of squares of the residuals, but the lasso penalty has a slightly different effect. The lasso penalty is the sum of the absolute values of the coefficients, which corresponds to the L1 norm of the coefficient vector. Hence, the lasso estimate is defined by minimizing:

RSS + λ Σⱼ |βⱼ|

Similar to ridge regression, the input variables need to be standardized. The lasso penalty makes the solution nonlinear in the data, and there is no closed-form expression for the coefficients as there is in ridge regression. Instead, the lasso solution is a quadratic programming problem, and efficient algorithms are available that compute the entire path of coefficients for different values of λ with the same computational cost as for ridge regression. The lasso penalty has the effect of gradually shrinking some coefficients exactly to zero as the regularization strength increases. For this reason, the lasso can be used for the continuous selection of a subset of features.
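Since the later examples in this article use scikit-learn, here is an illustrative sketch of what ridge and lasso fits could look like. The synthetic data, the alpha values (scikit-learn's name for λ) and the pipeline choices are assumptions made for the example, not prescriptions:

import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 features, only the first two actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Standardize the inputs before penalized regression, as discussed above
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))

ridge.fit(X, y)
lasso.fit(X, y)

print(ridge[-1].coef_)   # all coefficients shrunk, none exactly zero
print(lasso[-1].coef_)   # irrelevant coefficients are typically driven to exactly zero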
Linear Regression with multiple variables

Linear regression with multiple variables is also known as "multivariate linear regression". We now introduce notation for equations where we can have any number of input variables:

- x⁽ⁱ⁾ⱼ = value of feature j in the iᵗʰ training example
- x⁽ⁱ⁾ = the input (features) of the iᵗʰ training example
- m = the number of training examples
- n = the number of features

The multivariable form of the hypothesis function accommodating these multiple features is as follows:

hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ + ⋯ + θₙxₙ

To develop intuition about this function, we can think of θ₀ as the basic price of a house, θ₁ as the price per square meter, θ₂ as the price per floor, and so on; x₁ would then be the number of square meters in the house, x₂ the number of floors, etc.

Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

hθ(x) = θᵀx

This is a vectorization of our hypothesis function for one training example; see the lessons on vectorization to learn more. Note that for convenience we assume x₀⁽ⁱ⁾ = 1 for i ∈ 1, …, m. This allows us to do matrix operations with θ and x, making the two vectors θ and x⁽ⁱ⁾ match element-wise (that is, have the same number of elements: n + 1).

Multiple Linear Regression

How is it different? In simple linear regression we use a single independent variable to predict the value of a dependent variable, whereas in multiple linear regression two or more independent variables are used to predict the value of a dependent variable. The difference between the two is the number of independent variables; in both cases there is only a single dependent variable.

Multicollinearity

Multicollinearity tells us the strength of the relationships among the independent variables. Multicollinearity is a state of very high intercorrelations or inter-associations among the independent variables. It is therefore a type of disturbance in the data, and if it is present, the statistical inferences made about the data may not be reliable. The VIF (Variance Inflation Factor) is used to identify multicollinearity: if a variable's VIF value is greater than 4, we exclude that variable from our model.

There are certain reasons why multicollinearity occurs:
- It is caused by an inaccurate use of dummy variables.
- It is caused by the inclusion of a variable which is computed from other variables in the data set.
- It can also result from the repetition of the same kind of variable.
- It generally occurs when the variables are highly correlated with each other.

Multicollinearity can result in several problems:
- The partial regression coefficients may not be estimated precisely, and the standard errors are likely to be high.
- Multicollinearity results in a change in the signs as well as in the magnitudes of the partial regression coefficients from one sample to another.
- Multicollinearity makes it tedious to assess the relative importance of the independent variables in explaining the variation in the dependent variable.
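As an aside, the VIF check mentioned above can be carried out with the statsmodels function variance_inflation_factor. The sketch below assumes the predictors are held in a pandas DataFrame (for example, the advertising data loaded in a later section):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X_df: pd.DataFrame) -> pd.Series:
    """Return the VIF of every predictor column in X_df."""
    X = sm.add_constant(X_df)   # VIF is computed on a design matrix with an intercept
    vifs = {col: variance_inflation_factor(X.values, i)
            for i, col in enumerate(X.columns) if col != 'const'}
    return pd.Series(vifs, name='VIF')

# Example usage with the advertising data loaded later in this article:
# print(vif_table(advert[['TV', 'Radio', 'Newspaper']]))
# Any predictor with a VIF above 4 would be a candidate for removal.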
Iterative Models

Models should be tested and upgraded again and again for better performance. Multiple iterations allow the model to learn from its previous results and take them into consideration while performing the task again.

Making predictions with Linear Regression

Linear regression can be used to predict the value of an unknown variable from a known variable with the help of a straight line (the regression line). The prediction should only be made if both the correlation coefficient and a scatterplot show a significant correlation between the known and the unknown variable.

The general procedure for using regression to make good predictions is the following:
- Research the subject area so that the model can be built based on the results produced by similar models. This research helps with the subsequent steps.
- Collect data for appropriate variables which have some correlation with the model.
- Specify and assess the regression model.
- Run repeated tests so that the model has more data to work with.

To test whether the model is good enough, observe whether:
- The scatter plot forms a linear pattern.
- The correlation coefficient r has a value above 0.5 or below -0.5. A positive value indicates a positive relationship and a negative value represents a negative relationship.

If the correlation coefficient shows a strong relationship between the variables but the scatter plot is not linear, the results can be misleading. Examples of how to use linear regression have been shown earlier.

Data preparation for Linear Regression

Step 1: Linear Assumption
The first step of data preparation is checking for variables which have some sort of linear correlation between the dependent and the independent variables.

Step 2: Remove Noise
This is the process of reducing the number of attributes in the dataset by eliminating the features which contribute very little to the construction of the model.

Step 3: Remove Collinearity
Collinearity tells us the strength of the relationships among independent variables. If two or more variables are highly collinear, it does not make sense to keep all of them while evaluating the model, so we can keep just one of them.

Step 4: Gaussian Distributions
The linear regression model will produce more reliable results if the input and output variables have a Gaussian distribution. (The central limit theorem states that a sample mean is approximately normal, or Gaussian, with the same mean as the underlying population and variance equal to the population variance divided by the sample size; the approximation improves as the sample size gets large.)

Step 5: Rescale Inputs
The linear regression model will produce more reliable predictions if the input variables are rescaled using standardization or normalization, as shown in the sketch below.
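A minimal sketch of that rescaling step, assuming the predictors are held in a NumPy array, could look like this (scikit-learn's StandardScaler and MinMaxScaler do the same job on DataFrames):

import numpy as np

# Hypothetical feature matrix: rows are observations, columns are features
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [4.0, 500.0]])

# Standardization: zero mean, unit variance per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalization: rescale each column to the [0, 1] range
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_std)
print(X_norm)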
Linear Regression with statsmodels

We have already discussed the OLS method; now we will see how to use OLS via the statsmodels library. For this we will be using the popular advertising dataset. Here, we will only be looking at the TV variable and exploring whether spending on TV advertising can predict the number of sales for the product. Let's start by importing this csv file as a pandas dataframe using read_csv():

# Import and display first five rows of advertising dataset
advert = pd.read_csv('advertising.csv')
advert.head()

      TV  Radio  Newspaper  Sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3   12.0
3  151.5   41.3       58.5   16.5
4  180.8   10.8       58.4   17.9

Now we will use statsmodels' OLS function to initialize a simple linear regression model. It takes the formula y ~ X, where X is the predictor variable (TV advertising costs) and y is the output variable (Sales). Then, we fit the model by calling the OLS object's fit() method.

import statsmodels.formula.api as smf

# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('Sales ~ TV', data=advert)
model = model.fit()

Once we have fit the simple regression model, we can predict the values of sales based on the equation we just derived using the .predict() method, and also visualise our regression model by plotting sales_pred against the TV advertising costs to find the line of best fit.

# Predict values
sales_pred = model.predict()

# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(advert['TV'], advert['Sales'], 'o')           # scatter plot showing actual data
plt.plot(advert['TV'], sales_pred, 'r', linewidth=2)   # regression line
plt.xlabel('TV Advertising Costs')
plt.ylabel('Sales')
plt.title('TV vs Sales')
plt.show()

In the resulting graph you will see that there is a positive linear relationship between TV advertising costs and Sales. You could also summarize this by saying that spending more on TV advertising predicts a higher number of sales.
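The fitted statsmodels results object also exposes the diagnostics discussed earlier in this article, so (assuming the model fitted above) you can inspect R² and the coefficient p-values directly:

# Inspect the fitted model's diagnostics (R-squared, coefficients, p-values)
print(model.summary())    # full regression table

print(model.rsquared)     # R-squared of the fit
print(model.params)       # intercept and TV coefficient
print(model.pvalues)      # p-values for each coefficient

A small p-value for the TV coefficient would be evidence against the null hypothesis that TV spend has no effect on Sales.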
Linear Regression with scikit-learn

Let us now learn to implement a linear regression model using sklearn. For this model we will continue to use the advertising dataset, but this time we will use two predictor variables to create a multiple linear regression model: Yₑ = α + β₁X₁ + β₂X₂ + … + βₚXₚ, where p is the number of predictors. In our example, we will be predicting Sales using the variables TV and Radio, i.e. our model can be written as:

Sales = α + β₁*TV + β₂*Radio

from sklearn.linear_model import LinearRegression

# Build linear regression model using TV and Radio as predictors
# Split data into predictors X and output y
predictors = ['TV', 'Radio']
X = advert[predictors]
y = advert['Sales']

# Initialise and fit model
lm = LinearRegression()
model = lm.fit(X, y)

print(f'alpha = {model.intercept_}')
print(f'betas = {model.coef_}')

alpha = 4.630879464097768
betas = [0.05444896 0.10717457]

model.predict(X)

Now that we have fit a multiple linear regression model to our data, we can predict sales from any combination of TV and Radio advertising costs. For example, suppose you want to know how many sales we would make if we invested $600 in TV advertising and $300 in Radio advertising. You can simply find out as follows:

new_X = [[600, 300]]
print(model.predict(new_X))

[69.4526273]

We get an output of 69.45, which means that if we invest $600 in TV and $300 in Radio advertising, we can expect to sell approximately 69 units.

Summary

Let us sum up what we have covered in this article so far:
- How to understand a regression problem
- What linear regression is and how it works
- The Ordinary Least Squares method and regularization
- Implementing linear regression in Python using the statsmodels and sklearn libraries

We have discussed a couple of ways to implement linear regression and build efficient models for certain business problems. If you are inspired by the opportunities provided by machine learning, enrol in our Data Science and Machine Learning Courses for more lucrative career options in this landscape.
What is Linear Regression in Machine Learning
Priyankur
Rated 4.5/5 based on 4 customer reviews

What is Linear Regression in Machine Learning

Machine Learning, being a subset of Artificial Intelligence (AI), has been playing a dominant role in our daily lives. Data science engineers and developers working in various domains are widely using machine learning algorithms to make their tasks simpler and life easier. For example, certain machine learning algorithms enable Google Maps to find the fastest route to our destinations, allow Tesla to make driverless cars, help Amazon to generate almost 35% of their annual income, AccuWeather to get the weather forecast of 3.5 million locations weeks in advance, Facebook to automatically detect faces and suggest tags and so on.In statistics and machine learning, linear regression is one of the most popular and well understood algorithms. Most data science enthusiasts and machine learning  fanatics begin their journey with linear regression algorithms. In this article, we will look into how linear regression algorithm works and how it can be efficiently used in your machine learning projects to build better models.Linear Regression is one of the machine learning algorithms where the result is predicted by the use of known parameters which are correlated with the output. It is used to predict values within a continuous range rather than trying to classify them into categories. The known parameters are used to make a continuous and constant slope which is used to predict the unknown or the result.What is a Regression Problem?Majority of the machine learning algorithms fall under the supervised learning category. It is the process where an algorithm is used to predict a result based on the previously entered values and the results generated from them. Suppose we have an input variable ‘x’ and an output variable ‘y’ where y is a function of x (y=f{x}). Supervised learning reads the value of entered variable ‘x’ and the resulting variable ‘y’ so that it can use those results to later predict a highly accurate output data of ‘y’ from the entered value of ‘x’. A regression problem is when the resulting variable contains a real or a continuous value. It tries to draw the line of best fit from the data gathered from a number of points.For example, which of these is a regression problem?How much gas will I spend if I drive for 100 miles?What is the nationality of a person?What is the age of a person?Which is the closest planet to the Sun?Predicting the amount of gas to be spent and the age of a person are regression problems. Predicting nationality is categorical and the closest planet to the Sun is discrete.What is Linear Regression?Let’s say we have a dataset which contains information about the relationship between ‘number of hours studied’ and ‘marks obtained’. A number of students have been observed and their hours of study along with their grades are recorded. This will be our training data. Our goal is to design a model that can predict the marks if number of hours studied is provided. Using the training data, a regression line is obtained which will give minimum error. This linear equation is then used to apply for a new data. 
That is, if we give the number of hours studied by a student as an input, our model should be able to predict their mark with minimum error.Hypothesis of Linear RegressionThe linear regression model can be represented by the following equation:where,Y is the predicted valueθ₀ is the bias term.θ₁,…,θn are the model parametersx₁, x₂,…,xn are the feature values.The above hypothesis can also be represented byWhere, θ is the model’s parameter vector including the bias term θ₀; x is the feature vector with x₀ =1Y (pred) = b0 + b1*xThe values b0 and b1 must be chosen so that the error is minimum. If sum of squared error is taken as a metric to evaluate the model, then the goal is to obtain a line that best reduces the error.If we don’t square the error, then the positive and negative points will cancel each other out.For a model with one predictor,Exploring ‘b1’If b1 > 0, then x (predictor) and y(target) have a positive relationship. That is an increase in x will increase y.If b1 < 0, then x (predictor) and y(target) have a negative relationship. That is an increase in x will decrease y.Exploring ‘b0’If the model does not include x=0, then the prediction will become meaningless with only b0. For example, we have a dataset that relates height(x) and weight(y). Taking x=0 (that is height as 0), will make the equation have only b0 value which is completely meaningless as in real-time height and weight can never be zero. This resulted due to considering the model values beyond its scope.If the model includes value 0, then ‘b0’ will be the average of all predicted values when x=0. But, setting zero for all the predictor variables is often impossible.The value of b0 guarantees that the residual will have mean zero. If there is no ‘b0’ term, then the regression will be forced to pass over the origin. Both the regression coefficient and prediction will be biased.How does Linear Regression work?Let’s look at a scenario where linear regression might be useful: losing weight. Let us consider that there’s a connection between how many calories you take in and how much you weigh; regression analysis can help you understand that connection. Regression analysis will provide you with a relation which can be visualized into a graph in order to make predictions about your data. For example, if you’ve been putting on weight over the last few years, it can predict how much you’ll weigh in the next ten years if you continue to consume the same amount of calories and burn them at the same rate.The goal of regression analysis is to create a trend line based on the data you have gathered. This then allows you to determine whether other factors apart from the amount of calories consumed affect your weight, such as the number of hours you sleep, work pressure, level of stress, type of exercises you do etc. Before taking into account, we need to look at these factors and attributes and determine whether there is a correlation between them. Linear Regression can then be used to draw a trend line which can then be used to confirm or deny the relationship between attributes. If the test is done over a long time duration, extensive data can be collected and the result can be evaluated more accurately. By the end of this article we will build a model which looks like the below picture i.e, determine a line which best fits the data.How do we determine the best fit line?The best fit line is considered to be the line for which the error between the predicted values and the observed values is minimum. 
It is also called the regression line and the errors are also known as residuals. The figure shown below shows the residuals. It can be visualized by the vertical lines from the observed data value to the regression line.When to use Linear Regression?Linear Regression’s power lies in its simplicity, which means that it can be used to solve problems across various fields. At first, the data collected from the observations need to be collected and plotted along a line. If the difference between the predicted value and the result is almost the same, we can use linear regression for the problem.Assumptions in linear regressionIf you are planning to use linear regression for your problem then there are some assumptions you need to consider:The relation between the dependent and independent variables should be almost linear.The data is homoscedastic, meaning the variance between the results should not be too much.The results obtained from an observation should not be influenced by the results obtained from the previous observation.The residuals should be normally distributed. This assumption means that the probability density function of the residual values is normally distributed at each independent value.You can determine whether your data meets these conditions by plotting it and then doing a bit of digging into its structure.Few properties of Regression LineHere are a few features a regression line has:Regression passes through the mean of independent variable (x) as well as mean of the dependent variable (y).Regression line minimizes the sum of “Square of Residuals”. That’s why the method of Linear Regression is known as “Ordinary Least Square (OLS)”. We will discuss more in detail about Ordinary Least Square later on.B1 explains the change in Y with a change in x  by one unit. In other words, if we increase the value of ‘x’ it will result in a change in value of Y.Finding a Linear Regression lineLet’s say we want to predict ‘y’ from ‘x’ given in the following table and assume they are correlated as “y=B0+B1∗x”xyPredicted 'y'12Β0+B1∗121Β0+B1∗233Β0+B1∗346Β0+B1∗459Β0+B1∗5611Β0+B1∗6713Β0+B1∗7815Β0+B1∗8917Β0+B1∗91020Β0+B1∗10where,Std. Dev. of x3.02765Std. Dev. of y6.617317Mean of x5.5Mean of y9.7Correlation between x & y0.989938If the Residual Sum of Square (RSS) is differentiated with respect to B0 & B1 and the results equated to zero, we get the following equation:B1 = Correlation * (Std. Dev. of y/ Std. Dev. of x)B0 = Mean(Y) – B1 * Mean(X)Putting values from table 1 into the above equations,B1 = 2.64B0 = -2.2Hence, the least regression equation will become –Y = -2.2 + 2.64*xxY - ActualY - Predicted120.44213.08335.72468.36591161113.6471316.2881518.9291721.56102024.2As there are only 10 data points, the results are not too accurate but if we see the correlation between the predicted and actual line, it has turned out to be very high; both the lines are moving almost together and here is the graph for visualizing our predicted values:Model PerformanceAfter the model is built, if we see that the difference in the values of the predicted and actual data is not much, it is considered to be a good model and can be used to make future predictions. The amount that we consider “not much” entirely depends on the task you want to perform and to what percentage the variation in data can be handled. 
Here are a few metric tools we can use to calculate error in the model-R – Square (R2)Total Sum of Squares (TSS): total sum of squares (TSS) is a quantity that appears as part of a standard way of presenting results of such an analysis. Sum of squares is a measure of how a data set varies around a central number (like the mean). The Total Sum of Squares tells how much variation there is in the dependent variable.TSS = Σ (Y – Mean[Y])2Residual Sum of Squares (RSS): The residual sum of squares tells you how much of the dependent variable’s variation your model did not explain. It is the sum of the squared differences between the actual Y and the predicted Y.RSS = Σ (Y – f[Y])2(TSS – RSS) measures the amount of variability in the response that is explained by performing the regression.Properties of R2R2 always ranges between 0 to 1.R2 of 0 means that there is no correlation between the dependent and the independent variable.R2 of 1 means the dependent variable can be predicted from the independent variable without any error. An R2 between 0 and 1 indicates the extent to which the dependent variable is predictable. An R2 of 0.20 means that there is 20% of the variance in Y is predictable from X; an R2 of 0.40 means that 40% is predictable; and so on.Root Mean Square Error (RMSE)Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). The formula for calculating RMSE is:Where N : Total number of observationsWhen standardized observations are used as RMSE inputs, there is a direct relationship with the correlation coefficient. For example, if the correlation coefficient is 1, the RMSE will be 0, because all of the points lie on the regression line (and therefore there are no errors).Mean Absolute Percentage Error (MAPE)There are certain limitations to the use of RMSE, so analysts prefer MAPE over RMSE which gives error in terms of percentages so that different models can be considered for the task and see how they perform. Formula for calculating MAPE can be written as:Where N : Total number of observationsFeature SelectionFeature selection is the automatic selection of attributes for your data that are most relevant to the predictive model you are working on. It seeks to reduce the number of attributes in the dataset by eliminating the features which are not required for the model construction. Feature selection does not totally eliminate an attribute which is considered for the model, rather it mutes that particular characteristic and works with the features which affects the model.Feature selection method aids your mission to create an accurate predictive model. It helps you by choosing features that will give you as good or better accuracy whilst requiring less data. Feature selection methods can be used to identify and remove unnecessary, irrelevant and redundant attributes from the data that do not contribute to the accuracy of the model or may even decrease the accuracy of the model. Having fewer attributes is desirable because it reduces the complexity of the model, and a simpler model is easier to understand, explain and to work with.Feature Selection Algorithms:Filter Method: This method involves assigning scores to individual features and ranking them. The features that have very little to almost no impact are removed from consideration while constructing the model.Wrapper Method: Wrapper method is quite similar to Filter method except the fact that it considers attributes in a group i.e. 
a number of attributes are taken and checked whether they are having an impact on the model and if not another combination is applied.Embedded Method: Embedded method is the best and most accurate of all the algorithms. It learns the features that affect the model while the model is being constructed and takes into consideration only those features. The most common type of embedded feature selection methods are regularization methods.Cost FunctionCost function helps to figure out the best possible plots which can be used to draw the line of best fit for the data points. As we want to reduce the error of the resulting value we change the process of finding out the actual result to a process which can reduce the error between the predicted value and the actual value.Here, J is the cost function.The above function is made in this format to calculate the error difference between the predicted values and the plotted values. We take the square of the summation of all the data points and divide it by the total number of data points. This cost function J is also called the Mean Squared Error (MSE) function. Using this MSE function we are going to predict values such that the MSE value settles at the minima, reducing the cost function.Gradient DescentGradient Descent is an optimization algorithm that helps machine learning models to find out paths to a minimum value using repeated steps. Gradient descent is used to minimize a function so that it gives the lowest output of that function. This function is called the Loss Function. The loss function shows us how much error is produced by the machine learning model compared to actual results. Our aim should be to lower the cost function as much as possible. One way of achieving a low cost function is by the process of gradient descent. Complexity of some equations makes it difficult to use, partial derivative of the cost function with respect to the considered parameter can provide optimal coefficient value. You may refer to the article on Gradient Descent for Machine Learning.Simple Linear RegressionOptimization is a big part of machine learning and almost every machine learning algorithm has an optimization technique at its core for increased efficiency. Gradient Descent is such an optimization algorithm used to find values of coefficients of a function that minimizes the cost function. Gradient Descent is best applied when the solution cannot be obtained by analytical methods (linear algebra) and must be obtained by an optimization technique.Residual Analysis: Simple linear regression models the relationship between the magnitude of one variable and that of a second—for example, as x increases, y also increases. Or as x increases, y decreases. Correlation is another way to measure how two variables are related. The models done by simple linear regression estimate or try to predict the actual result but most often they deviate from the actual result. Residual analysis is used to calculate by how much the estimated value has deviated from the actual result.Null Hypothesis and p-value: During feature selection, null hypothesis is used to find which attributes will not affect the result of the model. Hypothesis tests are used to test the validity of a claim that is made about a particular attribute of the model. This claim that’s on trial, in essence, is called the null hypothesis. A p-value helps to determine the significance of the results. 
p-value is a number between 0 and 1 and is interpreted in the following way:A small p-value (less than 0.05) indicates a strong evidence against the null hypothesis, so the null hypothesis is to be rejected.A large p-value (greater than 0.05) indicates weak evidence against the null hypothesis, so the null hypothesis is to be considered.p-value very close to the cut-off (equal to 0.05) is considered to be marginal (could go either way). In this case, the p-value should be provided to the readers so that they can draw their own conclusions.Ordinary Least SquareOrdinary Least Squares (OLS), also known as Ordinary least squares regression or least squared errors regression is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters for a linear function, the goal of which is to minimize the sum of the squares of the difference of the observed variables and the dependent variables i.e. it tries to attain a relationship between them. There are two types of relationships that may occur: linear and curvilinear. A linear relationship is a straight line that is drawn through the central tendency of the points; whereas a curvilinear relationship is a curved line. Association between the variables are depicted by using a scatter plot. The relationship could be positive or negative, and result variation also differs in strength.The advantage of using Ordinary Least Squares regression is that it can be easily interpreted and is highly compatible with recent computers’ built-in algorithms from linear algebra. It can be used to apply to problems with lots of independent variables which can efficiently conveyed to thousands of data points. In Linear Regression, OLS is used to estimate the unknown parameters by creating a model which will minimize the sum of the squared errors between the observed data and the predicted one.Let us simulate some data and look at how the predicted values (Yₑ) differ from the actual value (Y):import pandas as pd import numpy as np from matplotlib import pyplot as plt # Generate 'random' data np.random.seed(0) X = 2.5 * np.random.randn(100) + 1.5   # Array of 100 values with mean = 1.5, stddev = 2.5 res = 0.5 * np.random.randn(100)         # Generate 100 residual terms y = 2 + 0.3 * X + res                   # Actual values of Y # Create pandas dataframe to store our X and y values df = pd.DataFrame(     {'X': X,       'y': y} ) # Show the first five rows of our dataframe df.head()XY05.9101314.71461512.5003932.07623823.9468452.54881137.1022334.61536846.1688953.264107To estimate y using the OLS method, we need to calculate xmean and ymean, the covariance of X and y (xycov), and the variance of X (xvar) before we can determine the values for alpha and beta.# Calculate the mean of X and y xmean = np.mean(X) ymean = np.mean(y) # Calculate the terms needed for the numator and denominator of beta df['xycov'] = (df['X'] - xmean) * (df['y'] - ymean) df['xvar'] = (df['X'] - xmean)**2 # Calculate beta and alpha beta = df['xycov'].sum() / df['xvar'].sum() alpha = ymean - (beta * xmean) print(f'alpha = {alpha}') print(f'beta = {beta}')alpha = 2.0031670124623426 beta = 0.3229396867092763Now that we have an estimate for alpha and beta, we can write our model as Yₑ = 2.003 + 0.323 X, and make predictions:ypred = alpha + beta * XLet’s plot our prediction ypred against the actual values of y, to get a better visual understanding of our model.# Plot regression against actual data plt.figure(figsize=(12, 6)) plt.plot(X, 
ypred) # regression line plt.plot(X, y, 'ro')   # scatter plot showing actual data plt.title('Actual vs Predicted') plt.xlabel('X') plt.ylabel('y') plt.show()The blue line in the above graph is our line of best fit, Yₑ = 2.003 + 0.323 X.  If you observe the graph carefully, you will notice that there is a linear relationship between X and Y. Using this model, we can predict Y from any values of X. For example, for X = 8,Yₑ = 2.003 + 0.323 (8) = 4.587RegularizationRegularization is a type of regression that is used to decrease the coefficient estimates down to zero. This helps to eliminate the data points that don’t actually represent the true properties of the model, but have appeared by random chance. The process is done by identifying the points which have deviated from the line of best-fit by a large extent. Earlier we saw that to estimate the regression coefficients β in the least squares method, we must minimize the term Residual Sum of Squares (RSS). Let the RSS equation in this case be:The general linear regression model can be expressed using a condensed formula:Here, β=[β0 ,β1, ….. βp]The RSS value will adjust the coefficient, β based on the training data. If the resulting data deviates too much from the training data, then the estimated coefficients won’t generalize well to the future data. This is where regularization comes in and shrinks or regularizes these learned estimates towards zero.Ridge regressionRidge regression is very similar to least squares, except that the Ridge coefficients are estimated by minimizing a different quantity. In particular, the Ridge regression coefficients β are the values that minimize the following quantity:Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of the model. λ controls the relative impact of the two components: RSS and the penalty term. If λ = 0, the Ridge regression will produce a result similar to least squares method. If λ → ∞, all estimated coefficients tend to zero. Ridge regression produces different estimates for different values of λ. The optimal choice of λ is crucial and should be done with cross-validation. The coefficient estimates produced by ridge regression method is also known as the L2 norm.The coefficients generated by Ordinary Least Squares method is independent of scale, which means that if each input variable is multiplied by a constant, the corresponding coefficient will be divided by the same constant, as a result of which the multiplication of the coefficient and the input variables will remain the same. The same is not true for ridge regression and we need to bring the coefficients to the same scale before we perform the process. To standardize the variables, we must subtract their means and divide it by their standard deviations.Lasso RegressionLeast Absolute Shrinkage and Selection Operator (LASSO) regression also shrinks the coefficients by adding a penalty to the sum of squares of the residuals, but the lasso penalty has a slightly different effect. The lasso penalty is the sum of the absolute values of the coefficient vector, which corresponds to its L1 norm. Hence, the lasso estimate is defined by:Similar to ridge regression, the input variables need to be standardized. The lasso penalty makes the solution nonlinear, and there is no closed-form expression for the coefficients as in ridge regression. 
Linear Regression with multiple variables

Linear regression with multiple variables is also known as "multivariate linear regression". We now introduce notation for equations where we can have any number of input variables:

- x(i)j = value of feature j in the ith training example
- x(i) = the input (features) of the ith training example
- m = the number of training examples
- n = the number of features

The multivariable form of the hypothesis function accommodating these multiple features is as follows:

hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + ⋯ + θnxn

In order to develop intuition about this function, we can think of θ0 as the basic price of a house, θ1 as the price per square meter, θ2 as the price per floor, etc., while x1 is the number of square meters in the house, x2 the number of floors, and so on.

Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

hθ(x) = θᵀx

This is a vectorization of our hypothesis function for one training example; see the lessons on vectorization to learn more.

Remark: Note that, for convenience, we assume x0(i) = 1 for i ∈ 1,…,m. This allows us to do matrix operations with θ and x, making the two vectors θ and x(i) match each other element-wise (that is, have the same number of elements: n+1).

Multiple Linear Regression

How is it different? In simple linear regression we use a single independent variable to predict the value of a dependent variable, whereas in multiple linear regression two or more independent variables are used to predict the value of a dependent variable. The difference between the two is the number of independent variables; in both cases there is only a single dependent variable.

Multicollinearity

Multicollinearity is a state of very high intercorrelations or inter-associations among the independent variables. It is therefore a type of disturbance in the data, and if it is present the statistical inferences made about the data may not be reliable. The VIF (Variance Inflation Factor) is used to identify multicollinearity; if the VIF value of a variable is greater than 4, we exclude that variable from our model.

There are certain reasons why multicollinearity occurs:

- It is caused by an inaccurate use of dummy variables.
- It is caused by the inclusion of a variable which is computed from other variables in the data set.
- It can also result from the repetition of the same kind of variable.
- It generally occurs when the variables are highly correlated with each other.

Multicollinearity can result in several problems:

- The partial regression coefficients may not be estimated precisely.
- The standard errors are likely to be high.
- Multicollinearity results in a change in the signs as well as in the magnitudes of the partial regression coefficients from one sample to another.
- Multicollinearity makes it tedious to assess the relative importance of the independent variables in explaining the variation in the dependent variable.

Iterative Models

Models should be tested and upgraded again and again for better performance. Multiple iterations allow the model to learn from its previous results and take them into consideration while performing the task again.

Making predictions with Linear Regression

Linear Regression can be used to predict the value of an unknown variable from a known variable with the help of a straight line (also called the regression line). The prediction can only be made if a significant correlation between the known and the unknown variable is established through both a correlation coefficient and a scatter plot.

The general procedure for using regression to make good predictions is the following:

- Research the subject area so that the model can be built based on the results produced by similar models. This research helps with the subsequent steps.
- Collect data for appropriate variables which have some correlation with the model.
- Specify and assess the regression model.
- Run repeated tests so that the model has more data to work with.

To test if the model is good enough, observe whether:

- The scatter plot forms a linear pattern.
- The correlation coefficient r has a value above 0.5 or below -0.5. A positive value indicates a positive relationship and a negative value represents a negative relationship.

If the correlation coefficient shows a strong relationship between variables but the scatter plot is not linear, the results can be misleading. Examples of how to use linear regression have been shown earlier.

Data preparation for Linear Regression

Step 1: Linear Assumption
The first step of data preparation is checking for variables which have some sort of linear correlation between the dependent and the independent variables.

Step 2: Remove Noise
This is the process of reducing the number of attributes in the dataset by eliminating the features which contribute very little to the construction of the model.

Step 3: Remove Collinearity
Collinearity tells us the strength of the relationship between independent variables. If two or more variables are highly collinear, it would not make sense to keep all of them while evaluating the model, and hence we can keep just one of them.

Step 4: Gaussian Distributions
The linear regression model will produce more reliable results if the input and output variables have a Gaussian distribution. The central limit theorem states that a sample mean drawn from a population is approximately normal, or Gaussian, with mean equal to that of the underlying population and variance equal to the population variance divided by the sample size; the approximation improves as the sample size gets large.

Step 5: Rescale Inputs
The linear regression model will produce more reliable predictions if the input variables are rescaled using standardization or normalization. A sketch of Steps 3 and 5 in code is shown below.
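The collinearity check and the rescaling step are easy to sketch in code. The snippet below is not from the original article: it builds a small made-up feature table (all column names are illustrative) and uses statsmodels' variance_inflation_factor together with scikit-learn's StandardScaler.

# A minimal sketch of Step 3 (collinearity check via VIF) and Step 5 (rescaling),
# using made-up data; 'total_hours' is almost a sum of the other two columns.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = pd.DataFrame({
    'hours_studied': rng.uniform(0, 10, 50),
    'hours_slept':   rng.uniform(4, 9, 50),
})
features['total_hours'] = features['hours_studied'] + features['hours_slept'] + rng.normal(0, 0.1, 50)

# Step 3: compute a VIF for every predictor (a constant column is added first, as is
# conventional); columns caught up in the near-duplicate relationship show very large
# values, and one of them can be dropped.
design = sm.add_constant(features)
vifs = pd.Series(
    [variance_inflation_factor(design.values, i) for i in range(1, design.shape[1])],
    index=features.columns,
)
print(vifs)
reduced = features.drop(columns=['total_hours'])

# Step 5: rescale the remaining inputs to zero mean and unit variance
scaled = StandardScaler().fit_transform(reduced)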
Linear Regression with statsmodels

We have already discussed the OLS method; now we will move on and see how to use it through the statsmodels library. For this we will be using the popular advertising dataset. Here, we will only be looking at the TV variable and explore whether spending on TV advertising can predict the number of sales for the product. Let’s start by importing this csv file as a pandas dataframe using read_csv():

# Import and display first five rows of advertising dataset
advert = pd.read_csv('advertising.csv')
advert.head()

      TV   Radio  Newspaper  Sales
0  230.1    37.8       69.2   22.1
1   44.5    39.3       45.1   10.4
2   17.2    45.9       69.3   12.0
3  151.5    41.3       58.5   16.5
4  180.8    10.8       58.4   17.9

Now we will use statsmodels’ OLS function to initialize a simple linear regression model. It will take the formula y ~ X, where X is the predictor variable (TV advertising costs) and y is the output variable (Sales). Then, we will fit the model by calling the OLS object’s fit() method.

import statsmodels.formula.api as smf

# Initialise and fit linear regression model using `statsmodels`
model = smf.ols('Sales ~ TV', data=advert)
model = model.fit()

Once we have fit the simple regression model, we can predict the values of sales based on the equation we just derived using the .predict method, and also visualize our regression model by plotting sales_pred against the TV advertising costs to find the line of best fit.

# Predict values
sales_pred = model.predict()

# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(advert['TV'], advert['Sales'], 'o')           # scatter plot showing actual data
plt.plot(advert['TV'], sales_pred, 'r', linewidth=2)   # regression line
plt.xlabel('TV Advertising Costs')
plt.ylabel('Sales')
plt.title('TV vs Sales')
plt.show()

In the above graph you will notice that there is a positive linear relationship between TV advertising costs and Sales: spending more on TV advertising predicts a higher number of sales.

Linear Regression with scikit-learn

Let us learn to implement linear regression models using sklearn. For this model as well we will continue to use the advertising dataset, but now we will use two predictor variables to create a multiple linear regression model:

Yₑ = α + β₁X₁ + β₂X₂ + … + βₚXₚ, where p is the number of predictors.

In our example, we will be predicting Sales using the variables TV and Radio, i.e. our model can be written as:

Sales = α + β₁*TV + β₂*Radio

from sklearn.linear_model import LinearRegression

# Build linear regression model using TV and Radio as predictors
# Split data into predictors X and output y
predictors = ['TV', 'Radio']
X = advert[predictors]
y = advert['Sales']

# Initialise and fit model
lm = LinearRegression()
model = lm.fit(X, y)
print(f'alpha = {model.intercept_}')
print(f'betas = {model.coef_}')

alpha = 4.630879464097768
betas = [0.05444896 0.10717457]

model.predict(X)

Now that we have fit a multiple linear regression model to our data, we can predict sales from any combination of TV and Radio advertising costs. For example, suppose you want to know how many sales we would make if we invested $600 in TV advertising and $300 in Radio advertising. You can simply find out by:

new_X = [[600, 300]]
print(model.predict(new_X))

[69.4526273]

We get the output as 69.45, which means that if we invest $600 on TV and $300 on Radio advertising, we can expect to sell approximately 69 units.
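As a quick, optional check that is not part of the original walkthrough, the same two-predictor model can be evaluated on held-out data with scikit-learn's metrics; the split ratio and random_state below are arbitrary choices.

# A minimal sketch of evaluating the TV + Radio model on held-out data
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

lm = LinearRegression().fit(X_train, y_train)
y_hat = lm.predict(X_test)

print('R^2 on held-out data:', r2_score(y_test, y_hat))
print('Mean squared error  :', mean_squared_error(y_test, y_hat))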
Summary

Let us sum up what we have covered in this article so far:
- How to understand a regression problem
- What linear regression is and how it works
- The Ordinary Least Squares method and regularization
- Implementing linear regression in Python using the statsmodels and sklearn libraries

We have discussed a couple of ways to implement linear regression and build efficient models for certain business problems. If you are inspired by the opportunities provided by machine learning, enrol in our Data Science and Machine Learning Courses for more lucrative career options in this landscape.

What is K-Nearest Neighbor in Machine Learning: K-NN Algorithm

If you are thinking of a simple, easy-to-implement supervised machine learning algorithm which can be used to solve both classification as well as regression problems, K-Nearest Neighbor (K-NN) is the perfect choice. Learning K-NN is a great way to introduce yourself to machine learning and classification in general. Also, you will find a lot of intense application of K-NN in data mining, pattern recognition, semantic searching, intrusion detection and anomaly detection.K-Nearest Neighbors is one of the most basic supervised machine learning algorithms, yet very essential. A supervised machine learning algorithm is one of the types of machine learning algorithm which is dependent on labelled input data in order to learn a function which is capable of producing an output whenever a new unlabeled data is given as input.In real life scenarios, K-NN is widely used as it is non-parametric which means it does not make any underlying assumptions about the distributions of data. With the business world entirely revolving around Data Science, it has become one of the most lucrative fields. Hence, the heavy demand for a Data Science Certification.Parametric vs Non-parametric MethodsLet us look into how different is a parametric machine learning algorithm from a nonparametric machine learning algorithm.Machine learning, in other words can be called as learning a function (f) which maps input variables (X) to the output variables (Y).Y=f(X)An algorithm learns about the target mapping function from the training data. As we are unaware of the form of the function, we have to evaluate various machine learning algorithms and figure out which algorithms perform better at providing an approximation of the underlying function.Statistical Methods are classified on the basis of what we know about the population we are studying.Parametric statistics is a branch of statistics which assumes that sample data comes from a population that follows a probability distribution based on a fixed set of parameters.Nonparametric statistics is the branch of statistics that is not based solely on population parameters.Parametric Machine Learning AlgorithmsThis particular algorithm involves two steps:Selecting a form for the functionLearning the coefficients for the function from the training dataLet us consider a line to understand functional form for the mapping function as it is used in linear regression and simplify the learning process.b0 + b1*x1 + b2*x2 = 0Where b0, b1 and b2 are the coefficients of the line which control the intercept and slope, and x1 and x2 are two input variables.All we have to do now is to estimate the coefficients of the line equation to get a predictive model for the problem. Now, the problem is that the actual unknown underlying function may not be a linear function like a line. In that case, the approach will give poor results. Some of the examples of parametric machine learning algorithms are mentioned below:Logistic RegressionLinear Discriminant AnalysisPerceptronNaive BayesSimple Neural NetworksNonparametric Machine Learning AlgorithmsNonparametric methods always try to find the best fit training data while constructing the mapping function which also allows it to fit a large number of functional forms. 
Some examples of nonparametric machine learning algorithms are mentioned below:

- k-Nearest Neighbors
- Decision Trees like CART and C4.5
- Support Vector Machines

The best example of a nonparametric machine learning algorithm is the k-nearest neighbors algorithm, which makes predictions based on the k most similar training patterns for a new data instance. This method simply assumes that patterns which are close are likely to be of a similar type.

Parametric Machine Learning Algorithms
- Benefits: simple to understand and interpret results; speed of learning from data is fast; less training data is required.
- Limitations: choosing a functional form constrains the method to the specified form; it has limited complexity and is more suited to simpler problems; it is unlikely to match the underlying mapping function, resulting in a poor fit.

Nonparametric Machine Learning Algorithms
- Benefits: flexible enough to fit a large number of functional forms; no assumptions about the underlying function; provides high performance for prediction.
- Limitations: requires more training data in order to estimate the mapping function; due to more parameters to train, it is comparatively slower; there is a risk of overfitting the training data.

Method Based Learning

There are several learning models, namely:

- Association rules based
- Ensemble method based
- Deep Learning based
- Clustering method based
- Regression Analysis based
- Bayesian method based
- Dimensionality reduction based
- Kernel method based
- Instance based

Let us understand what Instance Based Learning is all about.

Instance Based Learning (IBL)

- Instance-based methods are the simplest form of learning.
- Instance-based learning is lazy learning.
- The K-NN model works on identified instances.
- Instances are retrieved from memory and then this data is used to classify the new query instance.
- Instance based learning is also called memory-based or case-based learning.

Under Instance-based Learning we have the nearest-neighbor classifier, which uses the k “closest” points (nearest neighbors) for performing classification. For example, it is how people judge by observing their peers: we tend to move with people of similar attributes.

Lazy Learning vs Eager Learning

- Lazy learning simply stores the training data and waits until it is given a test tuple; eager learning munges the training data as soon as it receives it.
- Lazy learning is slow, as it calculates based on the current data set instead of coming up with an algorithm based on historical data; eager learning is fast, as it has a pre-calculated algorithm.
- With lazy learning the data stays localized, so generalization takes time at every iteration; eager learning constructs a classification model from the training set before receiving new data to classify.

What is K-NN?

One of the biggest applications of K-Nearest Neighbor search is Recommender Systems. If you have noticed, while you are shopping as a user on Amazon and you like a particular item, you are recommended similar items. It also recommends similar items bought by other users and other sets of items which are often bought together. Basically, the algorithm compares the set of users who like each item and looks for similarity. This not only applies to recommending items or products but also to recommending media and even advertisements to display to a user.

The K-nearest neighbors or K-NN algorithm is a simple algorithm which uses the entire dataset in its training phase.
Whenever a prediction is required for an unseen data instance, it searches through the entire training dataset for k-most similar instances and the data with the most similar instance is finally returned as the prediction.This algorithm suggests that if you’re similar to your neighbours, then you are one of them. Let us consider a simple example, if apple looks more similar to peach, pear, and cherry (fruits) than monkey, cat or a rat (animals), then most likely apple is a fruit.Nearest Neighbours algorithm has been in action for the last sixty years. It is mainly used in statistical estimation and pattern recognition, as a non-parametric method, for regression and classification. The main aim of the K-Nearest Neighbor algorithm is to classify a new data point by comparing it to all previously seen data points. The classification of the k most similar previous cases are used for predicting the classification of the current data point. It is a simple algorithm which stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).When do we use K-NN algorithm?K-NN algorithm can be used for applications which require high accuracy as it makes highly accurate predictions. The quality of predictions is completely dependent on the distance measure. Thus, this algorithm is suitable for applications for which you have sufficient domain knowledge so that it can help you select an appropriate measure.As we have already seen K-NN algorithm is a type of lazy learning, the computation for the generation is postponed until classification which indeed increases the costs of computation compared to other machine learning algorithms. But still K-NN is considered to be the better choice for applications where accuracy is more important and predictions are not requested frequently.K-NN can be used for both regression and classification predictive problems. However, in the industry it is mostly used in classification problems.Generally we mainly look at 3 important aspects in order to evaluate any technique:Ease to interpret outputCalculation timePredictive PowerLet us consider a few examples to place K-NN in the scale :If you notice the chart mentioned above, K-NN algorithm exceeds in most of the parameters. It is most commonly used for ease of interpretation and low calculation time.How does the K-NN algorithm work?K-NN algorithm works on the basis of feature similarity. The classification of a given data point is determined by how closely out-of-sample features resemble our training set.The above figure shows an example of k-NN classification. If you consider the nearest neighbor to the test sample, it is a blue square (Class 1) and k=1. This falls inside the inner circle.Now, if you consider k=3, then you will see 2 red triangles and only 1 blue square falls under the outer circle. Thus, the test sample is classified as a red triangle now (Class 2).Similarly, if you consider k=5, it is assigned to the first class (3 squares vs. 2 triangles outside the outer circle).K-NN in RegressionIn regression problems, K-NN is used for prediction based on the mean or the median of the K-most similar instances.K-NN in ClassificationK-nearest-neighbor classification was actually developed from the need to perform discriminant analysis when reliable parametric estimates of probability densities are unknown or are difficult to determine. 
When K-NN is used for classification, the output is easily calculated as the class having the highest frequency among the K most similar instances; the class with the maximum votes is taken as the prediction.

Class probabilities can be calculated as the normalized frequency of samples that belong to each class in the set of K most similar instances for a new data instance. For example, in a binary classification problem (class is 0 or 1):

p(class=0) = count(class=0) / (count(class=0) + count(class=1))

If you have an even number of classes (e.g. 2), it is a good idea to choose an odd value of K to avoid a tie; conversely, use an even number for K when you have an odd number of classes. Ties can also be broken consistently by expanding K by 1 and looking at the class of the next most similar instance in the training dataset.

Making Predictions with K-NN

A case is classified by a majority vote of its neighbors: the case is assigned to the most common class amongst its K nearest neighbors as measured by a distance function. If the value of K is 1, the case is simply assigned to the class of its nearest neighbor.

Distance measures such as the Euclidean, Manhattan and Minkowski distances are valid only for continuous variables. For categorical variables, the Hamming distance is used. This also brings up the issue of standardizing the numerical variables between 0 and 1 when there is a mixture of numerical and categorical variables in the dataset.

By inspecting the data, you can choose a good value for K. Generally, a large value of K is more accurate as it tends to reduce the overall noise, but this is not always true. Another way to retrospectively determine a good K value is cross-validation, which uses an independent dataset to validate the chosen K. According to observation, the optimal K for most datasets has been between 3 and 10, which provides better results than 1-NN.

For example, let us consider a case where the data shown below is concerned with credit default. Age and Loan are two numerical variables (predictors) and Default is the target.

Age   Loan        Default   Distance
25    $40,000     N         102000
35    $60,000     N          82000
45    $80,000     N          62000
20    $20,000     N         122000
35    $120,000    N          22000  (2nd nearest)
52    $18,000     N         124000
23    $95,000     Y          47000
40    $62,000     Y          80000
60    $100,000    Y          42000  (3rd nearest)
48    $220,000    Y          78000
33    $150,000    Y           8000  (1st nearest)
48    $142,000    ?         (Euclidean distance from this unknown case to each training case)

Using this training set, we can classify the unknown case (Age=48 and Loan=$142,000) using Euclidean distance. If K=1, the nearest neighbor is the last case in the training set, with Default=Y. With K=3, there are two cases with Default=Y and one with Default=N among the three closest neighbors, so the prediction for the unknown case is again Default=Y. A short Python sketch reproducing this calculation is given below.
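Here is a small sketch, not from the original text, that reproduces the raw-distance version of this example and the K=1 and K=3 votes; the tuples simply restate the table above.

# Reproducing the credit-default example: raw Euclidean distances from the
# unknown case (Age=48, Loan=$142,000) and majority votes for K=1 and K=3.
import math
from collections import Counter

train = [  # (age, loan, default)
    (25, 40000, 'N'), (35, 60000, 'N'), (45, 80000, 'N'), (20, 20000, 'N'),
    (35, 120000, 'N'), (52, 18000, 'N'), (23, 95000, 'Y'), (40, 62000, 'Y'),
    (60, 100000, 'Y'), (48, 220000, 'Y'), (33, 150000, 'Y'),
]
query = (48, 142000)

def euclidean(a, b):
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

# Sort the training cases by their distance to the query point
ranked = sorted(train, key=lambda row: euclidean(row[:2], query))

for k in (1, 3):
    votes = Counter(label for _, _, label in ranked[:k])
    print(f'K={k}: nearest labels={list(votes.elements())}, prediction={votes.most_common(1)[0][0]}')

Because Loan is several orders of magnitude larger than Age, it dominates these raw distances, which is exactly the scaling problem addressed next.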
Standardized Distance

One major drawback of calculating distance measures directly from the training set arises when variables have different measurement scales or when there is a mixture of numerical and categorical variables. For example, if one variable is based on annual income in dollars and the other is based on age in years, then income will have a much higher influence on the distance calculated. One solution is to standardize the training set, as shown below.

Age     Loan    Default   Standardized Distance
0.125   0.11    N         0.7652
0.375   0.21    N         0.5200
0.625   0.31    N         0.3160
0       0.01    N         0.9245
0.375   0.50    N         0.3428
0.8     0.00    N         0.6220
0.075   0.38    Y         0.6669
0.5     0.22    Y         0.4437
1       0.41    Y         0.3650
0.7     1.00    Y         0.3861
0.325   0.65    Y         0.3771
0.7     0.61    ?         (standardized variables for the unknown case)

Using the standardized distance on the same training set, the unknown case returns a different nearest neighbor, which is not a good sign of robustness.

Between-sample geometric distance

The k-nearest-neighbor classifier is commonly based on the Euclidean distance between a test sample and the specified training samples. Let xi be an input sample with p features (xi1, xi2, …, xip), n be the total number of input samples (i = 1, 2, …, n) and p the total number of features (j = 1, 2, …, p). The Euclidean distance between sample xi and xl (l = 1, 2, …, n) is defined as:

d(xi, xl) = sqrt((xi1 − xl1)² + (xi2 − xl2)² + … + (xip − xlp)²)

A graphical representation of the nearest neighbor concept is illustrated by the Voronoi tessellation. The tessellation shows 19 samples marked with a "+", and the Voronoi cell, R, surrounding each sample. A Voronoi cell encapsulates all neighboring points that are nearest to each sample and is defined as:

Ri = {x : d(x, xi) ≤ d(x, xm) for all m ≠ i}

where Ri is the Voronoi cell for sample xi, and x represents all possible points within Voronoi cell Ri.

(Figure: Voronoi tessellation showing Voronoi cells of 19 samples marked with a "+".)

The Voronoi tessellation reflects two characteristics of the example 2-dimensional coordinate system: i) all possible points within a sample's Voronoi cell are the nearest neighboring points for that sample, and ii) for any sample, the nearest sample is determined by the closest Voronoi cell edge. According to the latter characteristic, the k-nearest-neighbor classification rule is to assign to a test sample the majority category label of its k nearest training samples. In practice, k is usually chosen to be odd, so as to avoid ties. The k = 1 rule is generally called the nearest-neighbor classification rule.

Curse of Dimensionality

The curse of dimensionality refers to various phenomena that arise while analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions). Such phenomena do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.

K-NN works well with a small number of input variables (p), but struggles when the number of inputs is very large. Each input variable can be considered a dimension of a p-dimensional input space. For example, if you have two input variables x1 and x2, the input space is 2-dimensional. As the number of dimensions increases, the volume of the input space grows at an exponential rate. In higher dimensions, points which are similar may still end up far apart from each other, and our intuition about 2- to 3-dimensional spaces no longer applies. This kind of problem is called the “Curse of Dimensionality”.

How is K in K-means different from K in K-NN?

K-Means Clustering and the k-Nearest Neighbors algorithm are both commonly used algorithms in Machine Learning, and they are often confused with each other, especially when we are talking about the k-factor. The ‘K’ in K-Means Clustering has nothing to do with the ‘K’ in the K-NN algorithm.
k-Means Clustering is an unsupervised learning algorithm that is used for clustering, whereas K-NN is a supervised learning algorithm used for classification.

K-Means Algorithm

The k-means algorithm is an unsupervised clustering algorithm which takes a set of unlabelled points and groups them into “k” clusters. The “k” in k-means denotes the number of clusters you would like to have in the end; if the value of k is 5, you will end up with 5 clusters on the data set.

Let us see how it works.

Step 1: First you determine the value of K by the Elbow method and then specify the number of clusters K
Step 2: Next you randomly assign each data point to a cluster
Step 3: Determine the cluster centroid coordinates
Step 4: Determine the distances of each data point to the centroids and re-assign each point to the closest cluster centroid based upon minimum distance
Step 5: Calculate the cluster centroids again
Step 6: Repeat steps 4 and 5 until the cluster assignments stop changing and no further improvement is possible, i.e. the algorithm has converged

Implementation in Python

The snippets below assume the Iris measurements have been loaded into a dataframe x, with the true species labels kept aside for comparison.

#Finding the optimum number of clusters for k-means clustering
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

iris = load_iris(as_frame=True)
x = iris.data.rename(columns={
    'sepal length (cm)': 'sepal_length', 'sepal width (cm)': 'sepal_width',
    'petal length (cm)': 'petal_length', 'petal width (cm)': 'petal_width'})

Nc = range(1, 10)
kmeans = [KMeans(n_clusters=i) for i in Nc]
score = [kmeans[i].fit(x).score(x) for i in range(len(kmeans))]
plt.plot(Nc, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()

You can clearly see why it is called 'the elbow method' from the above graph: the optimum number of clusters is where the elbow occurs. Now that we have the optimum number of clusters (k=3), we can move on to applying K-means clustering to the Iris dataset.

#Implementation of K-Means Clustering
model = KMeans(n_clusters=3)
model.fit(x)
model.labels_

# Colour the points by their assigned cluster (two of the four features are plotted)
colormap = np.array(['Red', 'Blue', 'Green'])
z = plt.scatter(x.sepal_length, x.sepal_width, c=colormap[model.labels_])

#Accuracy of K-Means Clustering (note: cluster numbers may be permuted relative to the true labels)
accuracy_score(iris.target, model.labels_)

0.8933333333333333

K-NN Algorithm

By now we already know that the K-NN algorithm is a supervised classification algorithm. It takes a set of labelled points and uses them to learn how to label other points. To assign a label to a new point, the K-NN algorithm looks at the closest neighbors of the new point and takes a vote: the class most common among those neighbors decides the label of the new point. The “k” in K-Nearest Neighbors is the number of neighbors it checks. It is supervised because it is trying to classify a point based on the known classification of other points.

Let us see how it works.

Step 1: Firstly, you determine the value of K.
Step 2: Then you calculate the distances between the new input (test data) and all the training data.
The most commonly used metrics for calculating distance are Euclidean, Manhattan and Minkowski.
Step 3: Sort the distances and determine the k nearest neighbors based on the minimum distance values
Step 4: Analyze the category of those neighbors and assign the category for the test data based on majority vote
Step 5: Return the predicted class

Implementation using Python

# X and y are assumed to hold the Iris features and species labels, and
# X_train/X_test/y_train/y_test come from the train/test split shown below.
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import matplotlib.pyplot as plt

error = []

# Calculating error for K values between 1 and 40
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(pred_i != y_test))

plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='black', linestyle='dashed', marker='o',
         markerfacecolor='grey', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')

Now we know for which values of ‘K’ the error rate will be low. Let’s fix k=5 and implement the K-NN algorithm.

#Creating training and test splits
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

#Performing Feature Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

#Training K-NN with k=5
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

y_pred = classifier.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[10  0  0]
 [ 0  9  2]
 [ 0  1  8]]

                 precision    recall  f1-score   support
    Iris-setosa       1.00      1.00      1.00        10
Iris-versicolor       0.90      0.82      0.86        11
 Iris-virginica       0.80      0.89      0.84         9

       accuracy                           0.90        30
      macro avg       0.90      0.90      0.90        30
   weighted avg       0.90      0.90      0.90        30

Practical Applications of K-NN

Now that we have seen how K-NN works, let us look at some of its practical applications. It is used for recommending products to people with similar interests, recommending movies and TV shows as per the viewer’s choice and interest, and recommending hotels and other accommodation while you are travelling, based on your previous bookings. It is also used for assigning credit ratings based on financial characteristics by comparing people with similar financial features in a database: by analyzing the nature of a credit rating, people with similar financial details would be assigned similar credit ratings. Should the bank give a loan to an individual? Would an individual default on his or her loan?
Is that person closer in characteristics to people who defaulted or did not default on their loans?Some advanced examples could include handwriting detection (like OCR), image recognition and even video recognition.Some pros and cons of K-NNProsTraining phase of K-nearest neighbor classification is faster in comparison with other classification algorithms.Training of a model is not required for generalization.Simple algorithm — to explain and understand/interpret.High accuracy (relatively) — it is pretty high but not competitive in comparison to better supervised learning models.K-NN can be useful in case of nonlinear data.Versatile — useful for classification or regression.ConsTesting phase of K-nearest neighbor classification is slower and costlier with respect to time and memory. High memory requirement - Requires large memory for storing the entire training dataset.K-NN requires scaling of data because K-NN uses the Euclidean distance between two data points to find nearest neighbors.Euclidean distance is sensitive to magnitudes. The features with high magnitudes will weigh more than features with low magnitudes.Not suitable for large dimensional data.How to improve the performance of K-NN?Rescaling Data: K-NN performs much better if all of the data has the same scale. Normalizing your data to the range [0, 1] is a good idea. It may also be a good idea to standardize your data if it has a Gaussian distribution.Addressing Missing Data: Missing data will mean that the distance between samples can not be calculated. These samples could be excluded or the missing values could be imputed.Reducing Dimensionality: K-NN is suited for lower dimensional data. You can try it on high dimensional data (hundreds or thousands of input variables) but be aware that it may not perform as good as other techniques. K-NN can benefit from feature selection that reduces the dimensionality of the input feature space.In this article we have learned about the K-Nearest Neighbor algorithm, where we should use it, how it works and so on. Also, we have discussed about parametric and nonparametric machine learning algorithms, instance based learning, eager and lazy learning, advantages and disadvantages of using K-NN, performance improvement suggestions and have implemented K-NN in Python. To learn more about other machine learning algorithms, join our Data Science Certification course and expand your learning skill set and career opportunities.

What is Classification and Regression Trees in Machine Learning

Introduction to Machine Learning and its typesMachine Learning is an interdisciplinary field of study and is a sub-domain of Artificial Intelligence. It gives computers the ability to learn and infer from a huge amount of homogeneous data, without having to be programmed explicitly.Types of Machine Learning: Machine Learning can broadly be classified into three types:Supervised Learning: If the available dataset has predefined features and labels, on which the machine learning models are trained, then the type of learning is known as Supervised Machine Learning. Supervised Machine Learning Models can broadly be classified into two sub-parts: Classification and Regression. These have been discussed further in detail.Unsupervised Learning: If the available dataset has predefined features but lacks labels, then the Machine Learning algorithms perform operations on this data to assign labels to it or to reduce the dimensionality of the data. There are several types of Unsupervised Learning Models, the most common of them being: Principal Component Analysis (PCA) and Clustering.Reinforcement Learning: Reinforcement Learning is a more advanced type of learning, where, the model learns from “Experience”. Here, features and labels are not clearly defined. The model is just given a “Situation” and is rewarded or penalized based on the “Outcome”. The model thus learns to optimize the “Situation” to maximize the Rewards and hence improves the “Outcome” with “Experience”.ClassificationClassification is the process of determination/prediction of the category to which a data-point may belong to. It is the process by which a Supervised Learning Algorithm learns to draw inference from the features of a given dataset and predict which class or group or category does the particular data point belongs to.Example of Classification: Let’s assume that we are given a few images of handwritten digits (0-9). The problem statement is to “teach” the machine to classify correctly which image corresponds to which digit. A small sample of the dataset is given below:The machine has to be thus trained, such that, when given an input of any such hand-written digit, it has to correctly classify the digits and mention which digit the image represents. This is classed classification of hand-written digits.Looking at another example which is not image-based, we have 2D data (x1 and x2) which is plotted in the form of a graph shown below.The red and green dots represent two different classes or categories of data. The main goal of the classifier is that given one such “dot” of unknown class, based on its “features”, the algorithm should be able to correctly classify if that dot belongs to the red or green class. This is also shown by the line going through the middle, which correctly classifies the majority of the dots.Applications of Classification: Listed below are some of the real-world applications of classification Algorithms.Face Recognition: Face recognition finds its applications in our smartphones and any other place with Biometric security. Face Recognition is nothing but face detection followed by classification. The classification algorithm determines if the face in the image matches with the registered user or not.Medical Image Classification: Given the data of patients, a model that is well trained is often used to classify if the patient has a malignant tumor (cancer), heart ailments, fractures, etc.RegressionRegression is also a type of supervised learning. 
Unlike classification, it does not predict the class of the given data. Instead, it predicts the corresponding values for a given dataset based on the “features” it encounters.

Example of Regression: For this, we will look at a dataset consisting of California housing prices. The contents of this dataset are shown below. Here, there are several columns, each showing the “features” based on which the machine learning algorithm predicts the housing price (shown by the yellow highlight). The primary goal of the regression algorithm is that, given the features of a given house, it should be able to correctly estimate the price of the house. This is called a regression problem. It is similar to curve fitting and is often confused with the same.

Applications of Regression: Listed below are some of the real-world applications of regression algorithms.

Stock Market Prediction: Regression algorithms are used to predict the future price of stocks based on certain past features like the time of the day or festival time, etc. Stock market prediction also falls under a subdomain of study called Time Series Analysis.

Object Detection Algorithms: Object detection is the process of detecting the location of a given object in an image or video. This process returns the coordinates of the pixel values stating the location of the object in the image. These coordinates are determined by using regression algorithms alongside classification.

Classification vs Regression:
- Classification assigns specific classes to the data based on its features; regression predicts values based on the features of the dataset.
- In classification the prediction is discrete or categorical in nature; in regression the prediction is continuous in nature.

Introduction to the building blocks of Decision Trees

In order to get started with Decision Trees, it is important to understand their basic building blocks. Hence, we start building the concepts slowly with some basic theory.

1. Entropy

Definition: It is a commonly used concept in Information Theory and is a measure of the “purity” of an arbitrary collection of information.

Mathematical Equation: Given a collection S containing positive and negative examples, the entropy of S is

Entropy(S) = −p₊ log₂ p₊ − p₋ log₂ p₋

where p₊ and p₋ represent the proportions of positive and negative examples in the given data. In a more generalized form, for a target attribute that can take c different values, entropy is given by

Entropy(S) = Σᵢ −pᵢ log₂ pᵢ

Example: As an example, a sample S is taken which contains 14 data samples and includes 9 positive and 5 negative samples. The same is denoted by the mathematical notation [9+, 5–]. The entropy of this sample works out to about 0.940, as reproduced in the calculation below.

2. Information Gain

Definition: With the knowledge of entropy, the amount of relevant information that is gained from a given random sample can be calculated; this is known as Information Gain.

Mathematical Equation:

Gain(S, A) = Entropy(S) − Σ over v ∈ Values(A) of (|Sᵥ| / |S|) · Entropy(Sᵥ)

Here, Gain(S, A) is the information gain of an attribute A relative to a sample S, and Values(A) is the set of all possible values of attribute A.

Example: As an example, let’s assume S is a collection of 14 training examples. Here we will consider the attribute to be Wind, and the values of that attribute will be Weak and Strong. In addition to the previous example information, we will assume that out of the previously mentioned 9 positive and 5 negative samples, 6 positive and 2 negative samples have the value of the attribute Wind=Weak, and the remaining have Wind=Strong. Under such a circumstance, the information gained by the attribute Wind can be calculated as shown below.
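The calculations that the original figures showed can be reconstructed from the definitions above (values are rounded to three decimals):

Entropy(S)        = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) ≈ 0.940
Entropy(S_Weak)   = −(6/8)·log₂(6/8) − (2/8)·log₂(2/8) ≈ 0.811
Entropy(S_Strong) = −(3/6)·log₂(3/6) − (3/6)·log₂(3/6) = 1.000
Gain(S, Wind)     = 0.940 − (8/14)·0.811 − (6/14)·1.000 ≈ 0.940 − 0.463 − 0.429 = 0.048

So knowing Wind reduces the entropy of the sample only slightly; attributes with larger gains, such as Humidity in the example that follows, are preferred when growing the tree.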
Decision Tree

Introduction: With the basic building blocks out of the way, let’s try to understand what exactly a Decision Tree is. As the name suggests, it is a tree which is developed based on certain decisions taken by the algorithm in accordance with the data it has been trained on. In simple words, a Decision Tree uses the features in the given data to perform supervised learning and develop a tree-like structure (data structure) whose branches are developed in such a way that, given the feature set, the decision tree can predict the expected output relatively accurately.

Example: Let us look at the structure of a decision tree. For this, we will take up an example dataset called the “PlayTennis” dataset; a sample of the dataset is shown below. In summary, the target of the model is to predict whether the weather conditions are suitable to play tennis or not, as guided by the dataset. As can be seen, the dataset contains certain information (features) for each day: we have the feature attributes Outlook, Temperature, Humidity and Wind, and the target attribute PlayTennis. Each of these attributes can take certain values; for example, the attribute Outlook has the values Sunny, Rain and Overcast.

With a clear idea of the dataset, jumping a bit forward, let us look at the structure of the learned decision tree as developed from the above dataset. As shown above, given certain values for each of the attributes, the learned decision tree is capable of giving a clear answer as to whether the weather is suitable for tennis or not.

Algorithm: With the overall intuition of decision trees in place, let us look at the formal algorithm:

ID3(Samples, Target_attribute, Attributes):
- Create a root node for the tree
- If all the Samples are positive, return the single-node tree Root, with label = +
- If all the Samples are negative, return the single-node tree Root, with label = –
- If Attributes is empty, return the single-node tree Root with label = most common value of the Target_attribute among the Samples
- Otherwise:
  - A ← the attribute from Attributes that best classifies the Samples
  - The decision attribute for Root ← A
  - For each possible value vi of A:
    - Add a new tree branch below Root, corresponding to the test A = vi
    - Let Samples_vi be the subset of Samples that have value vi for A
    - If Samples_vi is empty, below the new branch add a leaf node with label = most common value of Target_attribute in the Samples
    - Else, below the new branch add the subtree ID3(Samples_vi, Target_attribute, Attributes – {A})
- Return Root

Connecting the dots: Since the overall idea of decision trees has been explained, let’s try to figure out how Entropy and Information Gain fit into this entire process. Entropy (E) is used to calculate Information Gain, which in turn is used to identify which attribute of a given dataset provides the highest amount of information. The attribute which provides the highest amount of information for the given dataset is considered to contribute the most towards the outcome of the classifier and hence is given higher priority in the tree.

For example, considering the PlayTennis example, if we calculate the Information Gain for the two attributes Humidity and Wind, we find that Humidity plays a more important role in deciding whether to play tennis or not. Hence, in this case, Humidity is considered the better classifier.
The detailed calculation is shown in the figure below:Applications of Decision TreeWith the basic idea out of the way, let’s look at where decision trees can be used:Select a flight to travel: Decision trees are very good at classification and hence can be used to select which flight would yield the best “bang-for-the-buck”. There are a lot of parameters to consider, such as if the flight is connecting or non-stop, or how reliable is the service record of the given airliner, etc.Selecting alternative products: Often in companies, it is important to determine which product will be more profitable at launch. Given the sales attributes such as market conditions, competition, price, availability of raw materials, demand, etc. a Decision Tree classifier can be used to accurately determine which of the products would maximize the profits.Sentiment Analysis: Sentiment Analysis is the determination of the overall opinion of a given piece of text and is especially used to determine if the writer’s comment towards a given product/service is positive, neutral or negative. Decision trees are very versatile classifiers and are used for sentiment analysis in many Natural Language Processing (NLP) applications.Energy Consumption: It is very important for electricity supply boards to correctly predict the amount of energy consumption in the near future for a particular region. This is to make sure that un-used power can be diverted towards an area with a higher demand to keep a regular and uninterrupted supply of power throughout the grid. Decision Trees are often used to determine which region is expected to require more or less power in the up-coming time-frame.Fault Diagnosis: In the Engineering domain, one of the widely used applications of decision trees is the determination of faults. In the case of load-bearing rotatory machines, it is important to determine which of the component(s) have failed and which ones can directly or indirectly be affected by the failure. This is determined by a set of measurements that are taken. Unfortunately, there are numerous measurements to take and among them, there are some measurements which are not relevant to the detection of the fault. A Decision Tree classifier can be used to quickly determine which of these measurements are relevant in the determination of the fault.Advantages of Decision TreeListed below are some of the advantages of Decision Trees:Comprehensive: Another significant advantage of a decision tree is that it forces the algorithm to take into consideration all the possible outcomes of a decision and traces each path to a conclusion.Specific: The output of decision trees is very specific and reduces uncertainty in the prediction. Hence, they are considered as really good classifiers.Easy to use: Decision Trees are one of the simplest, yet most versatile algorithms in Machine Learning. It is based on simple math and no complex formulas. They are easy to visualize, understand and explain.Versatile: A lot of business problems can be solved using Decision Trees. They find their applications in the field of Engineering, Management, Medicine, etc. basically, any situation where data is available and a decision needs to be taken in uncertain conditions.Resistant to data abnormalities: Data is never perfect and there are always many abnormalities in the dataset. Some of the most common abnormalities are outliers, missing data and noise. 
While most Machine Learning algorithms fail with even a minor set of abnormalities, Decision Trees are very resilient and is able to handle a fair percentage of such abnormalities quite well without altering the results.Visualization of the decision taken: Often in Machine Learning models, data scientists struggle to reason as to why a certain model is giving a certain set of outputs. Unfortunately, for most of the algorithms, it is not possible to clearly determine and visualize the actual process of classification that leads to the final outcome. However, decision trees are very easy to visualize. Once the tree is trained, it can be visualized and the programmer can see exactly how and why the conclusion was reached. It is also easy to explain the outcome to a non-technical team with the “tree” type visualization. This is why many organizations prefer to use decision trees over other Machine Learning Algorithms.Limitations of Decision TreeListed below are some of the limitations of Decision Trees:Sensitivity to hyperparameter tuning: Decision Trees are very sensitive to hyperparameter tuning. Hyperparameters are those parameters which are in control of the programmer and can be tuned to get better performance out of a given model. Unfortunately, the output of a decision tree can vary drastically if the hyperparameters are inaccurately tuned.Overfitting: Decision trees are prone to overfitting. Overfitting is a concept where the model learns the data too well and hence performs well on training dataset but fails to perform on testing dataset. Decision trees are prone to overfitting if the breadth and depth of the tree is set to very high for a simpler dataset.Underfitting: Similar to overfitting, decision trees are also prone to underfitting. Underfitting is a concept where the model is too simple for it to learn the dataset effectively. Decision tree suffers from underfitting if the breadth and depth of the model or the number of nodes are set too low. This does not allow the model to fit the data properly and hence fails to learn.Code ExamplesWith the theory out of the way, let’s look at the practical implementation of decision tree classifiers and regressors.1. ClassificationIn order to conduct classification, a diabetes dataset from Kaggle has been used. It can be downloaded here.The initial step for any data science application is data visualization. Hence, the dataset is shown below:The highlighted column is the target value that the model is expected to predict, given the parameters.Load the Libraries. We will be using pandas to load and manipulate data. Sklearn is used for applying Machine Learning models on the data.# Load libraries import pandas as pd from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier from sklearn.model_selection import train_test_split # Import train_test_split function from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation Load the data. 
Pandas is used to read the data from the CSV.

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)

Feature Selection: The relevant features are selected for the classification.

#split dataset in features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = pima[feature_cols]   # Features
y = pima.label           # Target variable

Splitting the data: The dataset needs to be split into training and testing data. The training data is used to train the model, while the testing data is used to test the model’s performance on unseen data.

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)   # 70% training and 30% test

Building the decision tree: these few lines initialize, train and predict on the dataset.

# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree Classifier
clf = clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

The model’s accuracy is evaluated by using Sklearn’s metrics library.

# Model Accuracy, how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Output: Accuracy: 0.6753246753246753

This will generate the decision tree that is shown in the following image.

2. Regression

For this example, we will generate a Numpy array which simulates a scatter plot resembling a sine wave with a few randomly added noise elements.

# Import the necessary modules and libraries
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

This time we create two regression models to experiment and see what overfitting looks like for a decision tree. Hence, we initialize two Decision Tree Regression objects and train them on the given data.

# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(X, y)
regr_2.fit(X, y)

After fitting the models, we predict on a custom test dataset and plot the results to see how they performed.

# Predict
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)

# Plot the results
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

The graph that is thus generated is shown below. Here we can clearly see that for this simple dataset, when we used max_depth=5 (green), the model started to overfit and learned the patterns of the noise along with the sine wave. Such models do not perform well. Meanwhile, with max_depth=2 (blue), the model fits the dataset in a better way compared to the other one. A sketch of how the depth could be chosen more systematically is given below.
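As a follow-up that is not part of the original example, the depth could be picked by cross-validation rather than by eye; the sketch below reuses the generated X and y and an arbitrary grid of candidate depths.

# A minimal sketch of choosing max_depth by cross-validation on the generated data
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {'max_depth': [2, 3, 4, 5, 6, 8]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
)
search.fit(X, y)

print('Best max_depth :', search.best_params_['max_depth'])
print('CV MSE         :', -search.best_score_)

Limiting the depth (or tuning min_samples_leaf in the same way) is a simple form of pre-pruning that directly addresses the overfitting and underfitting limitations discussed earlier.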
Conclusion

In this article, we tried to build an intuition by starting from the basics of the theory behind the working of a decision tree classifier. However, covering every aspect in detail is beyond the scope of this article, so it is suggested that you go through this book to dive deeper into the specifics. Further, the code snippets introduce the “Hello World” of how to use both real-world data and artificially generated data to train a Decision Tree model and predict with it. This should allow any novice to get an overall balanced theoretical and practical idea about the workings of Classification and Regression Trees and their implementation.

What is Gradient Descent For Machine Learning

In our day-to-day lives, we are optimizing variables based on our personal decisions and we don't even recognize the process consciously. We are constantly using optimization techniques all day long, for example, while going to work, choosing a shorter route in order to minimize traffic woes, figuring out and managing a quick walk around the campus during a snack break, or scheduling a cab in advance to reach the airport on time. Optimization is the ultimate goal, whether you are dealing with actual events in real life or creating a technology-based product. Optimization is at the heart of most of the statistical and machine learning techniques which are widely used in data science.

Optimization for Machine Learning

Accuracy is the word with which we are most concerned while dealing with problems related to machine learning and artificial intelligence. Errors cannot be tolerated while dealing with real-world problems, and neither should accuracy be compromised.

Let us consider the case of self-driving cars. The model fitted in the car detects any obstacles that come in the way and takes appropriate action, such as slowing down or pulling on the brakes. Keep in mind that there is no human in the car to operate or withdraw the actions taken by the self-driving car. In such a scenario, suppose the model is not accurate. It will not be able to detect other cars or pedestrians and may end up crashing, putting several lives at risk. This is where we need optimization algorithms to evaluate our model and judge whether it is performing according to our needs or not. The evaluation is made through the cost function (which we will look into in detail later in this article). It is basically a mapping function that tells us about the difference between the desired output and what our model is computing. We can accordingly correct the model and avoid any kind of undesired behaviour.

Optimization may be defined as the process by which an optimum is achieved. It is all about designing an optimal output for your problem with the use of the resources available. However, optimization in machine learning is slightly different. In most cases, we are aware of the data, its shape and size, which also helps us know the areas we need to improve. But in machine learning we do not know what the new data may look like; this is where optimization fits in. Optimization techniques are performed on the training data, and the validation data set is then used to check the performance.

There are a lot of advanced applications of optimization which are widely used in airway routing, market basket analysis, face recognition and so on. Machine learning algorithms such as linear regression, KNN and neural networks depend heavily on optimization techniques. Here, we are going to look into one such popular optimization technique called Gradient Descent.

What is Gradient Descent?

Gradient descent is an optimization algorithm which is mainly used to find the minimum of a function. In machine learning, gradient descent is used to update the parameters of a model. The parameters vary according to the algorithm, such as the coefficients in Linear Regression and the weights in Neural Networks.

Let us relate gradient descent to a real-life analogy for better understanding. Think of a valley you would like to descend while blindfolded. Any sane human will take a step and check the slope of the valley, whether it goes up or down. Once you are sure of the downward slope, you will follow it and repeat the step again and again until you have descended completely (or reached the minimum).

Similarly, consider another analogy. Suppose you have a ball and you place it on an inclined plane (at position A). As per the laws of motion, it will start rolling until it reaches a gentle plane where it becomes stationary (at position B, as shown in the figure below). This is exactly what happens in gradient descent. The inclined and irregular surface is the plotted cost function, and the role of gradient descent is to provide the direction and the velocity (learning rate) of the movement in order to attain the minimum of the function, i.e. the point where the cost is minimum.

How does Gradient Descent work?

The primary goal of machine learning algorithms is always to build a model, which is basically a hypothesis that can be used to estimate Y based on X. Let us consider an example of a model based on certain housing data which comprises the sale price of the house, the size of the house, and so on. Suppose we want to predict the price of a house based on its size. This is clearly a regression problem: given some inputs, we would like to predict a continuous output.

The hypothesis is usually presented as

h(x) = θ₀ + θ₁x

where the theta values are the parameters.

Let us look at some examples and visualize the hypothesis. Taking θ₀ = 1.5 and θ₁ = 0 yields h(x) = 1.5 + 0x. 0x means no slope, and y will always be the constant 1.5. Now let us consider θ₀ = 1 and θ₁ = 0.5, where h(x) = 1 + 0.5x.

Cost Function

The objective in the case of gradient descent is to find a line of best fit for some given inputs, or X values, and any number of Y values, or outputs. A cost function is defined as "a function that maps an event or values of one or more variables onto a real number intuitively representing some 'cost' associated with the event." With a known set of inputs and their corresponding outputs, a machine learning model attempts to make predictions for a new set of inputs.

Machine Learning Process

The error is the difference between the actual output and the model's prediction. This relates to the idea of a Cost function or Loss function. A Cost Function/Loss Function tells us "how good" our model is at making predictions for a given set of parameters. The cost function has a curve and a gradient; the slope of this curve helps us update our parameters and build an accurate model.

Minimizing the Cost Function

It is always the primary goal of any machine learning algorithm to minimize the cost function. Minimizing the cost function also results in a lower error between the predicted values and the actual values, which denotes that the algorithm has learned well. So how do we actually minimize a function?

Generally, the cost function has a form like Y = X². In a Cartesian coordinate system, this is the equation of a parabola, which can be represented graphically as:

Parabola

To minimize the function above, we need to find the value of X which produces the lowest value of Y (in this case, the red dot). In lower dimensions (like 2D in this case) it is easy to locate the minimum visually, but this is not the case in higher dimensions. For such cases, we need to use the Gradient Descent algorithm to locate the minimum; a tiny one-dimensional sketch of the idea is shown below. Beyond that, what we need is a function that measures, over a whole dataset, how well a given set of parameters fits, so that we have something concrete to minimize.
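As a rough illustration of the idea (plain Python; the starting point, learning rate and number of steps are arbitrary choices), gradient descent can walk down the parabola Y = X² by repeatedly stepping against its slope dY/dX = 2X:

# walk down Y = X**2 by repeatedly stepping against the slope dY/dX = 2*X
x = 3.0                  # arbitrary starting point
alpha = 0.1              # learning rate (step size)
for _ in range(50):
    x = x - alpha * (2 * x)
print(x)                 # very close to 0, the minimum of the parabola

Each step moves X a little in the direction that lowers Y; the same principle is applied below to the model parameters, only with the cost function J in place of the parabola.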
The most common function used for this is the mean squared error. It measures the average squared difference between the estimated values (the predictions) and the actual values from the dataset.

Mean Squared Error

It turns out we can adjust the equation a little to make the calculations further down the track a little simpler. A question may arise: why do we take the squared differences and not simply the absolute differences? Because the squared differences make it easier to derive a regression line. Indeed, to find that line we need to compute the first derivative of the cost function, and it is much harder to differentiate absolute values than squared values. The squared differences also amplify large errors, making the bad predictions more pronounced than the good ones.

The equation looks like:

J(θ) = (1/2m) * Σ ( h(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²,  summed over i = 1 … m

Mean Squared Error

Let us apply this cost function to the following data (the same values used in the code below): X = [1, 2, 3] and y = [1, 2, 3]. Here we will calculate the cost for a few values of theta and plot the cost function by hand. Since this data passes through (0, 0), we drop the bias term and look at a single parameter theta, the slope. From now on, let us refer to the cost function as J(ϴ).

When the value of ϴ is 1, J(1) = 0. Notice that ϴ = 1 gives a straight line which fits the data perfectly. Now let us try ϴ = 0.5.

J(0.5)

The MSE function gives us a value of 0.58. Let's plot both our values so far:

J(1) = 0
J(0.5) = 0.58

With J(1) and J(0.5)

Let us go ahead and calculate some more values of J(ϴ). Now if we join the dots carefully, we will get:

Visualizing the cost function J(ϴ)

As we can see, the cost function is at a minimum when theta = 1, which means the initial data is a straight line with a slope or gradient of 1, as shown by the orange line in the above figure. Using a trial-and-error method, we minimized J(ϴ): we tried out a lot of values and used visualizations to pick the best one. Gradient Descent does the same thing in a much better way, by changing the theta values, or parameters, until it descends to the minimum value.

You may refer to the Python code below to compute the cost function:

# original data set
X = [1, 2, 3]
y = [1, 2, 3]

# candidate slopes (theta values) for the hypothesis h(x) = theta * x
# slope of best_fit_1 is 0.5
# slope of best_fit_2 is 1.0
# slope of best_fit_3 is 1.5
hyps = [0.5, 1.0, 1.5]

# multiply the original X values by the theta
# to produce hypothesis values for each X
def multiply_matrix(mat, theta):
    mutated = []
    for i in range(len(mat)):
        mutated.append(mat[i] * theta)
    return mutated

# calculate cost by looping over each sample:
# subtract hyp(x) from y, square the result, sum them all together
def calc_cost(m, y, hyp):
    total = 0
    for i in range(m):
        squared_error = (y[i] - hyp[i]) ** 2
        total += squared_error
    return total * (1 / (2 * m))

# calculate cost for each hypothesis
for i in range(len(hyps)):
    hyp_values = multiply_matrix(X, hyps[i])
    print("Cost for ", hyps[i], " is ", calc_cost(len(X), y, hyp_values))

Cost for 0.5 is 0.5833333333333333
Cost for 1.0 is 0.0
Cost for 1.5 is 0.5833333333333333

Learning Rate

Let us now start by initializing theta0 and theta1 to any two values, say 0 for both, and go from there. The algorithm is as follows:

repeat until convergence:  θj := θj − α * ∂J(θ0, θ1)/∂θj   (for j = 0 and j = 1)

Gradient Descent

where α (alpha) is the learning rate, i.e. how rapidly we want to move towards the minimum. If the value of α is too large, we can overshoot the minimum. The derivative, which refers to the slope of the function, is calculated; here we calculate the partial derivative of the cost function, and a small sketch of this computation and of one update step follows below.
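To make this concrete, here is a minimal sketch of the two partial derivatives and a single parameter update for simple linear regression (plain Python; the variable names, starting values and learning rate are illustrative, not taken from the article's own listing):

# One gradient descent update for h(x) = b0 + b1 * x with cost J = 1/(2m) * sum((h - y)^2)
X = [1, 2, 3]
y = [1, 2, 3]
m = len(X)

b0, b1 = 0.0, 0.0      # initial coefficients
alpha = 0.1            # learning rate

# partial derivatives of J with respect to b0 and b1
errors = [(b0 + b1 * X[i]) - y[i] for i in range(m)]
d_b0 = sum(errors) / m
d_b1 = sum(errors[i] * X[i] for i in range(m)) / m

# move both coefficients against the gradient
b0 = b0 - alpha * d_b0
b1 = b1 - alpha * d_b1
print(b0, b1)          # after one step: b0 = 0.2, b1 ≈ 0.467

Repeating this pair of updates many times is all that "training by gradient descent" means for this model.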
This derivative tells us the direction (sign) in which the coefficient values should move so that they attain a lower cost on the following iteration.

Partial Derivative of the Cost Function which we need to calculate

Once we know the direction from the derivative, we can update the coefficient values. We also need to specify a learning rate parameter which controls how much the coefficients can change on each update:

coefficient = coefficient − (alpha * delta)

This process is repeated until the cost of the coefficients is 0.0, or close enough to zero. Working out the partial derivatives, the updates turn out to be:

θ0 := θ0 − α * (1/m) * Σ ( h(x⁽ⁱ⁾) − y⁽ⁱ⁾ )
θ1 := θ1 − α * (1/m) * Σ ( h(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) * x⁽ⁱ⁾

Image from Andrew Ng's machine learning course

Which gives us linear regression!

Linear Regression

Types of Gradient Descent Algorithms

Gradient descent variants' trajectory towards the minimum

1. Batch Gradient Descent: In this type of gradient descent, all the training examples are processed for every single update. This gets computationally expensive if the number of training examples is large, which is when batch gradient descent is not preferred and stochastic gradient descent or mini-batch gradient descent is used instead.

Algorithm for batch gradient descent: Let hθ(x) be the hypothesis for linear regression and let Σ represent the sum over all training examples from i = 1 to m. Then the cost function is given by:

J(θ) = (1/2m) * Σ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²

Repeat {
    θj := θj − α * (1/m) * Σ ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) * xj⁽ⁱ⁾      for every j = 0 … n
}

where xj⁽ⁱ⁾ represents the jth feature of the ith training example. So if m is very large, computing this sum for every single update becomes very costly, and progress towards the global minimum is slow.

2. Stochastic Gradient Descent: The word stochastic relates to a system or process that involves random probability. In Stochastic Gradient Descent (SGD), samples are selected at random for each iteration instead of using the entire data set. When the number of training examples is too large, it becomes computationally expensive to use batch gradient descent; Stochastic Gradient Descent instead uses only a single sample, i.e. a batch size of one, to perform each update. The data is randomly shuffled, examples are picked one at a time, and the parameters are updated after each individual example has been processed. This makes it faster than batch gradient descent.

Algorithm for stochastic gradient descent: First shuffle the data set randomly in order to train the parameters evenly for each type of data. As mentioned above, one example is considered per iteration. Let (x⁽ⁱ⁾, y⁽ⁱ⁾) be a training example. Then:

Repeat {
    For i = 1 to m {
        θj := θj − α * ( hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) * xj⁽ⁱ⁾      for every j = 0 … n
    }
}

3. Mini-Batch Gradient Descent: This type of gradient descent is often faster than both batch gradient descent and stochastic gradient descent, because even when the number of training examples is large, it processes them in small batches in one go. A short sketch contrasting the batch and stochastic update schedules is given below.
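The sketch below (plain Python; the toy data, learning rate and number of passes are arbitrary) contrasts the two update schedules described above: one update per full pass over the data for batch gradient descent, and one update per training example for stochastic gradient descent:

import random

X = [1, 2, 3]
y = [1, 2, 3]
m = len(X)
alpha = 0.1

# Batch gradient descent: one update per pass over the whole dataset
b0, b1 = 0.0, 0.0
for epoch in range(1000):
    errors = [(b0 + b1 * X[i]) - y[i] for i in range(m)]
    b0 -= alpha * sum(errors) / m
    b1 -= alpha * sum(errors[i] * X[i] for i in range(m)) / m
print("batch:      b0 = %.3f, b1 = %.3f" % (b0, b1))

# Stochastic gradient descent: shuffle, then one update per training example
b0, b1 = 0.0, 0.0
for epoch in range(1000):
    order = list(range(m))
    random.shuffle(order)
    for i in order:
        error = (b0 + b1 * X[i]) - y[i]
        b0 -= alpha * error
        b1 -= alpha * error * X[i]
print("stochastic: b0 = %.3f, b1 = %.3f" % (b0, b1))
# both settle on b0 ≈ 0 and b1 ≈ 1, the line that fits this toy data exactly

On this tiny, perfectly linear dataset both schedules settle on the same line; the difference shows up on large datasets, where the stochastic and mini-batch variants make useful progress long before a full pass over the data has been completed.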
Mini-batch gradient descent also needs fewer iterations to converge, in spite of working with larger training samples.

Algorithm for mini-batch gradient descent: Let b be the number of examples in one batch, where b < m. The training set is shuffled, split into batches of b examples each, and the parameters are updated once per batch:

Repeat {
    For each batch of b examples {
        θj := θj − α * (1/b) * Σ over the batch of ( hθ(x⁽ᵏ⁾) − y⁽ᵏ⁾ ) * xj⁽ᵏ⁾      for every j = 0 … n
    }
}

To see gradient descent in action, let us implement it in Python for a simple one-variable function. For illustration, assume the function being minimized is f(x) = x² − 4x + 5, whose derivative is f′(x) = 2x − 4 (the article's own function definitions were not preserved, so this particular choice, and the x-range used for plotting, are assumptions, marked as such in the comments):

import numpy as np
import matplotlib.pyplot as plt

# function to minimize and its derivative (illustrative assumptions, see above)
def function(x):
    return x ** 2 - 4 * x + 5

def deriv(x):
    return 2 * x - 4

# x-range used only for plotting the curve (also an assumed choice)
x = np.linspace(0, 4, 200)

def step(x_new, x_prev, precision, l_r):
    # keep track of the path taken, for later visualization
    x_list = [x_new]
    y_list = [function(x_new)]

    # keep stepping while the change in x is larger than the required precision
    while abs(x_new - x_prev) > precision:
        # change the value of x
        x_prev = x_new

        # get the derivative at the old value of x
        d_x = - deriv(x_prev)

        # get the new value of x by adding the product of the derivative and the learning rate
        x_new = x_prev + (l_r * d_x)

        # append the new values of x and y for later visualization of the path
        x_list.append(x_new)
        y_list.append(function(x_new))

    print("Local minimum occurs at: " + str(x_new))
    print("Number of steps: " + str(len(x_list)))

    plt.subplot(1, 2, 2)
    plt.scatter(x_list, y_list, c="g")
    plt.plot(x_list, y_list, c="g")
    plt.plot(x, function(x), c="r")
    plt.title("Gradient descent")
    plt.show()

    plt.subplot(1, 2, 1)
    plt.scatter(x_list, y_list, c="g")
    plt.plot(x_list, y_list, c="g")
    plt.plot(x, function(x), c="r")
    plt.xlim([1.0, 2.1])
    plt.title("Zoomed in Gradient descent to Key Area")
    plt.show()

# Implement gradient descent (all the arguments are arbitrarily chosen)
step(0.5, 0, 0.001, 0.05)

The output reported in the article was as follows (with the assumed function above, the exact figures will differ slightly):

Local minimum occurs at: 1.9980265135950486
Number of steps: 25

Summary

In this article, you have learned about gradient descent for machine learning. Let us summarize what we have covered:

Optimization is the heart and soul of machine learning.
Gradient descent is a simple optimization technique which can be used with other machine learning algorithms.
Batch gradient descent refers to calculating the derivative from all training data before calculating an update.
Stochastic gradient descent refers to calculating the derivative from each training data instance and calculating the update immediately.

Overfitting and Underfitting With Algorithms

Curve fitting is the process of determining the best-fit mathematical function for a given set of data points. It examines the relationship between one or more independent variables (predictors) and a dependent variable (response) in order to determine the "best fit" line.

In the figure shown, the red line represents the curve that best fits the given purple data points. It can also be seen that curve fitting does not necessarily mean that the curve should pass through each and every data point; instead, it is the most appropriate curve that represents all the data points adequately.

Curve Fitting vs. Machine Learning

As discussed, curve fitting refers to finding the "best fit" curve or line for a given set of data points. Even though this is also part of what Machine Learning or Data Science does, the scope of Machine Learning and Data Science goes far beyond curve fitting. The major difference is that during curve fitting the entire data is available to the developer, whereas in Machine Learning the data available to the developer is only a part of the real-world data on which the fitted model will be applied.

Even then, Machine Learning is a vast interdisciplinary field and it consists of a lot more than just "curve fitting". Machine Learning can be broadly classified into Supervised, Unsupervised and Reinforcement Learning. Considering that most real-world problems are solved with Supervised Learning, this article concentrates on Supervised Learning. Supervised learning can be further classified into Classification and Regression, and it is Regression that is closest to what curve fitting achieves. To get a broader idea, let's look at the difference between Classification and Regression:

Classification: the process of separating/classifying two or more types of data into separate categories or classes based on their characteristics. The output values are discrete in nature (e.g. 0, 1, 2, 3, etc.) and are known as "classes".
Regression: the process of determining the "best fit" curve for the given data such that, on unseen data, the data points lying on the curve accurately represent the desired result. The output values are continuous in nature (e.g. 0.1, 1.78, 9.54, etc.).

Here, the two classes (red and blue colored points) are clearly separated by the line(s) in the middle. This is an example of classification.
Here, the curve represented by the magenta line is the "best fit" line for all the data points as shown. This is an example of regression.

Noise in Data

The data that is obtained from the real world is not ideal or noise-free. It contains a lot of noise, which needs to be filtered out before applying Machine Learning algorithms. As shown in the above image, the few extra data points at the top of the left graph represent unnecessary noise, known in technical terms as "outliers". As the difference between the left and the right graphs shows, the presence of outliers makes a considerable difference in the determination of the "best fit" line. Hence, it is of immense importance to apply preprocessing techniques in order to remove outliers from the data.

Let us look at two of the most common types of noise in data:

Outliers: As already discussed, outliers are data points which do not belong to the original distribution of the data. These data points are either too high or too low in value, such that they do not belong to the general distribution of the rest of the dataset; a short detection sketch is given below.
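One simple, widely used statistical rule for flagging such points is the interquartile-range (IQR) rule; the sketch below (NumPy assumed, toy numbers) marks as outliers any values that fall far outside the middle 50% of the data:

import numpy as np

# toy 1-D feature with two obvious outliers at the end
values = np.array([12, 14, 13, 15, 16, 14, 13, 15, 95, -40])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1                       # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
cleaned  = values[(values >= lower) & (values <= upper)]
print("outliers:", outliers)        # [ 95 -40]
print("cleaned :", cleaned)

More involved methods exist, but the idea is the same: quantify how far a point sits from the bulk of the data and drop or cap the extreme cases.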
Outliers usually arise from misrepresentation or an accidental entry of wrong data, and there are several statistical algorithms that are used to detect and remove them.

Missing Data: In sharp contrast to outliers, missing data is another major challenge when it comes to a dataset. The occurrence is quite common in tabular datasets (e.g. CSV files) and becomes a serious problem if the number of missing data points exceeds about 10% of the total size of the dataset. Most Machine Learning algorithms fail to perform on such datasets, although certain algorithms, such as Decision Trees, are quite resilient to missing values and are able to provide accurate results even when supplied with such noisy data. Similar to outliers, there are statistical methods to handle missing data or "NaN" (Not a Number) values; the most common of them is to remove or "drop" the rows containing missing data.

Training of Data

"Training" is terminology associated with Machine Learning, and it basically means the "fitting" of data, or "learning" from data. This is the step where the model starts to learn from the given data in order to be able to predict on similar but unseen data. This step is crucial, since the final output (or prediction) of the model depends on how well the model was able to acquire the patterns of the training data.

Training in Machine Learning: Depending on the type of data, the training methodology varies; here we assume simple tabular (e.g. CSV) text data. Before the model can be fitted on the data, a few steps have to be followed:

Data Cleaning/Preprocessing: The raw data obtained from the real world is likely to contain a good amount of noise. In addition, the data might not be homogeneous, which means the values of different "features" might belong to different ranges. Hence, after the removal of noise, the data needs to be normalized or scaled in order to make it homogeneous.

Feature Engineering: In a tabular dataset, all the columns that describe the data are called "features". These features are necessary to correctly predict the target value. However, data often contains columns which are irrelevant to the output of the model; these columns need to be removed or statistically processed to make sure that they do not interfere with the training of the model on the features that are relevant. In addition to the removal of irrelevant features, it is often necessary to create new, relevant features from the existing ones. This allows the model to learn better, and the process is also called "Feature Extraction".

Train, Validation and Test Split: After the data has been preprocessed and is ready for training, it is split into Training Data, Validation Data and Testing Data, usually in the ratio of 60:20:20. This ratio varies depending on the availability of data and on the application. The split ensures that the model does not unnecessarily "overfit" or "underfit", and performs equally well when deployed in the real world.

Training: Finally, as the last step, the Training Data is fed into the model to train upon. Multiple models can be trained simultaneously and their performance can be measured against each other with the help of the Validation Set, based on which the best model is selected. This is called "Model Selection", and a minimal sketch of this step is shown below.
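A minimal sketch of the split-and-select workflow described above (NumPy assumed; the synthetic data and the candidate polynomial degrees are purely illustrative) could look like this:

import numpy as np

rng = np.random.default_rng(0)

# synthetic data: a noisy linear trend
x = np.linspace(0, 1, 100)
y = 3.0 * x + 1.0 + rng.normal(0, 0.2, x.size)

# 60/20/20 split after shuffling
idx = rng.permutation(x.size)
train, val, test = idx[:60], idx[60:80], idx[80:]

def mse(a, b):
    return np.mean((a - b) ** 2)

# candidate models: polynomials of increasing degree
best_degree, best_val_loss = None, np.inf
for degree in (1, 4, 8):
    coeffs = np.polyfit(x[train], y[train], degree)
    val_loss = mse(np.polyval(coeffs, x[val]), y[val])
    print("degree", degree, "validation MSE %.4f" % val_loss)
    if val_loss < best_val_loss:
        best_degree, best_val_loss = degree, val_loss

# only the selected model is evaluated once on the held-out test set
coeffs = np.polyfit(x[train], y[train], best_degree)
print("selected degree:", best_degree,
      "test MSE %.4f" % mse(np.polyval(coeffs, x[test]), y[test]))

Whichever candidate achieves the lowest validation error is kept, and only that model is evaluated on the held-out test portion.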
Finally, the selected model is used to predict on the Test Set to get a final test score, which more or less accurately reflects the performance of the model on the given dataset.

Training in Deep Learning: Deep Learning is a part of Machine Learning, but instead of relying mainly on statistical methods, Deep Learning techniques largely depend on calculus and aim to mimic the neural structure of the biological brain; hence, they are often referred to as Neural Networks. The training process for Deep Learning is quite similar to that of Machine Learning, except that there is no need for "Feature Engineering". Since deep learning models rely largely on weights to specify the importance of a given input (feature), the model automatically tends to learn which features are relevant and which are not: it assigns a "high" weight to relevant features and a "low" weight to irrelevant ones. This removes the need for a separate feature engineering step. This difference is portrayed in the following figure.

Improper Training of Data: As discussed above, the training of data is the most crucial step of any Machine Learning algorithm. Improper training can lead to drastic performance degradation of the model on deployment. On a high level, there are two main outcomes of improper training: Underfitting and Overfitting.

Underfitting

When the complexity of the model is too low for it to learn the data that is given as input, the model is said to "underfit". In other words, the excessively simple model fails to "learn" the intricate patterns and underlying trends of the given dataset. Underfitting occurs for a model with low variance and high bias.

Underfitting data visualization: With the initial idea out of the way, it is important to visualize what an underfitting model looks like, as this helps in determining whether the model is underfitting the given data during training. As already discussed, supervised learning is of two types: Classification and Regression. The following graphs show underfitting for both of these cases:

Classification: As shown in the figure below, the model is trained to classify between the circles and crosses. However, it is unable to do so properly due to the straight line, which fails to properly classify either of the two classes.

Regression: As shown in the figure below, the data points are laid out in a given pattern, but the model is unable to "fit" properly to the given data due to low model complexity.

Detection of an underfitting model: The model may underfit the data, but it is necessary to know when it does so. The following checks are used to determine whether the model is underfitting:

Training and Validation Loss: During training and validation, it is important to check the loss generated by the model. If the model is underfitting, the loss for both training and validation will be significantly high. In terms of Deep Learning, the loss will not decrease at the expected rate if the model has reached saturation or is underfitting.

Overly simplistic prediction graph: If a graph is plotted showing the data points and the fitted curve, and the curve is overly simplistic (as shown in the image above), then the model is suffering from underfitting; a more complex model should be tried.

Classification: A lot of classes will be misclassified in the training set as well as the validation set.
On visualizing the data, the graph would indicate that a more complex model could have classified more classes correctly.

Regression: The final "best fit" line will fail to fit the data points in an effective manner. On visualization, it would be clear that a more complex curve could fit the data better.

Fixes for an underfitting model: If the model is underfitting, the developer can take the following steps to recover from the underfitting state:

Train longer: Since underfitting means low model complexity, training longer can help in learning more complex patterns. This is especially true in terms of Deep Learning.

Train a more complex model: The main reason behind underfitting is using a model of lower complexity than the data requires. Hence, the most obvious fix is to use a more complex model; in terms of Deep Learning, a deeper network can be used.

Obtain more features: If the dataset lacks enough features to support a clear inference, then Feature Engineering or collecting more features will help fit the data better.

Decrease regularization: Regularization is the process that helps generalize the model by avoiding overfitting. However, if the model is learning too little, or underfitting, then it is better to decrease or completely remove regularization so that the model can learn better.

New model architecture: Finally, if none of the above approaches work, then a new model architecture can be tried, which may provide better results.

Overfitting

When the complexity of the model is too high compared to the data it is trying to learn from, the model is said to "overfit". In other words, with increasing model complexity, the model tends to fit the noise present in the data (e.g. outliers). The model learns the data too well and hence fails to generalize. Overfitting occurs for a model with high variance and low bias.

Overfitting data visualization: With the initial idea out of the way, it is important to visualize what an overfitting model looks like. Similar to underfitting, overfitting can be shown for both forms of supervised learning: Classification and Regression. The following graphs show overfitting for both of these cases:

Classification: As shown in the figure below, the model is trained to classify between the circles and crosses, and unlike last time, this time the model learns too well. It even tends to classify the noise in the data by creating an excessively complex model (right).

Regression: As shown in the figure below, the data points are laid out in a given pattern, and instead of determining the least complex model that fits the data properly, the model on the right has fitted the data points too well when compared to the appropriate fit (left).

Detection of an overfitting model: The checks used to determine whether a model is overfitting are similar to those for underfitting:

Training and Validation Loss: As already mentioned, it is important to measure the loss of the model during training and validation. A very low training loss but a high validation loss signifies that the model is overfitting; a small sketch of this check is shown below.
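The check can be made mechanical; the sketch below (plain Python, hypothetical loss values) classifies the situation from the last recorded losses and also spots the point at which the validation loss stopped improving:

# Hypothetical per-epoch losses recorded during training (illustrative numbers only)
train_loss = [2.10, 1.40, 0.90, 0.55, 0.30, 0.18, 0.10, 0.06, 0.04, 0.03]
val_loss   = [2.20, 1.55, 1.10, 0.85, 0.80, 0.82, 0.88, 0.95, 1.05, 1.15]

def diagnose(train_loss, val_loss, high=1.0, gap=0.5):
    t, v = train_loss[-1], val_loss[-1]
    if t > high and v > high:
        return "both losses are high -> likely underfitting"
    if t < high and (v - t) > gap:
        return "low training loss but high validation loss -> likely overfitting"
    return "losses are low and close together -> model looks reasonable"

print(diagnose(train_loss, val_loss))
# -> low training loss but high validation loss -> likely overfitting

# a simple early-stopping style check: where did the validation loss stop improving?
best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i])
print("validation loss was lowest at epoch", best_epoch,
      "- training beyond this point only widened the gap")

In practice the same comparison is usually made by plotting the two loss curves, but the rule of thumb is exactly the one encoded above.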
Additionally, in Deep Learning, if the training loss keeps on decreasing but the validation loss remains stagnant or starts to increase, this also signifies that the model is overfitting.

Too complex prediction graph: If a graph is plotted showing the data points and the fitted curve, and the curve is too complex to be the simplest solution that fits the data points appropriately, then the model is overfitting.

Classification: If every single class is properly classified on the training set by forming a very complex decision boundary, then there is a good chance that the model is overfitting.

Regression: If the final "best fit" line crosses over every single data point by forming an unnecessarily complex curve, then the model is likely overfitting.

Fixes for an overfitting model: If the model is overfitting, the developer can take the following steps to recover from the overfitting state:

Early stopping during training: This is especially prevalent in Deep Learning. Allowing the model to train for a large number of epochs (iterations) may lead to overfitting; hence it is necessary to stop training once the model has started to overfit. This is done by monitoring the validation loss and stopping the model when the loss stops decreasing over a given number of epochs (or iterations).

Train with more data: Often, the data available for training is small compared to the model complexity. Hence, in order to get the model to fit appropriately, it is often advisable to increase the size of the training dataset.

Train a less complex model: As mentioned earlier, the main reason behind overfitting is excessive model complexity for a relatively simple dataset. Hence it is advisable to reduce the model complexity in order to avoid overfitting. For Deep Learning, the model complexity can be reduced by reducing the number of layers and neurons.

Remove features: In contrast to the steps for avoiding underfitting, if the number of features is too large, then the model tends to overfit. Hence, reducing the number of unnecessary or irrelevant features often leads to a better and more generalized model. Deep Learning models are usually not affected by this.

Regularization: Regularization is the process of artificially simplifying the model without losing the flexibility that it gains from having a higher complexity. With an increase in regularization, the effective model complexity decreases, which prevents overfitting.

Ensembling: Ensembling is a Machine Learning method which is used to combine the predictions from multiple separate models. It reduces the model complexity and reduces the errors of each model by combining the strengths of multiple models. Among the many ensembling methods, two of the most commonly used are Bagging and Boosting.

Generalization

The term "Generalization" in Machine Learning refers to the ability of a model to train on a given dataset and then predict with respectable accuracy on similar but completely new or unseen data. Model generalization can also be considered as the prevention of overfitting by making sure that the model learns adequately.

Generalization and its effect on an underfitting model: If a model is underfitting a given dataset, then all efforts to generalize that model should be avoided. Generalization should only be the goal if the model has learned the patterns of the dataset properly and needs to generalize on top of that.
Any attempt to generalize an already underfitting model will lead to further underfitting, since generalization tends to reduce model complexity.

Generalization and its effect on an overfitting model: If a model is overfitting, then it is the ideal candidate for applying generalization techniques. This is primarily because an overfitting model has already learned the intricate details and patterns of the dataset; applying generalization techniques to such a model reduces its effective complexity and hence prevents overfitting. In addition, the model will be able to predict more accurately on unseen but similar data.

Generalization techniques: There are no separate generalization techniques as such, but generalization can be said to be achieved when a model performs equally well on both training and validation data. Hence, if we apply the techniques that prevent overfitting (e.g. Regularization, Ensembling, etc.) to a model that has properly acquired the complex patterns, then a successful generalization of some degree can be achieved.

Relationship between Overfitting and Underfitting and the Bias-Variance Tradeoff

Bias-Variance Tradeoff: Bias denotes the simplicity of the model. A high-bias model will have a simpler architecture than a model with lower bias. Complementing bias, variance denotes how complex the model is and how well it can fit data with a high degree of diversity. An ideal model should have low bias and low variance. However, when it comes to practical datasets and models, it is nearly impossible to achieve "zero" bias and variance. The two are complementary to each other: if one decreases beyond a certain limit, the other starts increasing. This is known as the Bias-Variance Tradeoff. Under such circumstances, there is a "sweet spot", as shown in the figure, where both bias and variance are at their optimal values.

Bias-Variance and Generalization: As is clear from the above graph, bias and variance are linked to underfitting and overfitting. A model with high bias is underfitting the given data, and a model with high variance is overfitting it. At the optimal region of the Bias-Variance tradeoff, the model is neither underfitting nor overfitting, and it can therefore be said that the model is most generalized, as under these conditions it is expected to perform equally well on training and validation data. Thus, the graph depicts that the generalization error is minimum at the optimal degree of bias and variance.

Conclusion

To summarize, the learning capability of a model depends on both model complexity and data diversity. Hence, it is necessary to keep a balance between the two, so that the Machine Learning models thus trained can perform equally well when deployed in the real world. In most cases, overfitting and underfitting can be taken care of in order to determine the most appropriate model for the given dataset. However, even though there are certain rule-based steps that can be followed to improve a model, the insight needed to achieve a properly generalized model comes with experience.

What is Bias-Variance Tradeoff in Machine Learning

What is Machine Learning?

Machine Learning is a multidisciplinary field of study which gives computers the ability to solve complex problems which would otherwise be nearly impossible to hand-code by a human being. It is a scientific field of study which involves the use of algorithms and statistics to perform a given task by relying on inference from data instead of explicit instructions.

Machine Learning Process

The process of Machine Learning can be broken down into several parts, most of which revolve around "data". The following steps show the Machine Learning process:

1. Gathering data from various sources: Since Machine Learning is basically inference drawn from data, before any algorithm can be used, data needs to be collected from some source. The data collected can be of any form, viz. video data, image data, audio data, text data, statistical data, etc.

2. Cleaning data to have homogeneity: The data that is collected from various sources does not always come in the desired form. More importantly, data contains various irregularities like missing data and outliers. These irregularities may cause the Machine Learning model(s) to perform poorly. Hence, the removal or processing of irregularities is necessary to promote data homogeneity. This step is also known as data pre-processing.

3. Model building and selecting the right Machine Learning model: After the data has been correctly pre-processed, various Machine Learning algorithms (or models) are applied to the data to train the model to predict on unseen data, as well as to extract various insights from it. After the various models are "trained" on the data, the best-performing model(s) that suit the application and the performance criteria are selected.

4. Getting insights from the model's results: Once the model is selected, further data is used to validate the performance and accuracy of the model and to get insights into how the model performs under various conditions.

5. Data visualization: This is the final step, where the model is used to predict on unseen and real-world data. These predictions are not directly understandable to the user, and hence data visualization, or converting the results into understandable visual graphs, is necessary. At this stage, the model can be deployed to solve real-world problems.

How is Machine Learning different from Curve Fitting?

To get the similarities out of the way: both Machine Learning and Curve Fitting rely on data to infer a model which, ideally, fits the data perfectly. The difference lies in the availability of the data. Curve fitting is carried out with data that is already fully available to the user; hence, there is no question of the model encountering unseen data. In Machine Learning, however, only a part of the data is available to the user at the time of training (fitting) the model, and the model then has to perform equally well on data that it has never encountered before. This is, in other words, the generalization of the model over the given data, such that it is able to predict correctly when it is deployed.

A high-level introduction to Bias and Variance through illustrative and applied examples

Let's initiate the idea of Bias and Variance with a case study. Assume a simple dataset for predicting the price of a house based on its carpet area. Here, the x-axis represents the carpet area of the house, and the y-axis represents the price of the property.
The plotted data (in a 2D graph) is shown in the graph below. The goal is to build a model that predicts the price of the house given the carpet area of the property. This is a rather easy problem to solve and could easily be achieved by fitting a curve to the given data points; but, for the time being, let's concentrate on solving it using Machine Learning.

In order to keep this example simple and concentrate on Bias and Variance, a few assumptions are made:

Adequate data is present in order to come up with a working model capable of making relatively accurate predictions.
The data is homogeneous in nature and hence no major pre-processing steps are involved.
There are no missing values or outliers, and hence they do not interfere with the outcome in any way.
The y-axis data points are independent of the order of the sequence of the x-axis data points.

With the above assumptions, the data is processed to train the model using the following steps:

1. Shuffling the data: Since the y-axis data points are independent of the order of the x-axis data points, the dataset is shuffled in a pseudo-random manner. This is done to avoid unnecessary patterns being learned by the model. During the shuffling, it is imperative to keep each x-y pair together; mixing them up would change the dataset itself and the model would learn inaccurate patterns.

2. Data splitting: The dataset is split into three categories: Training Set (60%), Validation Set (20%), and Testing Set (20%). These three sets are used for different purposes:

Training Set - This part of the dataset is used to train the model. It is also known as the Development Set.
Validation Set - This is separate from the Training Set and is used only for model selection. The model does not train or learn from this part of the dataset.
Testing Set - This part of the dataset is used for performance evaluation and is completely independent of the Training and Validation Sets. Similar to the Validation Set, the model does not train on this part of the dataset.

3. Model selection: Several Machine Learning models are applied to the Training Set and their training and validation losses are determined, which then helps determine the most appropriate model for the given dataset. During this step, we assume that a polynomial equation fits the data correctly. The general equation is given below:

y = a0 + a1x + a2x² + … + anxⁿ

The process of "training" is, mathematically, nothing more than figuring out the appropriate values for the parameters a0, a1, ..., an, which is done automatically by the model using the Training Set. The developer does, however, have control over how high the degree of the polynomial can be. Such parameters that can be tuned by the developer are called hyperparameters, and they play a key role in deciding how well the model will learn and how well the learned parameters will generalize.

Given below are two graphs representing the predictions of the trained models on the training data. The graph on the left represents a linear model with an error of 3.6, and the graph on the right represents a polynomial model with an error of 1.7. By looking at the errors, it can be concluded that the polynomial model performs significantly better than the linear model (the lower the error, the better the performance of the model). However, when we use the same trained models on the Testing Set, they perform very differently; the short sketch below reproduces this effect on synthetic data.
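A compact way to see the same effect (NumPy assumed; the data, noise level and polynomial degrees are illustrative, not the article's actual housing data):

import numpy as np

rng = np.random.default_rng(1)

# synthetic "carpet area vs price" style data: an underlying trend plus noise
x = np.linspace(0, 1, 40)
y = 4.0 * x + 2.0 + rng.normal(0, 0.3, x.size)

# split into a training and a testing portion
idx = rng.permutation(x.size)
train, test = idx[:28], idx[28:]

def mse(a, b):
    return np.mean((a - b) ** 2)

for degree in (1, 9):                        # a simple model vs an overly complex one
    coeffs = np.polyfit(x[train], y[train], degree)
    tr = mse(np.polyval(coeffs, x[train]), y[train])
    te = mse(np.polyval(coeffs, x[test]),  y[test])
    print("degree %2d: train MSE %.3f, test MSE %.3f" % (degree, tr, te))

# the high-degree fit typically shows a lower training error but a higher
# test error than the straight line - the same pattern as the linear vs
# polynomial comparison described above

The exact numbers depend on the random seed, but the pattern is the characteristic one: the flexible model wins on the data it has seen and loses on the data it has not.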
The graph on the left represents the same linear model's predictions on the Testing Set, and the graph on the right represents the polynomial model's predictions on the Testing Set. It is clearly visible that the polynomial model predicts inaccurately compared to the linear model. In terms of error, the total error for the linear model is 3.6, while for the polynomial model it is a whopping 929.12. Such a big difference in errors between the Training and Testing Sets clearly signifies that something is wrong with the polynomial model. This drastic change in error is due to a phenomenon called the Bias-Variance Tradeoff.

What is "Error" in Machine Learning?

Error in Machine Learning is the difference between the expected output and the predicted output of the model. It is a measure of how well the model performs over a given set of data. There are several ways to quantify error in Machine Learning; one of the most commonly used loss/cost functions is the Mean Squared Error (MSE), given by the following equation:

MSE = (1/n) * Σ ( yᵢ − ŷᵢ )²,  summed over i = 1 … n

The necessity of minimizing errors: As is obvious from the previously shown graphs, the higher the error, the worse the model performs. Hence, the error of a model's predictions can be considered a performance measure: the lower the error of a model, the better it performs. In addition to that, a model judges its own performance and trains itself based on the error between its own output and the expected output. The primary target of the model is to minimize the error so as to find the parameters that fit the data best.

Total Error: The error mentioned above is the total error, and it consists of three parts: bias, variance and the irreducible error.

Total Error = Bias + Variance + Irreducible Error

Even for an ideal model, it is impossible to get rid of all types of error. The "irreducible" error is caused by the presence of noise in the data and hence cannot be removed. However, the bias and variance errors can be reduced to a minimum, and hence the total error can be reduced significantly.

Why is the splitting of data important?

Ideally, the complete dataset is not used to train the model. The dataset is split into three sets: Training, Validation and Testing Sets. Each of these serves a specific role in the development of a model which performs well under most conditions.

Training Set (60-80%): The largest portion of the dataset is used for training the Machine Learning model. The model extracts the features and learns to recognize the patterns in the dataset. The quality and quantity of the training set determine how well the model is going to perform.

Testing Set (15-25%): The main goal of every Machine Learning engineer is to develop a model which generalizes well over a given dataset. This is achieved by training the model(s) on a portion of the dataset and testing its performance by applying the trained model on another portion of the same (or a similar) dataset that has not been used during training (the Testing Set). This is important since the model might perform too well on the training set but perform poorly on unseen data, as was the case in the example given above.
The Testing Set is primarily used for model performance evaluation.

Validation Set (15-25%): In addition to the above, because more than one Machine Learning algorithm (model) is usually tried, it is not recommended to test the performance of multiple models on the same dataset and then choose the best one. This process is called Model Selection, and for this a separate part of the training set is used, which is known as the Validation Set. A validation set behaves similarly to a testing set, but it is used for model selection rather than for performance evaluation.

Bias and Variance - A Technical Introduction

What is Bias?

Bias is used to allow the Machine Learning model to learn in a simplified manner. Ideally, the simplest model that is able to learn the entire dataset and predict correctly on it is the best model. Hence, bias is introduced into the model with the goal of achieving the simplest possible model. Parameter-based learning algorithms usually have high bias, and hence are faster to train and easier to understand. However, too much bias causes the model to be oversimplified, and it therefore underfits the data. Such models are less flexible and often fail when they are applied to complex problems. Mathematically, bias is the difference between the model's average prediction and the expected value.

What is Variance?

Variance in this context is the variability of the model when different training data is used, which would significantly change the estimation of the target function. Statistically, for a given random variable, variance is the expectation of the squared deviation from its mean. In other words, the higher the variance of the model, the more complex it is and the more complex the functions it is able to learn. However, if the model is too complex for the given dataset, where a simpler solution is possible, high variance causes the model to overfit. When the model performs well on the Training Set but fails to perform on the Testing Set, the model is said to have high variance.

Characteristics of a biased model

A biased model will have the following characteristics:

Underfitting: A model with high bias is simpler than it should be and hence tends to underfit the data. In other words, the model fails to learn and acquire the intricate patterns of the dataset.
Low training accuracy: A biased model will not fit the training dataset properly and hence will have low training accuracy (or high training loss).
Inability to solve complex problems: A biased model is too simple and hence is often incapable of learning complex features and solving relatively complex problems.

Characteristics of a model with high variance

A model with high variance will have the following characteristics:

Overfitting: A model with high variance tends to be overly complex. This causes overfitting.
Low testing accuracy: A model with high variance will have very high training accuracy (or very low training loss), but it will have a low testing accuracy (or a high testing loss).
Overcomplicating simpler problems: A model with high variance tends to fit an overly complex curve to relatively simple data. The model is thus capable of solving complex problems but incapable of solving simple problems efficiently.

What is the Bias-Variance Tradeoff?

From the understanding of bias and variance individually thus far, it can be concluded that the two are complementary to each other; the small simulation below makes this concrete.
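The simulation repeatedly draws noisy training sets from the same underlying process, fits a very simple and a very flexible model to each, and then estimates the bias and variance of their predictions at a single test point (NumPy assumed; the data-generating function, noise level and degrees are illustrative):

import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return np.sin(2 * np.pi * x)

x_grid = np.linspace(0, 1, 20)          # fixed inputs; only the noise is redrawn
x_test = 0.3
predictions = {1: [], 9: []}            # model degree -> predictions at x_test

# draw many noisy training sets from the same underlying process and refit each time
for _ in range(300):
    y = true_f(x_grid) + rng.normal(0, 0.3, x_grid.size)
    for degree in (1, 9):
        coeffs = np.polyfit(x_grid, y, degree)
        predictions[degree].append(np.polyval(coeffs, x_test))

for degree in (1, 9):
    preds = np.array(predictions[degree])
    bias = preds.mean() - true_f(x_test)    # how far the average prediction sits from the truth
    variance = preds.var()                  # how much the prediction moves between training sets
    print("degree %d: bias %+.3f, variance %.4f" % (degree, bias, variance))

# the straight line (degree 1) shows a large bias but a small variance, while the
# degree-9 polynomial shows almost no bias but a much larger variance

The simple model is consistently wrong in the same way (bias), while the flexible model is right on average but swings from one training set to the next (variance).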
In other words, if the bias of a model is decreased, the variance of the model automatically increases. The reverse is also true: if the variance of a model decreases, its bias starts to increase. Hence, it can be concluded that it is nearly impossible to have a model with no bias or no variance, since decreasing one increases the other. This phenomenon is known as the Bias-Variance Tradeoff.

A graphical introduction to the Bias-Variance Tradeoff

In order to get a clear idea about the Bias-Variance Tradeoff, let us consider the bulls-eye diagram. Here, the central red portion of the target can be considered the location where the model predicts the values correctly. As we move away from the central red circle, the error in the prediction starts to increase. Each of the several hits on the target is achieved by a repetition of the model-building process; each hit represents an individual realization of the model. As can be seen in the diagram below, the bias and the variance together influence the predictions of the model under different circumstances.

Another way of looking at the Bias-Variance Tradeoff graphically is to plot error, bias and variance against the complexity of the model. In the graph shown below, the green dotted line represents variance, the blue dotted line represents bias and the red solid line represents the error in the prediction of the concerned model. Since bias is high for a simpler model and decreases with an increase in model complexity, the line representing bias decreases as the model complexity increases. Similarly, variance is high for a more complex model and low for simpler models, so the line representing variance increases as the model complexity increases. Finally, it can be seen that on either side the generalization error is quite high: both high bias and high variance lead to a higher error rate. The most optimal complexity of the model is right in the middle, where the bias and variance curves intersect; this part of the graph produces the least error and is preferred. Also, as discussed earlier, the model underfits in high-bias situations and overfits in high-variance situations.

Mathematical expression of the Bias-Variance Tradeoff

Let the vector of expected values be y and let the model's predicted output for an input vector x be ŷ. The relationship between the expected values and the inputs can be taken as y = f(x) + e, where e is normally distributed noise with mean zero. The expected prediction error can then be decomposed as:

Error(x) = Bias² + Variance + Irreducible Error

The third term in the above equation, the irreducible error, represents the noise and cannot be fundamentally reduced by any model. If, hypothetically, infinite data were available, it would be possible to tune the model so as to reduce the bias and variance terms to zero, but this is not possible in practice. Hence, there is always a tradeoff between the minimization of bias and variance.

Detection of Bias and Variance of a model

In model building, it is imperative to be able to detect whether the model is suffering from high bias or high variance.
The methods to detect high bias and high variance are given below.

Detection of High Bias:
The model suffers from a very high training error.
The validation error is similar in magnitude to the training error.
The model is underfitting.

Detection of High Variance:
The model has a very low training error.
The validation error is very high when compared to the training error.
The model is overfitting.

A graphical method to detect a model suffering from high bias or high variance is shown below. The graph shows the change in error rate with respect to model complexity for the training and validation error. The left portion of the graph suffers from high bias: the training error is quite high, along with the validation error, and the model complexity is low. The right portion of the graph suffers from high variance: the training error is very low, yet the validation error is very high and keeps increasing with increasing model complexity.

A systematic approach to solving a Bias-Variance problem, by Dr. Andrew Ng

Dr. Andrew Ng proposed a very simple-to-follow, step-by-step architecture to detect and solve high bias and high variance errors in a model. The block diagram is shown below.

Detection and solution to a high bias problem - if the training error is high:

Train longer: High bias usually means a less complex model, and hence it requires more training iterations to learn the relevant patterns. Longer training therefore sometimes resolves the error.
Train a more complex model: As mentioned above, high bias is the result of a less than optimal model complexity. Hence, to avoid high bias, the existing model can be swapped out for a more complex one.
Obtain more features: It is often possible that the existing dataset lacks the essential features needed for effective pattern recognition. To remedy this problem, more features can be collected for the existing data, or Feature Engineering can be performed on the existing features to extract more non-linear features.
Decrease regularization: Regularization is a process that decreases model complexity by regularizing the inputs at different stages, promoting generalization and preventing overfitting in the process. Decreasing regularization allows the model to learn the training dataset better.
New model architecture: If all of the above-mentioned methods fail to deliver satisfactory results, then it is suggested to try out other model architectures.

Detection and solution to a high variance problem - if the validation error is high:

Obtain more data: High variance is often caused by a lack of training data. The model complexity and the quantity of training data need to be balanced: a model of higher complexity requires a larger quantity of training data. Hence, if the model is suffering from high variance, more data can reduce the variance.
Decrease the number of features: If the dataset contains too many features for each data point, the model often starts to suffer from high variance and starts to overfit. Hence, decreasing the number of features is recommended.
Increase regularization: As mentioned above, regularization is a process that decreases model complexity.
Hence, if the model is suffering from high variance (which is caused by a complex model), then an increase in regularization can decrease the complexity and help the model generalize better; a minimal example is given after the conclusion below.
New model architecture: Similar to the solution for a model suffering from high bias, if all of the above-mentioned methods fail to deliver satisfactory results, then it is suggested to try out other model architectures.

Conclusion

To summarize, bias and variance play a major role in the training process of a model. It is necessary to reduce each of them individually to the minimum possible value. However, it should be kept in mind that an effort to decrease one of these quantities beyond a certain limit increases the probability of the other one increasing. This phenomenon is called the Bias-Variance Tradeoff and is a key consideration during model building.
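As promised above, here is a minimal illustration of increasing regularization to tame a high-variance model. It fits a moderately high-degree polynomial by ridge regression for several regularization strengths (NumPy assumed; the data, degree and lambda values are illustrative):

import numpy as np

rng = np.random.default_rng(3)

# noisy samples of a smooth curve, split into train and test portions
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.4, x.size)
train, test = np.arange(20), np.arange(20, 30)

def poly_features(x, degree=8):
    return np.vander(x, degree + 1)          # columns x^degree, ..., x, 1

def ridge_fit(X, y, lam):
    # closed-form ridge solution: w = (X^T X + lam * I)^(-1) X^T y
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def mse(a, b):
    return np.mean((a - b) ** 2)

Xtr, Xte = poly_features(x[train]), poly_features(x[test])
for lam in (0.0, 1e-3, 1e-1):
    w = ridge_fit(Xtr, y[train], lam)
    print("lambda %g: train MSE %.3f, test MSE %.3f"
          % (lam, mse(Xtr @ w, y[train]), mse(Xte @ w, y[test])))

# with lambda = 0 the flexible fit chases the noise (lowest training error); a
# moderate lambda usually trades a little training error for a better test error,
# i.e. a lower-variance, better-generalizing model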