What is Linear Regression in Machine Learning

Machine learning, a subset of artificial intelligence (AI), plays a growing role in our daily lives. Data scientists and developers in many domains use machine learning algorithms to make their work simpler. For example, machine learning algorithms enable Google Maps to find the fastest route to our destinations, allow Tesla to make driverless cars, help Amazon generate almost 35% of its annual income, let AccuWeather forecast the weather for 3.5 million locations weeks in advance, and let Facebook automatically detect faces and suggest tags.

In statistics and machine learning, linear regression is one of the most popular and best understood algorithms. Most data science enthusiasts and machine learning fanatics begin their journey with linear regression. In this article, we will look at how the linear regression algorithm works and how it can be used efficiently in your machine learning projects to build better models.

Linear regression is a machine learning algorithm in which the result is predicted from known parameters that are correlated with the output. It is used to predict values within a continuous range rather than to classify them into categories. The known parameters are used to fit a straight line with a constant slope, which is then used to predict the unknown result.

What is a Regression Problem?

The majority of machine learning algorithms fall under the supervised learning category. Supervised learning is the process where an algorithm learns to predict a result from previously entered values and the results observed for them. Suppose we have an input variable 'x' and an output variable 'y', where y is a function of x (y = f(x)). Supervised learning reads pairs of the entered variable 'x' and the resulting variable 'y' so that it can later predict 'y' accurately from a new value of 'x'.
A regression problem is one where the resulting variable takes a real or continuous value. Linear regression approaches it by drawing the line of best fit through the observed data points.

For example, which of these is a regression problem?

- How much gas will I spend if I drive for 100 miles?
- What is the nationality of a person?
- What is the age of a person?
- Which is the closest planet to the Sun?

Predicting the amount of gas to be spent and the age of a person are regression problems. Predicting nationality is categorical and the closest planet to the Sun is discrete.

What is Linear Regression?

Let's say we have a dataset which contains information about the relationship between 'number of hours studied' and 'marks obtained'. A number of students have been observed and their hours of study along with their grades are recorded. This will be our training data. Our goal is to design a model that can predict the marks if the number of hours studied is provided. Using the training data, we obtain the regression line that gives minimum error. This linear equation is then applied to new data. That is, if we give the number of hours studied by a student as an input, our model should be able to predict their mark with minimum error.

Hypothesis of Linear Regression

The linear regression model can be represented by the following equation:

$$Y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

where,

- Y is the predicted value
- θ₀ is the bias term
- θ₁,…,θn are the model parameters
- x₁, x₂,…,xn are the feature values

The above hypothesis can also be represented by

$$Y = \theta^T x$$

where θ is the model's parameter vector including the bias term θ₀, and x is the feature vector with x₀ = 1.

For a single feature this reduces to:

Y (pred) = b0 + b1*x

The values b0 and b1 must be chosen so that the error is minimum.
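To make "choosing b0 and b1 so that the error is minimum" concrete, here is a minimal sketch that scores two candidate lines by their sum of squared errors. The hours and marks values are made up for illustration and are not from any dataset used in this article; the function names are likewise hypothetical.

```python
import numpy as np

# Hypothetical training data (illustrative only): hours studied and marks obtained.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
marks = np.array([52.0, 58.0, 61.0, 70.0, 74.0])


def predict(b0, b1, x):
    """Prediction of the line Y(pred) = b0 + b1 * x."""
    return b0 + b1 * x


def sum_squared_error(b0, b1, x, y):
    """Sum of squared differences between actual and predicted values."""
    residuals = y - predict(b0, b1, x)
    return float(np.sum(residuals ** 2))


# Candidate 1: a flat line at the mean mark (slope 0).
sse_flat = sum_squared_error(63.0, 0.0, hours, marks)

# Candidate 2: a sloped line that follows the upward trend in the data.
sse_sloped = sum_squared_error(46.2, 5.6, hours, marks)

print(f"flat line SSE   = {sse_flat:.2f}")
print(f"sloped line SSE = {sse_sloped:.2f}")
```

The sloped line has a far smaller error than the flat one, which is the sense in which one pair (b0, b1) is "better" than another.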
If the sum of squared errors is taken as the metric to evaluate the model, then the goal is to obtain the line that minimizes it:

$$\text{Error} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \hat{y}_i = b_0 + b_1 x_i$$

If we don't square the error, then the positive and negative errors will cancel each other out.

For a model with one predictor,

$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

Exploring 'b1'

- If b1 > 0, then x (predictor) and y (target) have a positive relationship. That is, an increase in x will increase y.
- If b1 < 0, then x (predictor) and y (target) have a negative relationship. That is, an increase in x will decrease y.

Exploring 'b0'

- If the data does not include x = 0, then the prediction at x = 0, which is just b0, is meaningless. For example, take a dataset that relates height (x) and weight (y). Setting x = 0 (that is, a height of 0) leaves only the b0 term, which has no meaning because in reality height and weight can never be zero. This results from applying the model outside the range of the data it was fitted on.
- If the data includes the value 0, then 'b0' will be the average of all predicted values when x = 0. But setting all the predictor variables to zero is often impossible.
- The value of b0 guarantees that the residuals have mean zero. If there is no 'b0' term, then the regression line is forced to pass through the origin, and both the regression coefficient and the predictions will be biased.

How does Linear Regression work?

Let's look at a scenario where linear regression might be useful: losing weight. Suppose there is a connection between how many calories you take in and how much you weigh; regression analysis can help you understand that connection. Regression analysis will provide you with a relation which can be plotted as a graph in order to make predictions about your data.
For example, if you've been putting on weight over the last few years, it can predict how much you'll weigh in ten years if you continue to consume the same amount of calories and burn them at the same rate.

The goal of regression analysis is to create a trend line based on the data you have gathered. This then allows you to determine whether factors other than the amount of calories consumed affect your weight, such as the number of hours you sleep, work pressure, level of stress, or the type of exercise you do. Before taking these factors into account, we need to examine them and determine whether there is a correlation between them. Linear regression can then be used to draw a trend line which confirms or denies the relationship between attributes. If the test is done over a long duration, extensive data can be collected and the result can be evaluated more accurately. By the end of this article we will build a model that determines the line which best fits the data.

How do we determine the best fit line?

The best fit line is the line for which the error between the predicted values and the observed values is minimum. It is also called the regression line, and the errors are also known as residuals. The residuals can be visualized as the vertical lines from each observed data value to the regression line.

When to use Linear Regression?

Linear regression's power lies in its simplicity, which means that it can be used to solve problems across various fields. First, the data from the observations needs to be collected and plotted.
If the points lie close to a straight line, so that the differences between the predicted values and the observed results are small and roughly the same across the range, we can use linear regression for the problem.

Assumptions in linear regression

If you are planning to use linear regression for your problem, then there are some assumptions you need to consider:

- The relation between the dependent and independent variables should be almost linear.
- The data is homoscedastic, meaning the variance of the residuals should be roughly constant across the values of the independent variable.
- The result obtained from one observation should not be influenced by the result obtained from the previous observation.
- The residuals should be normally distributed. This assumption means that the probability density function of the residual values is normal at each value of the independent variable.

You can determine whether your data meets these conditions by plotting it and then doing a bit of digging into its structure.

Few properties of Regression Line

Here are a few features a regression line has:

- The regression line passes through the mean of the independent variable (x) as well as the mean of the dependent variable (y).
- The regression line minimizes the sum of the "Square of Residuals". That's why this method of linear regression is known as "Ordinary Least Squares (OLS)". We will discuss Ordinary Least Squares in more detail later on.
- B1 explains the change in Y for a change in x of one unit. In other words, if we increase the value of 'x' by one unit, Y changes by B1.

Finding a Linear Regression line

Let's say we want to predict 'y' from 'x' given in the following table, and assume they are related as "y = B0 + B1∗x".

Table 1

| x | y | Predicted 'y' |
|---|---|---|
| 1 | 2 | B0 + B1∗1 |
| 2 | 1 | B0 + B1∗2 |
| 3 | 3 | B0 + B1∗3 |
| 4 | 6 | B0 + B1∗4 |
| 5 | 9 | B0 + B1∗5 |
| 6 | 11 | B0 + B1∗6 |
| 7 | 13 | B0 + B1∗7 |
| 8 | 15 | B0 + B1∗8 |
| 9 | 17 | B0 + B1∗9 |
| 10 | 20 | B0 + B1∗10 |

where,

| Statistic | Value |
|---|---|
| Std. Dev. of x | 3.02765 |
| Std. Dev. of y | 6.617317 |
| Mean of x | 5.5 |
| Mean of y | 9.7 |
| Correlation between x & y | 0.989938 |

If the Residual Sum of Squares (RSS) is differentiated with respect to B0 and B1 and the results are equated to zero, we get the following equations:

B1 = Correlation * (Std. Dev. of y / Std. Dev. of x)
B0 = Mean(Y) - B1 * Mean(X)

Putting the values from Table 1 into the above equations,

B1 = 0.989938 * (6.617317 / 3.02765) = 2.16
B0 = 9.7 - 2.16 * 5.5 = -2.2

Hence, the least squares regression equation becomes:

Y = -2.2 + 2.16*x

| x | Y - Actual | Y - Predicted |
|---|---|---|
| 1 | 2 | -0.04 |
| 2 | 1 | 2.13 |
| 3 | 3 | 4.29 |
| 4 | 6 | 6.45 |
| 5 | 9 | 8.62 |
| 6 | 11 | 10.78 |
| 7 | 13 | 12.95 |
| 8 | 15 | 15.11 |
| 9 | 17 | 17.27 |
| 10 | 20 | 19.44 |

With only 10 data points the estimates are not very precise, but the correlation between the predicted and actual values is very high: the two move almost together.

Model Performance

After the model is built, if the difference between the predicted and actual values is not much, it is considered to be a good model and can be used to make future predictions. How much counts as "not much" depends entirely on the task you want to perform and on how much variation in the data can be tolerated. Here are a few metrics we can use to calculate the error in the model:

R-Square (R2)

Total Sum of Squares (TSS): A sum of squares is a measure of how a data set varies around a central number (like the mean). The Total Sum of Squares tells how much variation there is in the dependent variable.

TSS = Σ (Y - Mean[Y])²

Residual Sum of Squares (RSS): The residual sum of squares tells you how much of the dependent variable's variation your model did not explain. It is the sum of the squared differences between the actual Y and the predicted Y.

RSS = Σ (Y - f[X])²

(TSS - RSS) measures the amount of variability in the response that is explained by performing the regression. R2 is that amount as a fraction of the total:

R2 = (TSS - RSS) / TSS = 1 - RSS / TSS

Properties of R2

- For a least squares fit with an intercept, R2 on the training data always ranges between 0 and 1.
- R2 of 0 means that there is no linear correlation between the dependent and the independent variable.
- R2 of 1 means the dependent variable can be predicted from the independent variable without any error.
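The worked example and the R2 definition can be checked numerically. The sketch below is one way to do that with NumPy, using the ten (x, y) pairs from Table 1; the variable names are my own. It uses the sample standard deviation (`ddof=1`), which matches the values 3.02765 and 6.617317 quoted in the table.

```python
import numpy as np

# The ten observations from Table 1.
x = np.arange(1, 11, dtype=float)
y = np.array([2, 1, 3, 6, 9, 11, 13, 15, 17, 20], dtype=float)

# B1 = Correlation * (Std. Dev. of y / Std. Dev. of x)
correlation = np.corrcoef(x, y)[0, 1]
b1 = correlation * (np.std(y, ddof=1) / np.std(x, ddof=1))  # approximately 2.16

# B0 = Mean(Y) - B1 * Mean(X)
b0 = y.mean() - b1 * x.mean()  # approximately -2.2

# Predictions and the two sums of squares behind R2.
y_pred = b0 + b1 * x
tss = np.sum((y - y.mean()) ** 2)   # total variation in y
rss = np.sum((y - y_pred) ** 2)     # variation the line does not explain
r_squared = 1 - rss / tss

print(f"B1 = {b1:.4f}, B0 = {b0:.4f}")
print(f"TSS = {tss:.3f}, RSS = {rss:.3f}, R2 = {r_squared:.4f}")
```

For simple linear regression, R2 equals the square of the correlation coefficient, which gives a quick sanity check on the result.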
An R2 between 0 and 1 indicates the extent to which the dependent variable is predictable. An R2 of 0.20 means that 20% of the variance in Y is predictable from X; an R2 of 0.40 means that 40% is predictable; and so on.

Root Mean Square Error (RMSE)

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). The formula for calculating RMSE is:

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(Y_i - \hat{Y}_i)^2}$$

where N is the total number of observations, Y_i is the actual value and Ŷ_i is the predicted value.

When standardized observations are used as RMSE inputs, there is a direct relationship with the correlation coefficient. For example, if the correlation coefficient is 1, the RMSE will be 0, because all of the points lie on the regression line (and therefore there are no errors).

Mean Absolute Percentage Error (MAPE)

RMSE is expressed in the units of Y, which makes it hard to compare across tasks, so analysts often prefer MAPE. MAPE gives the error as a percentage, which makes it easier to compare how different models perform. The formula for calculating MAPE can be written as:

$$MAPE = \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right|$$

where N is the total number of observations.

Feature Selection

Feature selection is the automatic selection of the attributes in your data that are most relevant to the predictive model you are working on. It seeks to reduce the number of attributes in the dataset by eliminating the features which are not required for the model construction. Feature selection does not alter an attribute that is kept in the model; rather, it mutes the attributes that are left out and works with the features which affect the model.

Feature selection aids your mission to create an accurate predictive model. It helps by choosing features that will give you as good or better accuracy whilst requiring less data. Feature selection methods can be used to identify and remove unnecessary, irrelevant and redundant attributes that do not contribute to the accuracy of the model or may even decrease it.
Having fewer attributes is desirable because it reduces the complexity of the model, and a simpler model is easier to understand, explain and work with.

Feature Selection Algorithms:

- Filter Method: This method involves assigning scores to individual features and ranking them. The features that have very little to almost no impact are removed from consideration while constructing the model.
- Wrapper Method: The wrapper method is quite similar to the filter method, except that it considers attributes as a group: a combination of attributes is taken and checked for its impact on the model, and if it has none, another combination is tried.
- Embedded Method: The embedded method is often regarded as the most accurate of the three. It learns which features affect the model while the model is being constructed and takes only those features into consideration. The most common embedded feature selection methods are regularization methods.

Cost Function

The cost function helps to figure out the best possible values of b0 and b1, which define the line of best fit for the data points. Since we want to reduce the error of the resulting value, we turn the search for these values into a minimization problem: reduce the error between the predicted value and the actual value.

$$J = \frac{1}{n}\sum_{i=1}^{n}(\text{pred}_i - y_i)^2$$

Here, J is the cost function.

The function has this form because it measures the error difference between the predicted values and the actual values. We square the error difference for each data point, sum over all the data points, and divide by the total number of data points. This cost function J is also called the Mean Squared Error (MSE) function. Using this MSE function, we adjust the values of b0 and b1 until the MSE settles at its minimum.

Gradient Descent

Gradient descent is an optimization algorithm that helps machine learning models find a path to a minimum value using repeated steps.
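A minimal sketch of those repeated steps, applied to the MSE cost function J for a one-predictor model, might look like the following. The learning rate and step count are assumptions chosen so that this small example converges, and the data is the ten-point x/y table from the "Finding a Linear Regression line" section; none of this is a tuned or production implementation.

```python
import numpy as np


def mse(b0, b1, x, y):
    """Cost function J: mean of the squared prediction errors."""
    return float(np.mean((b0 + b1 * x - y) ** 2))


def gradient_descent(x, y, learning_rate=0.01, n_steps=5000):
    """Repeatedly step b0 and b1 against the gradient of the MSE."""
    b0, b1 = 0.0, 0.0  # arbitrary starting point
    for _ in range(n_steps):
        error = (b0 + b1 * x) - y
        grad_b0 = 2.0 * np.mean(error)       # dJ/db0
        grad_b1 = 2.0 * np.mean(error * x)   # dJ/db1
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


# Ten-point example used earlier in the article.
x = np.arange(1, 11, dtype=float)
y = np.array([2, 1, 3, 6, 9, 11, 13, 15, 17, 20], dtype=float)

start_cost = mse(0.0, 0.0, x, y)
b0, b1 = gradient_descent(x, y)
final_cost = mse(b0, b1, x, y)

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"cost before = {start_cost:.3f}, cost after = {final_cost:.3f}")
```

A learning rate that is too large makes the steps overshoot and diverge, while one that is too small needs many more steps; 0.01 is simply a value that works for inputs on this scale.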
Gradient descent is used to minimize a function so that it gives the lowest output of that function. This function is called the loss function. The loss function shows us how much error is produced by the machine learning model compared to the actual results. Our aim should be to lower the cost function as much as possible, and gradient descent is one way of achieving that. When the equations are too complex to solve directly, the partial derivative of the cost function with respect to each parameter tells us in which direction to move that parameter to reach its optimal value. You may refer to the article on Gradient Descent for Machine Learning.

Simple Linear Regression

Optimization is a big part of machine learning, and almost every machine learning algorithm has an optimization technique at its core. Gradient descent is one such optimization algorithm, used to find the values of the coefficients of a function that minimize the cost function. Gradient descent is best applied when the solution cannot be obtained by analytical methods (linear algebra) and must be obtained by an optimization technique.

Residual Analysis: Simple linear regression models the relationship between the magnitude of one variable and that of a second, for example, as x increases, y also increases, or as x increases, y decreases. Correlation is another way to measure how two variables are related. The models built by simple linear regression try to predict the actual result, but most often they deviate from it. Residual analysis is used to calculate by how much the estimated value has deviated from the actual result.

Null Hypothesis and p-value: During feature selection, the null hypothesis is used to find which attributes do not affect the result of the model. Hypothesis tests are used to test the validity of a claim that is made about a particular attribute of the model. The claim that is on trial is called the null hypothesis.
A p-value helps to determine the significance of the results. The p-value is a number between 0 and 1 and is interpreted in the following way:

- A small p-value (less than 0.05) indicates strong evidence against the null hypothesis, so the null hypothesis is rejected.
- A large p-value (greater than 0.05) indicates weak evidence against the null hypothesis, so we fail to reject the null hypothesis.
- A p-value very close to the cut-off (0.05) is considered marginal (it could go either way). In this case, the p-value should be reported to the readers so that they can draw their own conclusions.

Ordinary Least Square

Ordinary Least Squares (OLS), also known as ordinary least squares regression or least squared errors regression, is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function so as to minimize the sum of the squares of the differences between the observed values of the dependent variable and the values predicted by the linear function. There are two types of relationships that may occur: linear and curvilinear. A linear relationship is a straight line drawn through the central tendency of the points, whereas a curvilinear relationship is a curved line. Association between the variables is depicted using a scatter plot. The relationship can be positive or negative, and it can also vary in strength.

The advantage of Ordinary Least Squares regression is that it is easy to interpret and can be computed efficiently with standard linear algebra routines. It can be applied to problems with many independent variables and scales efficiently to thousands of data points.
In linear regression, OLS is used to estimate the unknown parameters by building a model which minimizes the sum of the squared errors between the observed data and the predicted values.

Let us simulate some data and look at how the predicted values (Yₑ) differ from the actual values (Y):

    import pandas as pd
    import numpy as np
    from matplotlib import pyplot as plt

    # Generate 'random' data
    np.random.seed(0)
    X = 2.5 * np.random.randn(100) + 1.5   # Array of 100 values with mean = 1.5, stddev = 2.5
    res = 0.5 * np.random.randn(100)       # Generate 100 residual terms
    y = 2 + 0.3 * X + res                  # Actual values of Y

    # Create pandas dataframe to store our X and y values
    df = pd.DataFrame(
        {'X': X,
         'y': y}
    )

    # Show the first five rows of our dataframe
    df.head()

|   | X | y |
|---|---|---|
| 0 | 5.910131 | 4.714615 |
| 1 | 2.500393 | 2.076238 |
| 2 | 3.946845 | 2.548811 |
| 3 | 7.102233 | 4.615368 |
| 4 | 6.168895 | 3.264107 |

To estimate y using the OLS method, we need to calculate xmean and ymean, the covariance of X and y (xycov), and the variance of X (xvar) before we can determine the values for alpha and beta.

    # Calculate the mean of X and y
    xmean = np.mean(X)
    ymean = np.mean(y)

    # Calculate the terms needed for the numerator and denominator of beta
    df['xycov'] = (df['X'] - xmean) * (df['y'] - ymean)
    df['xvar'] = (df['X'] - xmean)**2

    # Calculate beta and alpha
    beta = df['xycov'].sum() / df['xvar'].sum()
    alpha = ymean - (beta * xmean)
    print(f'alpha = {alpha}')
    print(f'beta = {beta}')

Output:

    alpha = 2.0031670124623426
    beta = 0.3229396867092763

Now that we have an estimate for alpha and beta, we can write our model as Yₑ = 2.003 + 0.323 X, and make predictions:

    ypred = alpha + beta * X

Let's plot our prediction ypred against the actual values of y, to get a better visual understanding of our model.

    # Plot regression against actual data
    plt.figure(figsize=(12, 6))
    plt.plot(X, ypred)      # regression line
    plt.plot(X, y, 'ro')    # scatter plot showing actual data
    plt.title('Actual vs Predicted')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.show()

The blue line in the resulting graph is our line of best fit, Yₑ = 2.003 + 0.323 X. If you observe the graph carefully, you will notice that there is a linear relationship between X and Y. Using this model, we can predict Y from any value of X. For example, for X = 8,

Yₑ = 2.003 + 0.323 (8) = 4.587

Regularization

Regularization is a technique used in regression to shrink the coefficient estimates towards zero. This keeps the model from fitting patterns that don't represent the true properties of the data but have appeared by random chance. Earlier we saw that to estimate the regression coefficients β in the least squares method, we must minimize the Residual Sum of Squares (RSS). Let the RSS equation in this case be:

$$RSS = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2$$

The general linear regression model can be expressed using a condensed formula:

$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

Here, β = [β0, β1, …, βp].

Minimizing the RSS adjusts the coefficients β based on the training data. If the training data contains a lot of noise, the estimated coefficients won't generalize well to future data. This is where regularization comes in and shrinks, or regularizes, these learned estimates towards zero.

Ridge regression

Ridge regression is very similar to least squares, except that the ridge coefficients are estimated by minimizing a different quantity. In particular, the ridge regression coefficients β are the values that minimize the following quantity:

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = RSS + \lambda\sum_{j=1}^{p}\beta_j^2$$

Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of the model. λ controls the relative impact of the two components: the RSS and the penalty term. If λ = 0, ridge regression produces the same result as the least squares method. If λ → ∞, all estimated coefficients tend to zero. Ridge regression produces different estimates for different values of λ. The optimal choice of λ is crucial and should be made with cross-validation.
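To see the effect of λ, here is a small sketch of ridge regression using its closed-form solution in NumPy. The data is synthetic and the function name `ridge_fit` is hypothetical; the inputs are centred so that the intercept is not penalized, which is an assumption consistent with the penalty sum starting at j = 1. It is meant only to illustrate the two limiting cases, not to replace a library implementation.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge estimate; the intercept is left unpenalized."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean          # centre the features
    yc = y - y_mean          # centre the target
    n_features = X.shape[1]
    # Solve (Xc^T Xc + lam * I) beta = Xc^T yc
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta


# Synthetic data (illustrative only): 50 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

for lam in (0.0, 10.0, 1e6):
    intercept, beta = ridge_fit(X, y, lam)
    print(f"lambda = {lam:>9}: betas = {np.round(beta, 4)}")
```

With λ = 0 the coefficients match an ordinary least squares fit, and as λ grows they shrink towards zero. In practice, λ would be chosen by comparing validation error across a grid of candidate values.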
The penalty term used by ridge regression is the squared L2 norm of the coefficient vector, which is why ridge regression is also known as L2 regularization.

The predictions generated by the Ordinary Least Squares method do not depend on the scale of the inputs: if an input variable is multiplied by a constant, the corresponding coefficient is divided by the same constant, so the product of the coefficient and the input variable remains the same. The same is not true for ridge regression, so we need to bring the input variables to the same scale before fitting. To standardize the variables, we subtract their means and divide by their standard deviations.

Lasso Regression

Least Absolute Shrinkage and Selection Operator (LASSO) regression also shrinks the coefficients by adding a penalty to the sum of squares of the residuals, but the lasso penalty has a slightly different effect. The lasso penalty is the sum of the absolute values of the coefficient vector, which corresponds to its L1 norm. Hence, the lasso estimate is defined by minimizing:

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = RSS + \lambda\sum_{j=1}^{p}|\beta_j|$$

As with ridge regression, the input variables need to be standardized. The lasso penalty makes the solution nonlinear in y, and unlike ridge regression there is no closed-form expression for the coefficients. Instead, the lasso solution is a quadratic programming problem, and efficient algorithms are available that compute the entire path of coefficients for different values of λ at the same computational cost as ridge regression.

The lasso penalty has the effect of reducing some coefficients exactly to zero as the regularization increases. For this reason, the lasso can be used for the continuous selection of a subset of features.

Linear Regression with multiple variables

Linear regression with multiple variables is often called "multivariate linear regression"; strictly speaking, it is multiple linear regression, since there is still only one output variable.
We now introduce notation for equations where we can have any number of input variables.

- x_j^(i) = value of feature j in the i-th training example
- x^(i) = the input (features) of the i-th training example
- m = the number of training examples
- n = the number of features

The multivariable form of the hypothesis function accommodating these multiple features is as follows:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n$$

To develop intuition about this function, we can think of θ0 as the basic price of a house, θ1 as the price per square meter, θ2 as the price per floor, and so on. x1 will be the number of square meters in the house, x2 the number of floors, and so on.

Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

$$h_\theta(x) = \begin{bmatrix}\theta_0 & \theta_1 & \dots & \theta_n\end{bmatrix}\begin{bmatrix}x_0 \\ x_1 \\ \vdots \\ x_n\end{bmatrix} = \theta^T x$$

This is a vectorization of our hypothesis function for one training example.

Remark: For convenience we assume x_0^(i) = 1 for i ∈ 1,…,m. This allows us to do matrix operations with θ and x, because the two vectors θ and x^(i) then have the same number of elements (n+1).

Multiple Linear Regression

How is it different?

In simple linear regression we use a single independent variable to predict the value of a dependent variable, whereas in multiple linear regression two or more independent variables are used to predict the value of a dependent variable. The difference between the two is the number of independent variables. In both cases there is only a single dependent variable.

Multicollinearity

Multicollinearity describes the strength of the relationship between independent variables. It is a state of very high intercorrelations or inter-associations among the independent variables. It is therefore a type of disturbance in the data, and if it is present, the statistical inferences made about the data may not be reliable. The VIF (Variance Inflation Factor) is used to identify multicollinearity.
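One common way to compute the VIF of a feature is to regress that feature on all the other features and take 1 / (1 - R2) of that auxiliary regression. The sketch below does this with NumPy on synthetic data; the function name `vif` and the three made-up features are assumptions for illustration only.

```python
import numpy as np


def vif(X):
    """Variance Inflation Factor of each column of X.

    Each feature is regressed on the remaining features (with an
    intercept), and its VIF is 1 / (1 - R2) of that regression.
    """
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    factors = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)


# Synthetic features (illustrative only): c is almost a copy of a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)

factors = vif(np.column_stack([a, b, c]))
for name, value in zip(["a", "b", "c"], factors):
    print(f"VIF({name}) = {value:.2f}")
```

Features a and c inflate each other's VIF because each is almost fully explained by the other, while the independent feature b stays close to 1.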
A common rule of thumb is that if the VIF value of a variable is greater than 4, we exclude that variable from our model.

There are certain reasons why multicollinearity occurs:

- It is caused by an inaccurate use of dummy variables.
- It is caused by the inclusion of a variable which is computed from other variables in the data set.
- It can result from the repetition of the same kind of variable.
- It generally occurs when the variables are highly correlated with each other.

Multicollinearity can result in several problems:

- The partial regression coefficients may not be estimated precisely. The standard errors are likely to be high.
- The signs as well as the magnitudes of the partial regression coefficients may change from one sample to another.
- It becomes difficult to assess the relative importance of the independent variables in explaining the variation in the dependent variable.

Iterative Models

Models should be tested and improved again and again for better performance. Multiple iterations allow the model to learn from its previous results and take them into consideration when performing the task again.

Making predictions with Linear Regression

Linear regression can be used to predict the value of an unknown variable from a known variable with the help of a straight line (also called the regression line). The prediction should only be made if both the correlation coefficient and a scatterplot show a significant correlation between the known and the unknown variable.

The general procedure for using regression to make good predictions is the following:

1. Research the subject area so that the model can be built on the results produced by similar models. This research helps with the subsequent steps.
2. Collect data for appropriate variables which have some correlation with the quantity being modelled.
3. Specify and assess the regression model.
4. Run repeated tests so that the model has more data to work with.

To test if the model is good enough, observe whether:

- The scatter plot forms a linear pattern.
- The correlation coefficient r has a value above 0.5 or below -0.5. A positive value indicates a positive relationship and a negative value represents a negative relationship.

If the correlation coefficient shows a strong relationship between variables but the scatter plot is not linear, the results can be misleading. Examples of how to use linear regression have been shown earlier.

Data preparation for Linear Regression

Step 1: Linear Assumption

The first step in data preparation is checking which independent variables have some sort of linear correlation with the dependent variable.

Step 2: Remove Noise

This is the process of reducing the number of attributes in the dataset by eliminating the features which contribute very little or nothing to the construction of the model.

Step 3: Remove Collinearity

Collinearity describes the strength of the relationship between independent variables. If two or more variables are highly collinear, it does not make sense to keep all of them while evaluating the model, and hence we can keep just one of them.

Step 4: Gaussian Distributions

The linear regression model will produce more reliable results if the input and output variables have a Gaussian distribution. The central limit theorem states that a sample mean from an infinite population is approximately normal, or Gaussian, with mean the same as the underlying population, and variance equal to the population variance divided by the sample size.
The approximation improves as the sample size gets large.

Step 5: Rescale Inputs

The linear regression model will produce more reliable predictions if the input variables are rescaled using standardization or normalization.

Linear Regression with statsmodels

We have already discussed the OLS method; now we will see how to use it in the statsmodels library. For this we will be using the popular advertising dataset. Here, we will only look at the TV variable and explore whether spending on TV advertising can predict the number of sales for the product. Let's start by importing this csv file as a pandas dataframe using read_csv():

    # Import and display first five rows of advertising dataset
    advert = pd.read_csv('advertising.csv')
    advert.head()

|   | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 0 | 230.1 | 37.8 | 69.2 | 22.1 |
| 1 | 44.5 | 39.3 | 45.1 | 10.4 |
| 2 | 17.2 | 45.9 | 69.3 | 12.0 |
| 3 | 151.5 | 41.3 | 58.5 | 16.5 |
| 4 | 180.8 | 10.8 | 58.4 | 17.9 |

Now we will use the statsmodels ols function to initialize a simple linear regression model. It takes the formula y ~ X, where X is the predictor variable (TV advertising costs) and y is the output variable (Sales).
Then, we will fit the model by calling the OLS object's fit() method.

    import statsmodels.formula.api as smf

    # Initialise and fit linear regression model using `statsmodels`
    model = smf.ols('Sales ~ TV', data=advert)
    model = model.fit()

Once we have fit the simple regression model, we can predict the values of sales from the equation we just derived using the .predict method. We can also visualise our regression model by plotting sales_pred against the TV advertising costs to see the line of best fit.

    # Predict values
    sales_pred = model.predict()

    # Plot regression against actual data
    plt.figure(figsize=(12, 6))
    plt.plot(advert['TV'], advert['Sales'], 'o')           # scatter plot showing actual data
    plt.plot(advert['TV'], sales_pred, 'r', linewidth=2)   # regression line
    plt.xlabel('TV Advertising Costs')
    plt.ylabel('Sales')
    plt.title('TV vs Sales')
    plt.show()

In the resulting graph, you will see that there is a positive linear relationship between TV advertising costs and Sales. You may also summarize this by saying that spending more on TV advertising predicts a higher number of sales.

Linear Regression with scikit-learn

Let us learn to implement linear regression models using sklearn. For this model as well, we will continue to use the advertising dataset, but now we will use two predictor variables to create a multiple linear regression model:

Yₑ = α + β₁X₁ + β₂X₂ + … + βₚXₚ, where p is the number of predictors.

In our example, we will be predicting Sales using the variables TV and Radio, i.e. our model can be written as:

Sales = α + β₁*TV + β₂*Radio

    from sklearn.linear_model import LinearRegression

    # Build linear regression model using TV and Radio as predictors
    # Split data into predictors X and output y
    predictors = ['TV', 'Radio']
    X = advert[predictors]
    y = advert['Sales']

    # Initialise and fit model
    lm = LinearRegression()
    model = lm.fit(X, y)
    print(f'alpha = {model.intercept_}')
    print(f'betas = {model.coef_}')

Output:

    alpha = 4.630879464097768
    betas = [0.05444896 0.10717457]

    # Predicted sales for every row of the training data
    model.predict(X)

Now that we have fit a multiple linear regression model to our data, we can predict sales from any combination of TV and Radio advertising costs. For example, suppose you want to know how many sales we would make if we invested $600 in TV advertising and $300 in Radio advertising. You can find out with:

    # A DataFrame with the same column names as the training data
    new_X = pd.DataFrame([[600, 300]], columns=predictors)
    print(model.predict(new_X))

Output:

    [69.4526273]

We get the output 69.45, which means that if we invest $600 in TV and $300 in Radio advertising, we can expect to sell approximately 69 units.

Summary

Let us sum up what we have covered in this article:

- How to understand a regression problem
- What linear regression is and how it works
- The Ordinary Least Squares method and regularization
- Implementing linear regression in Python using the statsmodels and sklearn libraries

We have discussed a couple of ways to implement linear regression and build efficient models for certain business problems. If you are inspired by the opportunities provided by machine learning, enrol in our Data Science and Machine Learning Courses for more lucrative career options in this landscape.
What is Linear Regression in Machine Learning
Priyankur

Machine Learning, a subset of Artificial Intelligence (AI), plays a prominent role in our daily lives. Data science engineers and developers across many domains use machine learning algorithms to make their tasks simpler. For example, machine learning algorithms enable Google Maps to find the fastest route to our destinations, allow Tesla to make driverless cars, help Amazon generate almost 35% of its annual income, let AccuWeather forecast the weather for 3.5 million locations weeks in advance, and let Facebook automatically detect faces and suggest tags.

In statistics and machine learning, linear regression is one of the most popular and well-understood algorithms. Most data science enthusiasts and machine learning fanatics begin their journey with linear regression. In this article, we will look at how the linear regression algorithm works and how it can be used efficiently in your machine learning projects to build better models.

Linear Regression is a machine learning algorithm in which the result is predicted from known parameters that are correlated with the output. It is used to predict values within a continuous range rather than to classify them into categories. The known parameters are used to fit a straight line with a constant slope, which is then used to predict the unknown result.

What is a Regression Problem?

The majority of machine learning algorithms fall under the supervised learning category, in which an algorithm learns to predict a result from previously entered values and the results generated from them. Suppose we have an input variable ‘x’ and an output variable ‘y’, where y is a function of x (y = f(x)). Supervised learning reads the values of the input ‘x’ and the resulting ‘y’ so that it can later predict ‘y’ accurately from a new value of ‘x’.
A regression problem is one in which the resulting variable is a real or continuous value. It tries to draw the line of best fit through the data gathered from a number of points.

For example, which of these is a regression problem?

- How much gas will I spend if I drive for 100 miles?
- What is the nationality of a person?
- What is the age of a person?
- Which is the closest planet to the Sun?

Predicting the amount of gas to be spent and the age of a person are regression problems. Predicting nationality is categorical and the closest planet to the Sun is discrete.

What is Linear Regression?

Let’s say we have a dataset which contains information about the relationship between ‘number of hours studied’ and ‘marks obtained’. A number of students have been observed, and their hours of study along with their grades are recorded. This will be our training data. Our goal is to design a model that can predict the marks if the number of hours studied is provided. Using the training data, we obtain the regression line that gives the minimum error. This linear equation is then applied to new data. That is, if we give the number of hours studied by a student as an input, our model should be able to predict their mark with minimum error.

Hypothesis of Linear Regression

The linear regression model can be represented by the following equation:

$$Y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

where,

- Y is the predicted value
- θ₀ is the bias term
- θ₁,…,θₙ are the model parameters
- x₁, x₂,…,xₙ are the feature values

The above hypothesis can also be represented by

$$Y = \theta^{T} x$$

where θ is the model’s parameter vector including the bias term θ₀, and x is the feature vector with x₀ = 1.

For a model with a single feature, this reduces to:

Y (pred) = b0 + b1*x

The values b0 and b1 must be chosen so that the error is minimum.
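The hypothesis above can be sketched in a few lines of plain Python. This is only an illustration: the function name `predict` and the parameter values are made up for the example and do not come from a fitted model.

```python
def predict(theta, features):
    """Return theta_0 + theta_1*x_1 + ... + theta_n*x_n for one example."""
    x = [1.0] + list(features)  # prepend x_0 = 1 so theta_0 acts as the bias
    if len(x) != len(theta):
        raise ValueError("theta needs exactly one more entry than features")
    # Dot product of the parameter vector and the feature vector
    return sum(t * xi for t, xi in zip(theta, x))


# Hypothetical parameters: bias 2.0, then one weight per feature
theta = [2.0, 0.5, -1.0]
y_pred = predict(theta, [4.0, 3.0])  # 2.0 + 0.5*4.0 - 1.0*3.0
print(y_pred)
```

Prepending x₀ = 1 is what lets the bias term be handled by the same dot product as the other parameters.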
If the sum of squared errors is taken as the metric to evaluate the model, then the goal is to obtain the line that minimizes this error. If we don’t square the errors, the positive and negative errors will cancel each other out.

For a model with one predictor, the least-squares estimates are:

$$b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

Exploring ‘b1’

- If b1 > 0, then x (predictor) and y (target) have a positive relationship. That is, an increase in x will increase y.
- If b1 < 0, then x (predictor) and y (target) have a negative relationship. That is, an increase in x will decrease y.

Exploring ‘b0’

- If the data does not include x = 0, then the prediction at x = 0, which is b0 alone, is meaningless. For example, suppose we have a dataset that relates height (x) and weight (y). Taking x = 0 (that is, a height of 0) leaves only b0 in the equation, which is meaningless because in reality height and weight can never be zero. This results from applying the model beyond the range of its data.
- If the data includes x = 0, then ‘b0’ is the average predicted value when x = 0. However, setting all the predictor variables to zero is often impossible.
- The b0 term guarantees that the residuals have mean zero. Without a ‘b0’ term, the regression line is forced to pass through the origin, and both the regression coefficients and the predictions will be biased.

How does Linear Regression work?

Let’s look at a scenario where linear regression might be useful: losing weight. Suppose there is a connection between how many calories you take in and how much you weigh; regression analysis can help you understand that connection. Regression analysis provides a relationship that can be plotted as a graph in order to make predictions about your data.
For example, if you’ve been putting on weight over the last few years, it can predict how much you’ll weigh in ten years if you continue to consume the same amount of calories and burn them at the same rate.

The goal of regression analysis is to create a trend line based on the data you have gathered. This then allows you to determine whether factors other than the amount of calories consumed affect your weight, such as the number of hours you sleep, work pressure, level of stress, the type of exercise you do, etc. Before taking these factors into account, we need to examine them and determine whether they are correlated. Linear Regression can then be used to draw a trend line which confirms or refutes the relationship between attributes. If the test is carried out over a long period, extensive data can be collected and the result can be evaluated more accurately. By the end of this article we will build a model that determines the line which best fits the data.

How do we determine the best fit line?

The best fit line is the line for which the error between the predicted values and the observed values is minimum. It is also called the regression line, and the errors are also known as residuals. The residuals can be visualized as the vertical lines from each observed data value to the regression line.

When to use Linear Regression?

Linear Regression’s power lies in its simplicity, which means that it can be used to solve problems across various fields. First, the data collected from the observations needs to be plotted along a line.
If the difference between the predicted value and the observed result is small and roughly the same across the data, we can use linear regression for the problem.

Assumptions in linear regression

If you are planning to use linear regression for your problem, there are some assumptions you need to consider:

- The relation between the dependent and independent variables should be approximately linear.
- The data is homoscedastic, meaning the variance of the residuals should be roughly constant across the range of the independent variables.
- The results obtained from one observation should not be influenced by the results obtained from the previous observation.
- The residuals should be normally distributed. This assumption means that the probability density function of the residual values is normal at each value of the independent variable.

You can determine whether your data meets these conditions by plotting it and then doing a bit of digging into its structure.

Few properties of Regression Line

Here are a few properties of a regression line:

- The regression line passes through the mean of the independent variable (x) as well as the mean of the dependent variable (y).
- The regression line minimizes the sum of the “squares of residuals”. That’s why this method of linear regression is known as “Ordinary Least Squares (OLS)”. We will discuss Ordinary Least Squares in more detail later on.
- B1 gives the change in Y for a one-unit change in x. In other words, if we increase ‘x’ by one unit, the predicted Y changes by B1.

Finding a Linear Regression line

Let’s say we want to predict ‘y’ from ‘x’ given in the following table and assume they are related as “y = B0 + B1∗x”.

x     y     Predicted 'y'
1     2     B0 + B1∗1
2     1     B0 + B1∗2
3     3     B0 + B1∗3
4     6     B0 + B1∗4
5     9     B0 + B1∗5
6     11    B0 + B1∗6
7     13    B0 + B1∗7
8     15    B0 + B1∗8
9     17    B0 + B1∗9
10    20    B0 + B1∗10

where,

Std. Dev. of x               3.02765
Std. Dev. of y               6.617317
Mean of x                    5.5
Mean of y                    9.7
Correlation between x & y    0.989938

If the Residual Sum of Squares (RSS) is differentiated with respect to B0 and B1 and the results are equated to zero, we get the following equations:

B1 = Correlation * (Std. Dev. of y / Std. Dev. of x)
B0 = Mean(Y) - B1 * Mean(X)

Putting the values from the table above into these equations,

B1 = 0.989938 * (6.617317 / 3.02765) ≈ 2.16
B0 = 9.7 - 2.1636 * 5.5 ≈ -2.2

Hence, the least-squares regression equation becomes:

Y = -2.2 + 2.16*x

x     Y - Actual    Y - Predicted
1     2             -0.04
2     1             2.13
3     3             4.29
4     6             6.45
5     9             8.62
6     11            10.78
7     13            12.95
8     15            15.11
9     17            17.27
10    20            19.44

With only 10 data points the estimates are not very precise, but the predicted values track the actual values closely, as the very high correlation suggests.

Model Performance

After the model is built, if the difference between the predicted and actual values is small, it is considered a good model and can be used to make future predictions. What counts as “small” depends entirely on the task you want to perform and on how much variation in the data can be tolerated. Here are a few metrics we can use to measure the error in the model:

R-Square (R²)

Total Sum of Squares (TSS): A sum of squares is a measure of how a data set varies around a central number (like the mean). The Total Sum of Squares tells how much variation there is in the dependent variable.

$$\mathrm{TSS} = \sum_i (Y_i - \bar{Y})^2$$

Residual Sum of Squares (RSS): The residual sum of squares tells you how much of the dependent variable’s variation your model did not explain. It is the sum of the squared differences between the actual Y and the predicted Y (written Ŷ).

$$\mathrm{RSS} = \sum_i (Y_i - \hat{Y}_i)^2$$

(TSS - RSS) measures the amount of variability in the response that is explained by performing the regression, and R² expresses it as a fraction of the total:

$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

Properties of R²

- For a least-squares fit with an intercept, R² on the training data ranges from 0 to 1.
- An R² of 0 means that the model explains none of the variation in the dependent variable, i.e. there is no linear relationship between the dependent and the independent variable.
- An R² of 1 means the dependent variable can be predicted from the independent variable without any error.
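The ten-point example and the R² definition above can be checked numerically. The following is a minimal sketch in plain Python using only the x and y values from the table; the variable names are illustrative.

```python
import math

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 3, 6, 9, 11, 13, 15, 17, 20]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n

# Sums of squares and cross-products around the means
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_yy = sum((yi - y_mean) ** 2 for yi in y)

# Sample standard deviations and Pearson correlation
sd_x = math.sqrt(s_xx / (n - 1))
sd_y = math.sqrt(s_yy / (n - 1))
r = s_xy / math.sqrt(s_xx * s_yy)

# B1 = Correlation * (Std. Dev. of y / Std. Dev. of x), B0 = Mean(Y) - B1 * Mean(X)
b1 = r * sd_y / sd_x
b0 = y_mean - b1 * x_mean

# R-squared from TSS and RSS
y_pred = [b0 + b1 * xi for xi in x]
tss = s_yy
rss = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
r2 = 1 - rss / tss

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}, R^2 = {r2:.4f}")
```

For simple linear regression with an intercept, the R² computed this way equals the square of the correlation between x and y, which is a quick consistency check on the calculation.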
An R² between 0 and 1 indicates the extent to which the dependent variable is predictable. An R² of 0.20 means that 20% of the variance in Y is predictable from X; an R² of 0.40 means that 40% is predictable; and so on.

Root Mean Square Error (RMSE)

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). The formula for calculating RMSE is:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (\text{Predicted}_i - \text{Actual}_i)^2}$$

where N is the total number of observations.

When standardized observations are used as RMSE inputs, there is a direct relationship with the correlation coefficient. For example, if the correlation coefficient is 1, the RMSE will be 0, because all of the points lie on the regression line (and therefore there are no errors).

Mean Absolute Percentage Error (MAPE)

RMSE has certain limitations (for instance, it depends on the scale of the target), so analysts often prefer MAPE, which expresses the error as a percentage so that different models can be compared on the same task. The formula for calculating MAPE can be written as:

$$\mathrm{MAPE} = \frac{100}{N}\sum_{i=1}^{N} \left| \frac{\text{Actual}_i - \text{Predicted}_i}{\text{Actual}_i} \right|$$

where N is the total number of observations.

Feature Selection

Feature selection is the automatic selection of the attributes in your data that are most relevant to the predictive model you are working on. It seeks to reduce the number of attributes in the dataset by eliminating the features which are not required for the model construction. Feature selection does not alter an attribute that is being considered; rather, it sets that attribute aside and works only with the features which affect the model.

Feature selection aids your mission to create an accurate predictive model. It helps you by choosing features that will give you as good or better accuracy whilst requiring less data. Feature selection methods can be used to identify and remove unnecessary, irrelevant and redundant attributes that do not contribute to the accuracy of the model or may even decrease it.
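As one illustration of identifying weak attributes, each feature can be scored by the absolute value of its correlation with the target and the features ranked by that score. This is only one simple scoring choice among many, and the data, feature names and helper `pearson_r` below are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    a_mean, b_mean = sum(a) / n, sum(b) / n
    s_ab = sum((x - a_mean) * (y - b_mean) for x, y in zip(a, b))
    s_aa = sum((x - a_mean) ** 2 for x in a)
    s_bb = sum((y - b_mean) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)


# Made-up data: marks obtained plus two candidate features
marks = [52, 55, 61, 64, 70, 74]
features = {
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "shoe_size": [42, 38, 41, 39, 43, 40],
}

# Score every feature by |r| with the target, then rank from strongest to weakest
scores = {name: abs(pearson_r(values, marks)) for name, values in features.items()}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(f"{name}: |r| = {score:.3f}")
```

A feature with a score near zero is a candidate for removal, although a low linear correlation alone does not prove that a feature is useless.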
Having fewer attributes is desirable because it reduces the complexity of the model, and a simpler model is easier to understand, explain and work with.

Feature Selection Algorithms:

- Filter Method: This method involves assigning scores to individual features and ranking them. The features that have little to no impact are removed from consideration while constructing the model.
- Wrapper Method: The wrapper method is similar to the filter method, except that it considers attributes in groups: a set of attributes is taken and checked for its impact on the model, and if it has none, another combination is tried.
- Embedded Method: The embedded method learns which features affect the model while the model is being constructed and takes only those features into consideration, which often makes it the most accurate of the three. The most common embedded feature selection methods are regularization methods.

Cost Function

The cost function helps to find the best possible values for drawing the line of best fit through the data points. Since we want to reduce the error of the resulting value, we turn the search for the line into the problem of minimizing the error between the predicted value and the actual value.

$$J = \frac{1}{n}\sum_{i=1}^{n} (\text{pred}_i - y_i)^2$$

Here, J is the cost function. It measures the error between the predicted values and the actual values. We square the error at each data point, sum these squares over all the data points, and divide by the total number of data points. This cost function J is also called the Mean Squared Error (MSE) function. Using this MSE function, we adjust the coefficients so that the MSE settles at its minimum, reducing the cost function.

Gradient Descent

Gradient Descent is an optimization algorithm that helps machine learning models find a path to a minimum value using repeated steps.
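To make the repeated steps concrete, here is a minimal sketch of gradient descent minimizing the MSE cost J for a one-feature model Y = b0 + b1*x. The data (points lying exactly on y = 1 + 2x), the learning rate and the number of steps are assumptions chosen for illustration, not values from the article.

```python
def mse(b0, b1, xs, ys):
    """Mean squared error J of the line b0 + b1*x on the data."""
    n = len(xs)
    return sum((b0 + b1 * x - y) ** 2 for x, y in zip(xs, ys)) / n


def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Minimize the MSE by repeatedly stepping against its gradient."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to b0 and b1
        grad_b0 = 2 / n * sum(errors)
        grad_b1 = 2 / n * sum(e * x for e, x in zip(errors, xs))
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


# Made-up data generated from y = 1 + 2x with no noise
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

b0, b1 = gradient_descent(xs, ys)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, J = {mse(b0, b1, xs, ys):.2e}")
```

If the learning rate is too large the updates can diverge, and if it is too small many more steps are needed, so in practice both values are tuned for the data at hand.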
Gradient descent is used to minimize a function so that it gives its lowest output. This function is called the Loss Function. The loss function shows us how much error is produced by the machine learning model compared to the actual results. Our aim should be to lower the cost function as much as possible, and one way of achieving this is gradient descent. When the equations are too complex to solve directly, the partial derivative of the cost function with respect to each parameter indicates how to adjust that parameter toward its optimal value. You may refer to the article on Gradient Descent for Machine Learning.

Simple Linear Regression

Optimization is a big part of machine learning, and almost every machine learning algorithm has an optimization technique at its core. Gradient Descent is one such optimization algorithm, used to find the values of the coefficients of a function that minimize the cost function. Gradient Descent is best applied when the solution cannot be obtained by analytical methods (linear algebra) and must be obtained by an optimization technique.

Residual Analysis: Simple linear regression models the relationship between the magnitude of one variable and that of a second: for example, as x increases, y also increases, or as x increases, y decreases. Correlation is another way to measure how two variables are related. The models built by simple linear regression try to predict the actual result, but most often the prediction deviates from it. Residual analysis is used to calculate by how much the estimated value has deviated from the actual result.

Null Hypothesis and p-value: During feature selection, the null hypothesis is used to find which attributes do not affect the result of the model. Hypothesis tests are used to test the validity of a claim that is made about a particular attribute of the model. This claim that’s on trial is, in essence, called the null hypothesis.
A p-value helps to determine the significance of the results. The p-value is a number between 0 and 1 and is interpreted in the following way:

- A small p-value (less than 0.05) indicates strong evidence against the null hypothesis, so the null hypothesis is rejected.
- A large p-value (greater than 0.05) indicates weak evidence against the null hypothesis, so we fail to reject the null hypothesis.
- A p-value very close to the cut-off (about 0.05) is considered marginal (it could go either way). In this case, the p-value should be reported so that readers can draw their own conclusions.

Ordinary Least Squares

Ordinary Least Squares (OLS), also known as ordinary least squares regression or least squared errors regression, is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function so as to minimize the sum of the squares of the differences between the observed values of the dependent variable and the values predicted by the linear function. There are two types of relationships that may occur: linear and curvilinear. A linear relationship is a straight line drawn through the central tendency of the points, whereas a curvilinear relationship is a curved line. The association between the variables is depicted using a scatter plot. The relationship could be positive or negative, and it can also vary in strength.

The advantage of Ordinary Least Squares regression is that it is easily interpreted and works well with the linear algebra routines available on modern computers. It can be applied to problems with many independent variables and scales efficiently to thousands of data points.
In Linear Regression, OLS is used to estimate the unknown parameters by creating a model which minimizes the sum of the squared errors between the observed data and the predicted values.

Let us simulate some data and look at how the predicted values (Yₑ) differ from the actual values (Y):

    import pandas as pd
    import numpy as np
    from matplotlib import pyplot as plt

    # Generate 'random' data
    np.random.seed(0)
    X = 2.5 * np.random.randn(100) + 1.5   # Array of 100 values with mean = 1.5, stddev = 2.5
    res = 0.5 * np.random.randn(100)       # Generate 100 residual terms
    y = 2 + 0.3 * X + res                  # Actual values of Y

    # Create pandas dataframe to store our X and y values
    df = pd.DataFrame({'X': X, 'y': y})

    # Show the first five rows of our dataframe
    df.head()

              X         y
    0  5.910131  4.714615
    1  2.500393  2.076238
    2  3.946845  2.548811
    3  7.102233  4.615368
    4  6.168895  3.264107

To estimate y using the OLS method, we need to calculate xmean and ymean, the covariance of X and y (xycov), and the variance of X (xvar) before we can determine the values for alpha and beta.

    # Calculate the mean of X and y
    xmean = np.mean(X)
    ymean = np.mean(y)

    # Calculate the terms needed for the numerator and denominator of beta
    df['xycov'] = (df['X'] - xmean) * (df['y'] - ymean)
    df['xvar'] = (df['X'] - xmean)**2

    # Calculate beta and alpha
    beta = df['xycov'].sum() / df['xvar'].sum()
    alpha = ymean - (beta * xmean)
    print(f'alpha = {alpha}')
    print(f'beta = {beta}')

    alpha = 2.0031670124623426
    beta = 0.3229396867092763

Now that we have an estimate for alpha and beta, we can write our model as Yₑ = 2.003 + 0.323 X, and make predictions:

    ypred = alpha + beta * X

Let’s plot our prediction ypred against the actual values of y, to get a better visual understanding of our model.

    # Plot regression against actual data
    plt.figure(figsize=(12, 6))
    plt.plot(X, ypred)       # regression line
    plt.plot(X, y, 'ro')     # scatter plot showing actual data
    plt.title('Actual vs Predicted')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.show()

The blue line in the resulting graph is our line of best fit, Yₑ = 2.003 + 0.323 X. If you observe the graph carefully, you will notice that there is a linear relationship between X and Y. Using this model, we can predict Y from any value of X. For example, for X = 8,

Yₑ = 2.003 + 0.323 (8) = 4.587

Regularization

Regularization is a technique used in regression to shrink the coefficient estimates towards zero. This discourages the model from fitting patterns that don’t actually represent the true properties of the data but have appeared by random chance. It does so by adding a penalty on the size of the coefficients to the quantity being minimized. Earlier we saw that to estimate the regression coefficients β in the least squares method, we must minimize the Residual Sum of Squares (RSS). Let the RSS equation in this case be:

$$\mathrm{RSS} = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$$

The general linear regression model can be expressed using a condensed formula:

$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

Here, β = [β₀, β₁, …, βₚ].

Minimizing the RSS adjusts the coefficients β based on the training data. If the training data contains noise, the estimated coefficients won’t generalize well to future data. This is where regularization comes in and shrinks, or regularizes, these learned estimates towards zero.

Ridge regression

Ridge regression is very similar to least squares, except that the Ridge coefficients are estimated by minimizing a different quantity. In particular, the Ridge regression coefficients β are the values that minimize the following quantity:

$$\mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$

Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of the model. λ controls the relative impact of the two components: the RSS and the penalty term. If λ = 0, Ridge regression produces the same result as the least squares method. If λ → ∞, all estimated coefficients tend to zero. Ridge regression produces different estimates for different values of λ. The optimal choice of λ is crucial and should be made with cross-validation.
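To see the effect of λ, consider the special case of a single predictor where x and y are first centered, so that the unpenalized intercept drops out. Minimizing Σ(yᵢ − βxᵢ)² + λβ² on the centered data then gives the closed form β = Sxy / (Sxx + λ). The sketch below uses made-up data and a hypothetical helper `ridge_slope`; it illustrates the shrinkage only and is not a general ridge implementation.

```python
def ridge_slope(xs, ys, lam):
    """Ridge slope for one predictor, computed on centered x and y."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    # lam = 0 recovers the ordinary least squares slope s_xy / s_xx
    return s_xy / (s_xx + lam)


# Made-up data, roughly following y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

lambdas = [0, 1, 10, 100, 1e6]
slopes = [ridge_slope(xs, ys, lam) for lam in lambdas]

for lam, slope in zip(lambdas, slopes):
    print(f"lambda = {lam:>9}: slope = {slope:.5f}")
```

The slope equals the least squares estimate at λ = 0 and moves steadily towards zero as λ grows, which matches the two limiting cases described above.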
The penalty used by ridge regression is the squared L2 norm of the coefficient vector.

The coefficients generated by the Ordinary Least Squares method are scale-equivariant: if an input variable is multiplied by a constant, the corresponding coefficient is divided by the same constant, so the product of the coefficient and the input variable remains the same. The same is not true for ridge regression, so we need to bring the input variables to the same scale before fitting. To standardize the variables, we subtract their means and divide by their standard deviations.

Lasso Regression

Least Absolute Shrinkage and Selection Operator (LASSO) regression also shrinks the coefficients by adding a penalty to the sum of squares of the residuals, but the lasso penalty has a slightly different effect. The lasso penalty is the sum of the absolute values of the coefficients, which corresponds to the L1 norm of the coefficient vector. Hence, the lasso estimate is defined as the β that minimizes:

$$\mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$

As with ridge regression, the input variables need to be standardized. The lasso penalty makes the solution nonlinear, and there is no closed-form expression for the coefficients as there is in ridge regression. Instead, the lasso solution is a quadratic programming problem, and efficient algorithms are available that compute the entire path of coefficients for different values of λ at the same computational cost as ridge regression.

The lasso penalty has the effect of reducing some coefficients exactly to zero as the regularization increases. For this reason, the lasso can be used for the continuous selection of a subset of features.

Linear Regression with multiple variables

Linear regression with multiple variables is also known as "multivariate linear regression" (strictly speaking, that term refers to models with several dependent variables; "multiple linear regression" is the more precise name).
We now introduce notation for equations where we can have any number of input variables.

- xⱼ⁽ⁱ⁾ = value of feature j in the i-th training example
- x⁽ⁱ⁾ = the input (features) of the i-th training example
- m = the number of training examples
- n = the number of features

The multivariable form of the hypothesis function accommodating these multiple features is as follows:

hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + θ₃x₃ + ⋯ + θₙxₙ

In order to develop intuition about this function, we can think of θ₀ as the basic price of a house, θ₁ as the price per square meter, θ₂ as the price per floor, etc. x₁ will be the number of square meters in the house, x₂ the number of floors, etc.

Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:

$$h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^{T} x$$

This is a vectorization of our hypothesis function for one training example.

Remark: For convenience we assume x₀⁽ⁱ⁾ = 1 for i ∈ 1,…,m. This allows us to do matrix operations with θ and x, making the two vectors θ and x⁽ⁱ⁾ match each other element-wise (that is, have the same number of elements: n+1).

Multiple Linear Regression

How is it different?

In simple linear regression we use a single independent variable to predict the value of a dependent variable, whereas in multiple linear regression two or more independent variables are used to predict the value of a dependent variable. The difference between the two is the number of independent variables. In both cases there is only a single dependent variable.

Multicollinearity

Multicollinearity tells us the strength of the relationship between independent variables. It is a state of very high intercorrelations or inter-associations among the independent variables. It is therefore a type of disturbance in the data, and if it is present, the statistical inferences made about the data may not be reliable. The VIF (Variance Inflation Factor) is used to identify multicollinearity.
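For a given predictor, VIF = 1 / (1 − R²), where R² comes from regressing that predictor on all the other predictors. In the special case of exactly two predictors, this R² equals the squared correlation between them, which keeps the calculation short. The sketch below covers only that two-predictor case; the housing data and the helper names are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    a_mean, b_mean = sum(a) / n, sum(b) / n
    s_ab = sum((x - a_mean) * (y - b_mean) for x, y in zip(a, b))
    s_aa = sum((x - a_mean) ** 2 for x in a)
    s_bb = sum((y - b_mean) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def vif_two_predictors(x1, x2):
    """VIF of either predictor when the model has exactly two predictors."""
    r_squared = pearson_r(x1, x2) ** 2
    if r_squared >= 1.0:
        return math.inf  # perfectly collinear predictors
    return 1.0 / (1.0 - r_squared)


# Made-up housing features
size_sqm = [50, 60, 80, 100, 120, 150]
rooms = [2, 2, 3, 4, 4, 6]      # grows with size, so strongly collinear
floors = [1, 2, 1, 2, 1, 2]     # only weakly related to size

vif_size_rooms = vif_two_predictors(size_sqm, rooms)
vif_size_floors = vif_two_predictors(size_sqm, floors)
print(f"size & rooms:  VIF = {vif_size_rooms:.2f}")
print(f"size & floors: VIF = {vif_size_floors:.2f}")
```

A VIF close to 1 indicates little collinearity, while large values indicate that a predictor is largely explained by the others. With more than two predictors, the R² has to come from a full regression of each predictor on the rest.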
As a rule of thumb, if the VIF value of a variable is greater than 4, we exclude that variable from our model.

There are certain reasons why multicollinearity occurs:

- It is caused by an inaccurate use of dummy variables.
- It is caused by the inclusion of a variable which is computed from other variables in the data set.
- It can also result from the repetition of the same kind of variable.
- It generally occurs when the variables are highly correlated with each other.

Multicollinearity can result in several problems:

- The partial regression coefficients may not be estimated precisely, and the standard errors are likely to be high.
- The signs as well as the magnitudes of the partial regression coefficients may change from one sample to another.
- It becomes difficult to assess the relative importance of the independent variables in explaining the variation in the dependent variable.

Iterative Models

Models should be tested and upgraded again and again for better performance. Multiple iterations allow the model to learn from its previous results and take them into consideration while performing the task again.

Making predictions with Linear Regression

Linear Regression can be used to predict the value of an unknown variable from a known variable with the help of a straight line (also called the regression line). The prediction should only be made if there is a significant correlation between the known and the unknown variable, as shown by both a correlation coefficient and a scatterplot.

The general procedure for using regression to make good predictions is the following:

- Research the subject area so that the model can be built on the results produced by similar models. This research helps with the subsequent steps.
- Collect data for appropriate variables which have some correlation with the quantity being predicted.
- Specify and assess the regression model.
- Run repeated tests so that the model has more data to work with.

To test whether the model is good enough, observe whether:

- The scatter plot forms a linear pattern.
- The correlation coefficient r has a value above 0.5 or below -0.5. A positive value indicates a positive relationship and a negative value represents a negative relationship.

If the correlation coefficient shows a strong relationship between variables but the scatter plot is not linear, the results can be misleading. Examples of how to use linear regression have been shown earlier.

Data preparation for Linear Regression

Step 1: Linear Assumption

The first step in data preparation is checking which independent variables have some sort of linear correlation with the dependent variable.

Step 2: Remove Noise

This is the process of reducing the number of attributes in the dataset by eliminating the features which contribute little or nothing to the construction of the model.

Step 3: Remove Collinearity

Collinearity tells us the strength of the relationship between independent variables. If two or more variables are highly collinear, it does not make sense to keep all of them while evaluating the model, and hence we can keep just one of them.

Step 4: Gaussian Distributions

The linear regression model will produce more reliable results if the input and output variables have a Gaussian distribution. The central limit theorem states that a sample mean from an infinite population is approximately normal, or Gaussian, with the same mean as the underlying population and variance equal to the population variance divided by the sample size.
The approximation improves as the sample size gets large.

Step 5: Rescale Inputs

The linear regression model will produce more reliable predictions if the input variables are rescaled using standardization or normalization.

Linear Regression with statsmodels

We have already discussed the OLS method; now we will see how to use it in the statsmodels library. For this we will be using the popular advertising dataset. Here, we will only look at the TV variable and explore whether spending on TV advertising can predict the number of sales for the product. Let’s start by importing this csv file as a pandas dataframe using read_csv():

    import pandas as pd

    # Import and display first five rows of advertising dataset
    advert = pd.read_csv('advertising.csv')
    advert.head()

          TV  Radio  Newspaper  Sales
    0  230.1   37.8       69.2   22.1
    1   44.5   39.3       45.1   10.4
    2   17.2   45.9       69.3   12.0
    3  151.5   41.3       58.5   16.5
    4  180.8   10.8       58.4   17.9

Now we will use statsmodels’ OLS function to initialize a simple linear regression model. It takes the formula y ~ X, where X is the predictor variable (TV advertising costs) and y is the output variable (Sales).
Then, we will fit the model by calling the OLS object's fit() method.

    import statsmodels.formula.api as smf

    # Initialise and fit linear regression model using `statsmodels`
    model = smf.ols('Sales ~ TV', data=advert)
    model = model.fit()

Once we have fit the simple regression model, we can predict sales from the fitted equation using the .predict method, and visualise the model by plotting sales_pred against the TV advertising costs to show the line of best fit.

    import matplotlib.pyplot as plt

    # Predict values
    sales_pred = model.predict()

    # Plot regression against actual data
    plt.figure(figsize=(12, 6))
    plt.plot(advert['TV'], advert['Sales'], 'o')          # scatter plot showing actual data
    plt.plot(advert['TV'], sales_pred, 'r', linewidth=2)  # regression line
    plt.xlabel('TV Advertising Costs')
    plt.ylabel('Sales')
    plt.title('TV vs Sales')
    plt.show()

The resulting graph shows a positive linear relationship between TV advertising costs and Sales. In other words, spending more on TV advertising predicts a higher number of sales.

Linear Regression with scikit-learn
Let us learn to implement linear regression models using sklearn. For this model as well, we will continue to use the advertising dataset, but now we will use two predictor variables to create a multiple linear regression model:

    Yₑ = α + β₁X₁ + β₂X₂ + … + βₚXₚ, where p is the number of predictors.

In our example, we will be predicting Sales using the variables TV and Radio, i.e.
our model can be written as:

    Sales = α + β₁*TV + β₂*Radio

    from sklearn.linear_model import LinearRegression

    # Build linear regression model using TV and Radio as predictors
    # Split data into predictors X and output y
    predictors = ['TV', 'Radio']
    X = advert[predictors]
    y = advert['Sales']

    # Initialise and fit model
    lm = LinearRegression()
    model = lm.fit(X, y)
    print(f'alpha = {model.intercept_}')
    print(f'betas = {model.coef_}')

    alpha = 4.630879464097768
    betas = [0.05444896 0.10717457]

    model.predict(X)

Now that we have fit a multiple linear regression model to our data, we can predict sales from any combination of TV and Radio advertising costs. For example, suppose you want to know how many sales we would make if we invested $600 in TV advertising and $300 in Radio advertising. You can find out with:

    new_X = [[600, 300]]
    print(model.predict(new_X))

    [69.4526273]

The output is 69.45, which means that if we invest $600 in TV and $300 in Radio advertising, we can expect to sell approximately 69 units.

Summary
Let us sum up what we have covered in this article so far:
- How to understand a regression problem
- What linear regression is and how it works
- The Ordinary Least Squares method and Regularization
- Implementing Linear Regression in Python using the statsmodels and sklearn libraries

We have discussed a couple of ways to implement linear regression and build efficient models for certain business problems. If you are inspired by the opportunities provided by machine learning, enrol in our Data Science and Machine Learning Courses for more lucrative career options in this landscape.
What is K-Nearest Neighbor in Machine Learning: K-NN Algorithm

If you are thinking of a simple, easy-to-implement supervised machine learning algorithm which can be used to solve both classification as well as regression problems, K-Nearest Neighbor (K-NN) is the perfect choice. Learning K-NN is a great way to introduce yourself to machine learning and classification in general. Also, you will find a lot of intense application of K-NN in data mining, pattern recognition, semantic searching, intrusion detection and anomaly detection.K-Nearest Neighbors is one of the most basic supervised machine learning algorithms, yet very essential. A supervised machine learning algorithm is one of the types of machine learning algorithm which is dependent on labelled input data in order to learn a function which is capable of producing an output whenever a new unlabeled data is given as input.In real life scenarios, K-NN is widely used as it is non-parametric which means it does not make any underlying assumptions about the distributions of data. With the business world entirely revolving around Data Science, it has become one of the most lucrative fields. Hence, the heavy demand for a Data Science Certification.Parametric vs Non-parametric MethodsLet us look into how different is a parametric machine learning algorithm from a nonparametric machine learning algorithm.Machine learning, in other words can be called as learning a function (f) which maps input variables (X) to the output variables (Y).Y=f(X)An algorithm learns about the target mapping function from the training data. 
As we are unaware of the form of the function, we have to evaluate various machine learning algorithms and figure out which algorithms perform better at providing an approximation of the underlying function.Statistical Methods are classified on the basis of what we know about the population we are studying.Parametric statistics is a branch of statistics which assumes that sample data comes from a population that follows a probability distribution based on a fixed set of parameters.Nonparametric statistics is the branch of statistics that is not based solely on population parameters.Parametric Machine Learning AlgorithmsThis particular algorithm involves two steps:Selecting a form for the functionLearning the coefficients for the function from the training dataLet us consider a line to understand functional form for the mapping function as it is used in linear regression and simplify the learning process.b0 + b1*x1 + b2*x2 = 0Where b0, b1 and b2 are the coefficients of the line which control the intercept and slope, and x1 and x2 are two input variables.All we have to do now is to estimate the coefficients of the line equation to get a predictive model for the problem. Now, the problem is that the actual unknown underlying function may not be a linear function like a line. In that case, the approach will give poor results. Some of the examples of parametric machine learning algorithms are mentioned below:Logistic RegressionLinear Discriminant AnalysisPerceptronNaive BayesSimple Neural NetworksNonparametric Machine Learning AlgorithmsNonparametric methods always try to find the best fit training data while constructing the mapping function which also allows it to fit a large number of functional forms. 
Some examples of nonparametric machine learning algorithms are:
- k-Nearest Neighbors
- Decision Trees like CART and C4.5
- Support Vector Machines

The best example of a nonparametric machine learning algorithm is k-nearest neighbors, which makes predictions based on the k most similar training patterns for a new data instance. The method simply assumes that patterns which are close together are likely to be of a similar type.

Benefits of parametric machine learning algorithms:
- Simple to understand and interpret results
- Learning from data is fast
- Less training data is required

Benefits of nonparametric machine learning algorithms:
- Flexible enough to fit a large number of functional forms
- No assumptions about the underlying functions
- Provides high performance for prediction

Limitations of parametric machine learning algorithms:
- Choosing a functional form constrains the method to the specified form
- It has limited complexity and is more suited to simpler problems
- It is unlikely to match the underlying mapping function and results in poor fit

Limitations of nonparametric machine learning algorithms:
- Requires more training data in order to estimate the mapping function
- Due to more parameters to train, it is comparatively slower
- There is a risk of overfitting the training data

Method Based Learning
There are several learning models, namely:
- Association rules based
- Ensemble method based
- Deep Learning based
- Clustering method based
- Regression Analysis based
- Bayesian method based
- Dimensionality reduction based
- Kernel method based
- Instance based

Let us understand what Instance Based Learning is all about.

Instance Based Learning (IBL)
- Instance-based methods are the simplest form of learning
- Instance-based learning is lazy learning
- The K-NN model works on identified instances
- Instances are retrieved from memory and then used to classify the new query instance
- Instance-based learning is also called memory-based or case-based learning

Under instance-based learning we have the nearest-neighbor classifier, which uses the k "closest" points (nearest neighbors) to perform classification.
For example: It’s how people judge by observing our peers. We tend to move with people of similar attributes.Lazy Learning vs Eager LearningLazy LearningEager LearningSimply stores the training data and waits until it is given a test tuple.Munges the training data as soon as it receives it.It's slow as it calculates based on the current data set instead of coming up with an algorithm based on historical data.It's fast as it has pre-calculated algorithm.Localized data so generalization takes time at every iteration.On the basis of training set ,it constructs a classification model before receiving new data to classify.What is K-NN?One of the biggest applications of K-Nearest Neighbor search is Recommender Systems. If you have noticed while you are shopping as a user on Amazon and you like a particular item, you are recommended with similar items.It also recommends similar items bought by other users and other set of items which are often bought together. Basically, the algorithm compares the set of users who like each item and looks for similarity. This not only applies to recommending items or products but also recommending media and even advertisements to display to a user.K nearest neighbors or K-NN Algorithm is a simple algorithm which uses the entire dataset in its training phase. Whenever a prediction is required for an unseen data instance, it searches through the entire training dataset for k-most similar instances and the data with the most similar instance is finally returned as the prediction.This algorithm suggests that if you’re similar to your neighbours, then you are one of them. Let us consider a simple example, if apple looks more similar to peach, pear, and cherry (fruits) than monkey, cat or a rat (animals), then most likely apple is a fruit.Nearest Neighbours algorithm has been in action for the last sixty years. It is mainly used in statistical estimation and pattern recognition, as a non-parametric method, for regression and classification. 
The main aim of the K-Nearest Neighbor algorithm is to classify a new data point by comparing it to all previously seen data points. The classification of the k most similar previous cases are used for predicting the classification of the current data point. It is a simple algorithm which stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions).When do we use K-NN algorithm?K-NN algorithm can be used for applications which require high accuracy as it makes highly accurate predictions. The quality of predictions is completely dependent on the distance measure. Thus, this algorithm is suitable for applications for which you have sufficient domain knowledge so that it can help you select an appropriate measure.As we have already seen K-NN algorithm is a type of lazy learning, the computation for the generation is postponed until classification which indeed increases the costs of computation compared to other machine learning algorithms. But still K-NN is considered to be the better choice for applications where accuracy is more important and predictions are not requested frequently.K-NN can be used for both regression and classification predictive problems. However, in the industry it is mostly used in classification problems.Generally we mainly look at 3 important aspects in order to evaluate any technique:Ease to interpret outputCalculation timePredictive PowerLet us consider a few examples to place K-NN in the scale :If you notice the chart mentioned above, K-NN algorithm exceeds in most of the parameters. It is most commonly used for ease of interpretation and low calculation time.How does the K-NN algorithm work?K-NN algorithm works on the basis of feature similarity. The classification of a given data point is determined by how closely out-of-sample features resemble our training set.The above figure shows an example of k-NN classification. 
If you consider the nearest neighbor to the test sample, it is a blue square (Class 1) and k=1. This falls inside the inner circle.Now, if you consider k=3, then you will see 2 red triangles and only 1 blue square falls under the outer circle. Thus, the test sample is classified as a red triangle now (Class 2).Similarly, if you consider k=5, it is assigned to the first class (3 squares vs. 2 triangles outside the outer circle).K-NN in RegressionIn regression problems, K-NN is used for prediction based on the mean or the median of the K-most similar instances.K-NN in ClassificationK-nearest-neighbor classification was actually developed from the need to perform discriminant analysis when reliable parametric estimates of probability densities are unknown or are difficult to determine. When K-NN is used for classification, the output is easily calculated by the class having the highest frequency from the K-most similar instances. The class with maximum vote is taken into consideration for prediction.The probabilities of Classes can be calculated as the normalized frequency of samples that belong to each class in the set of K most similar instances for a new data instance.For example, in a binary classification problem (class is 0 or 1):p(class=0) = count(class=0) / (count(class=0)+count(class=1))If you are using K and you have an even number of classes (e.g. 2) it is a good idea to choose a K value with an odd number to avoid a tie. And the inverse, use an even number for K when you have an odd number of classes.Ties can be broken consistently by expanding K by 1 and looking at the class of the next most similar instance in the training dataset.Making Predictions with K-NNA case can be classified by a majority vote of its neighbors. The case is then assigned to the most common class amongst its K nearest neighbors measured by a distance function. 
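The majority vote and the normalized class frequencies described above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function name `vote` and the sample labels are made up for the example, and the labels are assumed to be those of the K nearest neighbors, already found by some distance function.

```python
from collections import Counter


def vote(neighbor_labels):
    """Return (predicted_class, class_probabilities) for the K nearest labels."""
    counts = Counter(neighbor_labels)
    total = len(neighbor_labels)
    # Normalized frequency of each class among the K neighbors
    probabilities = {label: n / total for label, n in counts.items()}
    # most_common(1) picks the most frequent label; on a tie it simply
    # returns the label seen first, which is not a principled tie-break.
    predicted = counts.most_common(1)[0][0]
    return predicted, probabilities


# Hypothetical labels of the 5 nearest neighbors in a binary problem
labels_of_5_nearest = [0, 1, 1, 0, 1]
predicted, probabilities = vote(labels_of_5_nearest)
print(predicted)       # 1
print(probabilities)   # {0: 0.4, 1: 0.6}
```

With an odd K and two classes, as here, a tie cannot occur, which is the reason for the odd-K advice given earlier.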
Suppose the value of K is 1; then the case is simply assigned to the class of its nearest neighbor.

The three distance measures (Euclidean, Manhattan and Minkowski) are valid only for continuous variables. For categorical variables, the Hamming distance is used. This also brings up the issue of standardizing the numerical variables between 0 and 1 when there is a mixture of numerical and categorical variables in the dataset.

A good value for K is best chosen by inspecting the data. A larger K generally reduces the effect of noise and is often more accurate, but this is not always true. Cross-validation is another way to determine a good K retrospectively, by using an independent dataset to validate it. Empirically, the optimal K for most datasets has been between 3 and 10, which gives better results than 1-NN.

For example, consider the following data concerning credit default. Age and Loan are two numerical variables (predictors) and Default is the target. The Rank column marks the three nearest neighbors of the unknown case.

    Age   Loan       Default   Distance   Rank
    25    $40,000    N         102000
    35    $60,000    N          82000
    45    $80,000    N          62000
    20    $20,000    N         122000
    35    $120,000   N          22000     2
    52    $18,000    N         124000
    23    $95,000    Y          47000
    40    $62,000    Y          80000
    60    $100,000   Y          42000     3
    48    $220,000   Y          78000
    33    $150,000   Y           8000     1
    48    $142,000   ?

We can use this training set to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance. If K=1, the nearest neighbor is the last case in the training set, with Default=Y. With K=3, there are two Default=Y and one Default=N among the three closest neighbors, so the prediction for the unknown case is again Default=Y.

Standardized Distance
One major drawback of calculating distance measures directly from the training set arises when variables have different measurement scales or there is a mixture of numerical and categorical variables.
For example, if one variable is based on annual income in dollars and the other is based on age in years, then income will have a much higher influence on the distance calculated. One solution is to standardize the training set as shown below.

    Age     Loan   Default   Distance
    0.125   0.11   N         0.7652
    0.375   0.21   N         0.5200
    0.625   0.31   N         0.3160
    0       0.01   N         0.9245
    0.375   0.50   N         0.3428
    0.8     0.00   N         0.6220
    0.075   0.38   Y         0.6669
    0.5     0.22   Y         0.4437
    1       0.41   Y         0.3650
    0.7     1.00   Y         0.3861
    0.325   0.65   Y         0.3771
    0.7     0.61   ?

Using the standardized distance on the same training set, the unknown case returns a different nearest neighbor (the third row, with Default=N), which is not a good sign of robustness.

Between-sample geometric distance
The k-nearest-neighbor classifier is commonly based on the Euclidean distance between a test sample and the specified training samples. Let xi be an input sample with p features (xi1, xi2, …, xip), n be the total number of input samples (i=1,2,…,n) and p the total number of features (j=1,2,…,p). The Euclidean distance between sample xi and xl (l=1,2,…,n) is defined as:

    $$d(x_i, x_l) = \sqrt{(x_{i1} - x_{l1})^2 + (x_{i2} - x_{l2})^2 + \dots + (x_{ip} - x_{lp})^2}$$

A graphical representation of the nearest neighbor concept is illustrated in the Voronoi tessellation. The tessellation shows 19 samples marked with a "+", and the Voronoi cell, R, surrounding each sample. A Voronoi cell encapsulates all neighboring points that are nearest to each sample and is defined as:

    $$R_i = \{\, x \in \mathbb{R}^p : d(x, x_i) \le d(x, x_m) \ \text{for all}\ m \ne i \,\}$$

where Ri is the Voronoi cell for sample xi, and x represents all possible points within Voronoi cell Ri.

The Voronoi tessellation reflects two characteristics of the example 2-dimensional coordinate system: i) all possible points within a sample's Voronoi cell are the nearest neighboring points for that sample, and ii) for any sample, the nearest sample is determined by the closest Voronoi cell edge.

According to the latter characteristic, the k-nearest-neighbor classification rule is to assign to a test sample the majority category label of its k nearest training samples. In practice, k is usually chosen to be odd, so as to avoid ties.
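To make the Euclidean distance and the k-nearest-neighbor rule concrete, the sketch below applies them to the credit-default table from earlier in the article, first on the raw values and then after min-max scaling each variable to the range [0, 1]. The helper names (`euclidean`, `k_nearest_labels`, `min_max_scale`) are illustrative only, and min-max scaling is an assumption about how the standardized table was produced; it does reproduce the standardized Age and Loan values shown there.

```python
import math

# Training rows from the credit-default table: (age, loan, default)
training = [
    (25, 40_000, "N"), (35, 60_000, "N"), (45, 80_000, "N"),
    (20, 20_000, "N"), (35, 120_000, "N"), (52, 18_000, "N"),
    (23, 95_000, "Y"), (40, 62_000, "Y"), (60, 100_000, "Y"),
    (48, 220_000, "Y"), (33, 150_000, "Y"),
]
query = (48, 142_000)  # the unknown case


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def k_nearest_labels(points, labels, target, k):
    """Labels of the k training points closest to target, nearest first."""
    ranked = sorted(zip(points, labels), key=lambda pair: euclidean(pair[0], target))
    return [label for _, label in ranked[:k]]


def min_max_scale(points, target):
    """Rescale every feature to [0, 1] using the training minimum and maximum."""
    columns = list(zip(*points))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]

    def scale(point):
        return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, lows, highs))

    return [scale(p) for p in points], scale(target)


points = [(age, loan) for age, loan, _ in training]
labels = [default for _, _, default in training]

raw_k1 = k_nearest_labels(points, labels, query, 1)
raw_k3 = k_nearest_labels(points, labels, query, 3)

scaled_points, scaled_query = min_max_scale(points, query)
scaled_k1 = k_nearest_labels(scaled_points, labels, scaled_query, 1)
scaled_k3 = k_nearest_labels(scaled_points, labels, scaled_query, 3)

print(raw_k1, raw_k3)        # ['Y'] ['Y', 'N', 'Y']
print(scaled_k1, scaled_k3)  # ['N'] ['N', 'N', 'Y']
```

On the raw values the loan amount dominates the distance, so both K=1 and K=3 predict Default=Y. After scaling, age carries comparable weight and the nearest neighbors change, which is the sensitivity the standardized table illustrates.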
The k = 1 rule is generally called the nearest-neighbor classification rule.

Curse of Dimensionality
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions). Such phenomena do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.

K-NN works well with a small number of input variables (p), but struggles when the number of inputs is very large. Each input variable can be considered a dimension of a p-dimensional input space. For example, with two input variables x1 and x2, the input space is 2-dimensional. As the number of dimensions increases, the volume of the input space grows at an exponential rate.

In higher dimensions, points which are similar may still be separated by large distances. All points end up far away from each other, and our intuition about 2- and 3-dimensional spaces no longer applies. This kind of problem is called the "Curse of Dimensionality".

How is K in K-means different from K in K-NN?
K-Means Clustering and the k-Nearest Neighbors algorithm are both commonly used in Machine Learning. They are often confused with each other, especially when it comes to the k-factor. The 'K' in K-Means Clustering has nothing to do with the 'K' in the K-NN algorithm. K-Means Clustering is an unsupervised learning algorithm used for clustering, whereas K-NN is a supervised learning algorithm used for classification.

K-Means Algorithm
The k-means algorithm is an unsupervised clustering algorithm which takes a set of unlabeled points and groups them into "k" clusters.

The "k" in k-means denotes the number of clusters you would like to have in the end.
Suppose the value of k is 5; it means you will have 5 clusters on the data set.

Let us see how it works.

Step 1: Determine the value of K by the Elbow method and specify the number of clusters K
Step 2: Randomly assign each data point to a cluster
Step 3: Determine the cluster centroid coordinates
Step 4: Determine the distance of each data point to the centroids and re-assign each point to the closest cluster centroid based on minimum distance
Step 5: Calculate the cluster centroids again
Step 6: Repeat steps 4 and 5 until no further improvement is possible, that is, until no data point switches from one cluster to another. The result is a local optimum; k-means is not guaranteed to reach the global one.

Implementation in Python

    # Finding the optimum number of clusters for k-means clustering
    # (x is assumed to hold the Iris feature columns)
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt

    Nc = range(1, 10)
    kmeans = [KMeans(n_clusters=i) for i in Nc]
    score = [kmeans[i].fit(x).score(x) for i in range(len(kmeans))]
    plt.plot(Nc, score)
    plt.xlabel('Number of Clusters')
    plt.ylabel('Score')
    plt.title('Elbow Curve')
    plt.show()

The resulting graph makes it clear why this is called the elbow method: the optimum number of clusters is where the elbow occurs.

Now that we have the optimum number of clusters (k=3), we can move on to applying K-means clustering to the Iris dataset.

    # Implementation of K-Means Clustering
    import numpy as np

    model = KMeans(n_clusters=3)
    model.fit(x)
    model.labels_
    colormap = np.array(['Red', 'Blue', 'Green'])
    # the third positional argument of scatter() sets the marker size
    z = plt.scatter(x.sepal_length, x.sepal_width, x.petal_length, c=colormap[model.labels_])

    # Accuracy of K-Means Clustering
    # (cluster ids are arbitrary, so this is meaningful only when they line up with iris.target)
    from sklearn.metrics import accuracy_score
    accuracy_score(iris.target, model.labels_)

    0.8933333333333333

K-NN Algorithm
By now, we already know that the K-NN algorithm is a supervised classification algorithm. It takes a set of labelled points and uses them to learn how to label other points. To assign a label to a new point, the K-NN algorithm looks for the closest neighbors of the new point and takes a vote.
The label held by the majority of the neighbors around the new point becomes the label of the new point.

The "k" in K-Nearest Neighbors is the number of neighbors it checks. It is supervised because it classifies a point on the basis of the known classification of other points.

Let us see how it works.

Step 1: Determine the value for K.
Step 2: Calculate the distances between the new input (test data) and all the training data. The most commonly used metrics for calculating distance are Euclidean, Manhattan and Minkowski.
Step 3: Sort the distances and determine the k nearest neighbors based on minimum distance values.
Step 4: Analyze the category of those neighbors and assign the category for the test data based on majority vote.
Step 5: Return the predicted class.

Implementation using Python

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import KNeighborsClassifier

    error = []

    # Calculating error for K values from 1 to 39
    # (X_train, X_test, y_train and y_test come from the train/test split shown further below)
    for i in range(1, 40):
        knn = KNeighborsClassifier(n_neighbors=i)
        knn.fit(X_train, y_train)
        pred_i = knn.predict(X_test)
        error.append(np.mean(pred_i != y_test))

    plt.figure(figsize=(12, 6))
    plt.plot(range(1, 40), error, color='black', linestyle='dashed', marker='o',
             markerfacecolor='grey', markersize=10)
    plt.title('Error Rate K Value')
    plt.xlabel('K Value')
    plt.ylabel('Mean Error')

The plot shows for which values of 'K' the error rate is lowest.
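Step 2 above names three distance metrics. As a rough sketch of how they relate, the Minkowski distance of order p reduces to the Manhattan distance for p=1 and to the Euclidean distance for p=2. The function names below are illustrative, not taken from any library; the Hamming distance that the article mentions for categorical variables is included for comparison.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)


def manhattan(a, b):
    """Minkowski distance with p=1: sum of absolute differences."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Minkowski distance with p=2: straight-line distance."""
    return minkowski(a, b, 2)


def hamming(a, b):
    """Number of positions at which two categorical records differ."""
    return sum(ai != bi for ai, bi in zip(a, b))


a, b = (0, 0), (3, 4)
print(manhattan(a, b))   # 7.0
print(euclidean(a, b))   # 5.0
print(hamming(("red", "small"), ("red", "large")))   # 1
```

scikit-learn's KNeighborsClassifier uses the Minkowski metric with p=2 by default, which is why the classifier output later in this article reports metric='minkowski' and p=2.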
Let's fix k=5 and implement the K-NN algorithm.

    # Creating training and test splits
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

    # Performing Feature Scaling
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    # Training K-NN with k=5
    from sklearn.neighbors import KNeighborsClassifier
    classifier = KNeighborsClassifier(n_neighbors=5)
    classifier.fit(X_train, y_train)

    KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                         metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                         weights='uniform')

    y_pred = classifier.predict(X_test)
    from sklearn.metrics import classification_report, confusion_matrix
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

    [[10  0  0]
     [ 0  9  2]
     [ 0  1  8]]
                     precision    recall  f1-score   support

        Iris-setosa       1.00      1.00      1.00        10
    Iris-versicolor       0.90      0.82      0.86        11
     Iris-virginica       0.80      0.89      0.84         9

           accuracy                           0.90        30
          macro avg       0.90      0.90      0.90        30
       weighted avg       0.90      0.90      0.90        30

Practical Applications of K-NN
Now that we have seen how K-NN works, let us look into some of its practical applications.

Recommending products to people with similar interests, recommending movies and TV shows according to a viewer's choices and interests, and recommending hotels and other accommodation while you are travelling, based on your previous bookings.

Assigning credit ratings based on financial characteristics, comparing people with similar financial features in a database.
By analyzing the nature of a credit rating, people with similar financial details, they would be assigned similar credit ratings.Should the bank give a loan to an individual? Would an individual default on his or her loan? Is that person closer in characteristics to people who defaulted or did not default on their loans?Some advanced examples could include handwriting detection (like OCR), image recognition and even video recognition.Some pros and cons of K-NNProsTraining phase of K-nearest neighbor classification is faster in comparison with other classification algorithms.Training of a model is not required for generalization.Simple algorithm — to explain and understand/interpret.High accuracy (relatively) — it is pretty high but not competitive in comparison to better supervised learning models.K-NN can be useful in case of nonlinear data.Versatile — useful for classification or regression.ConsTesting phase of K-nearest neighbor classification is slower and costlier with respect to time and memory. High memory requirement - Requires large memory for storing the entire training dataset.K-NN requires scaling of data because K-NN uses the Euclidean distance between two data points to find nearest neighbors.Euclidean distance is sensitive to magnitudes. The features with high magnitudes will weigh more than features with low magnitudes.Not suitable for large dimensional data.How to improve the performance of K-NN?Rescaling Data: K-NN performs much better if all of the data has the same scale. Normalizing your data to the range [0, 1] is a good idea. It may also be a good idea to standardize your data if it has a Gaussian distribution.Addressing Missing Data: Missing data will mean that the distance between samples can not be calculated. These samples could be excluded or the missing values could be imputed.Reducing Dimensionality: K-NN is suited for lower dimensional data. 
You can try it on high dimensional data (hundreds or thousands of input variables) but be aware that it may not perform as good as other techniques. K-NN can benefit from feature selection that reduces the dimensionality of the input feature space.In this article we have learned about the K-Nearest Neighbor algorithm, where we should use it, how it works and so on. Also, we have discussed about parametric and nonparametric machine learning algorithms, instance based learning, eager and lazy learning, advantages and disadvantages of using K-NN, performance improvement suggestions and have implemented K-NN in Python. To learn more about other machine learning algorithms, join our Data Science Certification course and expand your learning skill set and career opportunities.
How to Stand Out in a Python Coding Interview - Functions, Data Structures & Libraries

Any coding interview is a test which primarily focuses on your technical skills and algorithm knowledge. However, if you want to stand out among the hundreds of interviewees, you should know how to use the common functionalities of Python in a convenient manner.The type of interview you might face can be a remote coding challenge, a whiteboard challenge or a full day on-site interview. So if you can prove your coding skills at that moment, the job letter will reach you in no time. You may go through some of the top Python interview questions and answers provided by experts which are divided into three levels- beginner, intermediate and advanced. A thorough practice of these questions and answers on Python will definitely help you achieve your dream job as a Python Developer, Full Stack engineer, and other top profiles.A Python coding interview is basically a technical interview. They are not just about solving problems, they are more about how technically sound you are and how you can write clean productive Python code. This will show your depth of knowledge about Python and how you can use Python’s built-in functions and libraries to implement your code. Go through our Python Tutorials to learn more about  concepts related to Python. Let us look into some of the built-in functions provided by Python and how to select the correct one, learn about the effective use of data structures, how standard libraries in Python can be utilized and so on.How to Select the Correct Built-in Function?Python’s library of built-in functions is small as compared to the standard library. The built-in functions are always available and are not needed to be imported. It is suggested to learn each function before sitting for the interview. 
For now, let us learn a few built-in functions, how to use them, and what alternatives can be used.

Perform iteration with enumerate() instead of range()
Consider a situation during a coding interview: you have a list of elements and you have to iterate over the list with access to both the indices and the values. To see the difference between iteration with enumerate() and iteration with range(), let us take a look at the classic coding interview question FizzBuzz. It can be solved by iterating over both indices and values. You will be given a list of integers and your task will be as follows:
- Replace all integers that are evenly divisible by 3 with "fizz".
- Replace all integers divisible by 5 with "buzz".
- Replace all integers divisible by 3 and 5 with "fizzbuzz".

Developers often use range() in these situations, which accesses the elements by index:

    >>> list_num = [30, 29, 10, 65, 95, 99]
    >>> for i in range(len(list_num)):
    ...     if list_num[i] % 3 == 0 and list_num[i] % 5 == 0:
    ...         list_num[i] = 'fizzbuzz'
    ...     elif list_num[i] % 3 == 0:
    ...         list_num[i] = 'fizz'
    ...     elif list_num[i] % 5 == 0:
    ...         list_num[i] = 'buzz'
    ...
    >>> list_num
    ['fizzbuzz', 29, 'buzz', 'buzz', 'buzz', 'fizz']

Though range() can be used in many iterative methods, it is better to use enumerate() in this case since it gives access to the element's index and value at the same time:

    >>> list_num = [30, 29, 10, 65, 95, 99]
    >>> for i, num in enumerate(list_num):
    ...     if num % 3 == 0 and num % 5 == 0:
    ...         list_num[i] = 'fizzbuzz'
    ...     elif num % 3 == 0:
    ...         list_num[i] = 'fizz'
    ...     elif num % 5 == 0:
    ...         list_num[i] = 'buzz'
    ...
    >>> list_num
    ['fizzbuzz', 29, 'buzz', 'buzz', 'buzz', 'fizz']

The enumerate() function returns a counter and the element value for each element. The counter starts at 0 by default, which makes it the same as the element's index.
However, if you do not want the counter to start from 0, you can set an offset using the start parameter:

    >>> list_num = [30, 29, 10, 65, 95, 99]
    >>> for i, num in enumerate(list_num, start=11):
    ...     print(i, num)
    ...
    11 30
    12 29
    13 10
    14 65
    15 95
    16 99

You still access all of the same elements when using the start parameter; only the count starts from the specified integer value.

Using List Comprehensions in place of map() and filter()
Python supports list comprehensions, which are easier to read and offer the same functionality as map() and filter(). This is one of the reasons why Guido van Rossum, the creator of Python, felt that dropping map() and filter() was quite uncontroversial.

An example showing map() along with its equivalent list comprehension:

    >>> list_num = [1, 2, 3, 4, 5, 6]
    >>> def square_num(z):
    ...     return z*z
    ...
    >>> list(map(square_num, list_num))
    [1, 4, 9, 16, 25, 36]
    >>> [square_num(z) for z in list_num]
    [1, 4, 9, 16, 25, 36]

Both map() and the list comprehension return the same values, but the list comprehension is easier to read and understand.

An example showing filter() and its equivalent list comprehension:

    >>> def odd_num_check(z):
    ...     return bool(z % 2)
    ...
    >>> list(filter(odd_num_check, list_num))
    [1, 3, 5]
    >>> [z for z in list_num if odd_num_check(z)]
    [1, 3, 5]

It is the same with filter() as it was with map(): the return values are the same, but the list comprehension is easier to follow.

List comprehensions are easier to read, and beginners tend to grasp them more intuitively. Developers coming from other programming languages might disagree, but using list comprehensions during your coding interview is likely to communicate your knowledge of common Python functionality to the recruiter.

Debugging With breakpoint() instead of print()
Debugging is an essential part of writing software, and it shows your knowledge of Python tools that will help you develop quickly on the job in the long run.
However, while using print() to debug a small problem might work at first, your code quickly becomes cluttered. A debugger is usually a faster way to track down a bug than scattering print() calls through the code.

If you're using Python 3.7 or newer, you can simply call breakpoint() at the point in your code where you want to start debugging, without importing anything:

# Complicated Code With Bugs
...
...
...
breakpoint()

Whenever you call breakpoint(), you are dropped into pdb, the Python Debugger. If you're using Python 3.6 or older, you can get the same behaviour with an explicit import:

import pdb; pdb.set_trace()

In this example, pdb.set_trace() drops you into pdb. Since that is a bit harder to remember, breakpoint() is recommended whenever it is available. There are also other debuggers that you can try. Getting used to a debugger before your interview is a great advantage, and you can always fall back on pdb since it is part of the Python Standard Library and is always available.

Formatting Strings with f-Strings

Python has a number of different string formatting techniques, so it can be confusing to know which one to use. For Python 3.6 or greater, f-strings are a good default during a coding interview.

Literal String Interpolation, or f-strings, is a powerful string formatting technique that is more readable, more concise, faster and less error-prone than the other formatting techniques. It supports the string formatting mini-language, which makes string interpolation simpler. You can also embed variables and Python expressions, which are evaluated at run time:

>>> def name_and_age(name, age):
...     return f"My name is {name} and I'm {age / 10:.5f} years old."
...
>>> name_and_age("Alex", 21)
"My name is Alex and I'm 2.10000 years old."

The f-string lets you insert the name Alex into the string, along with his age in the format you want, in one single operation.

Note that Template Strings (string.Template) are suggested instead if the output includes user-generated values.

Sorting Complex Lists with sorted()

Many interview questions are based on sorting, and it is one of the most important concepts to be clear about before you sit for a coding interview. Unless the interviewer asks you to write your own sorting algorithm, it is almost always better to use sorted().

Example code illustrating simple uses of sorting, such as sorting numbers or strings:

>>> sorted([6, 5, 3, 7, 2, 4, 1])
[1, 2, 3, 4, 5, 6, 7]
>>> sorted(['IronMan', 'Batman', 'Thor', 'CaptainAmerica', 'DoctorStrange'], reverse=False)
['Batman', 'CaptainAmerica', 'DoctorStrange', 'IronMan', 'Thor']

sorted() sorts in ascending order by default, which is the same as setting the reverse argument to False. If you are sorting complex data types, you might want to add a function that defines custom sorting rules:

>>> animal_list = [
...     {'type': 'bear', 'name': 'Stephan', 'age': 9},
...     {'type': 'elephant', 'name': 'Devory', 'age': 5},
...     {'type': 'jaguar', 'name': 'Moana', 'age': 7},
... ]
>>> sorted(animal_list, key=lambda animal: animal['age'])
[
    {'type': 'elephant', 'name': 'Devory', 'age': 5},
    {'type': 'jaguar', 'name': 'Moana', 'age': 7},
    {'type': 'bear', 'name': 'Stephan', 'age': 9},
]

You can easily sort a list of dictionaries by passing a lambda as the key argument.
In the animal_list example, the lambda returns each element's age, so the list of dictionaries is sorted in ascending order by age.

Effective Use of Data Structures

Data structures are one of the most important concepts to know before going into an interview, and choosing the right data structure in an interview context will certainly affect your performance. Python's standard data structure implementations are incredibly powerful and provide a lot of functionality by default, which is helpful in coding interviews.

Storing Values with Sets

Use sets instead of lists whenever you want to remove duplicate elements from an existing dataset.

Consider a function random_word that always returns a random word from a collection of words:

>>> import random
>>> words = "all the words in the world".split()
>>> def random_word():
...     return random.choice(words)
...

Suppose you need to call random_word repeatedly to get 1000 random selections and then return a data structure containing every unique word.

Let us look at three approaches to this: two suboptimal approaches and one good approach.

Bad Approach

An example that stores the values in a list and then converts it into a set:

>>> def unique_words():
...     words = []
...     for _ in range(1000):
...         words.append(random_word())
...     return set(words)
...
>>> unique_words()
{'world', 'all', 'the', 'in', 'words'}

In this example, creating a list and then converting it into a set is unnecessary work. Interviewers generally notice this kind of design and ask about it.

Worse Approach

You can store values into a list to avoid the conversion from list to a set.
You can then check for uniqueness by comparing each new value with all the elements currently in the list:

>>> def unique_words():
...     words = []
...     for _ in range(1000):
...         word = random_word()
...         if word not in words:
...             words.append(word)
...     return words
...
>>> unique_words()
['the', 'world', 'all', 'in', 'words']

This approach is much worse than the previous one, since you have to compare every new word against every word already present in the list. In simple terms, the time complexity is much greater in this case than in the earlier example.

Good Approach

In this example, you skip lists altogether and use a set from the beginning:

>>> def unique_words():
...     words = set()
...     for _ in range(1000):
...         words.add(random_word())
...     return words
...
>>> unique_words()
{'world', 'all', 'the', 'in', 'words'}

This approach differs from the second one because a set allows near-constant-time checks of whether a value is already present, whereas the list required linear-time lookups. The time complexity of this approach is O(N), which is much better than the second approach, whose time complexity grows at the rate of O(N²).

Saving Memory with Generators

Though list comprehensions are convenient tools, they can lead to excessive memory use.

Consider a situation where you need to find the sum of the first 1000 squares, starting with 1, using a list comprehension:

>>> sum([z * z for z in range(1, 1001)])
333833500

Your solution returns the correct answer by building a list of every perfect square and then summing the values. Now suppose the interviewer asks you to increase the number of perfect squares. Initially, your program might work well, but it will gradually slow down as the list grows, and eventually it may run out of memory.
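To make the cost of the list concrete, here is a rough sketch that measures how the list object grows with the number of squares. The helper name squares_list and the two sizes are assumptions chosen for illustration; the exact byte counts depend on the Python build, so none are quoted here.

```python
import sys


def squares_list(n):
    """Build the full list of the first n squares, as the list comprehension does."""
    return [z * z for z in range(1, n + 1)]


small = squares_list(1_000)
large = squares_list(100_000)

# getsizeof reports only the list object itself (its array of references),
# not the integers it points to, so the real footprint is larger still.
print(sys.getsizeof(small), sys.getsizeof(large))
print(sum(small))
```

Every element has to exist in memory before sum() sees the first one, which is why the footprint scales with the size of the input.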
However, you can resolve this memory issue just by replacing the brackets with parentheses:

>>> sum((z * z for z in range(1, 1001)))
333833500

When you change the brackets to parentheses, the list comprehension becomes a generator expression, which returns a generator object. The object calculates the next value only when asked. Generators are mainly used on massive sequences of data, and in situations where you want to retrieve data from a sequence but don't want to hold all of it in memory at the same time.

Defining Default Values in Dictionaries with .get() and .setdefault()

Adding, modifying or retrieving an item from a dictionary is one of the most basic programming tasks, and Python makes it easy. However, developers often check explicitly for values even when it is not necessary.

Consider a situation where a dictionary named shepherd exists and you want to get that shepherd's name by explicitly checking for the key with a conditional:

>>> shepherd = {'age': 20, 'sheep': 'yorkie', 'size_of_hat': 'large'}
>>> if 'name' in shepherd:
...     name = shepherd['name']
... else:
...     name = 'The Man with No Name'
...
>>> name
'The Man with No Name'

In this example, the key name is looked up in the dictionary and its value is returned if it exists; otherwise a default value is used.

Instead of checking keys explicitly, you can use .get() in a single line:

>>> name = shepherd.get('name', 'The Man with No Name')

.get() performs the same operation as the first approach, but the check is now handled automatically. However, .get() does not help in situations where you need to update the dictionary with a default value while still accessing the same key.
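A quick check makes this limitation visible. The sketch below reuses the shepherd dictionary from the example and simply inspects the dictionary after the call.

```python
shepherd = {"age": 20, "sheep": "yorkie", "size_of_hat": "large"}

# .get() returns the fallback value but does not store it in the dictionary.
name = shepherd.get("name", "The Man with No Name")

print(name)
print("name" in shepherd)
```

The key is still missing afterwards, so a later lookup of shepherd['name'] would raise a KeyError.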
In such a case, you again need to use explicit checking:

>>> if 'name' not in shepherd:
...     shepherd['name'] = 'The Man with No Name'
...
>>> name = shepherd['name']

However, Python offers a more elegant way of doing this with .setdefault():

>>> name = shepherd.setdefault('name', 'The Man with No Name')

.setdefault() performs the same operation as the previous snippet. If name exists in shepherd, it returns the existing value; otherwise it sets shepherd['name'] to The Man with No Name and returns that new value.

Taking Advantage of the Python Standard Library

Python is powerful on its own, and everything in its standard library can be accessed just by using the import statement. Knowing how to make good use of the standard library will boost your coding interview skills.

How to Handle Missing Dictionary Keys?

You can use .get() and .setdefault() when you want to set a default for a single key. However, there will be situations where you need a default value for all possible unset keys, especially in the context of a coding interview.

Suppose you have a group of students and your task is to keep track of their grades on assignments. Each input value is a tuple of student_name and grade. You want to look up all the grades for a single student without iterating over the whole list. An example that stores the grade data using a dictionary:

>>> student_grades = {}
>>> grades = [
...     ('alex', 89),
...     ('bob', 95),
...     ('charles', 81),
...     ('alex', 94),
... ]
>>> for name, grade in grades:
...     if name not in student_grades:
...         student_grades[name] = []
...     student_grades[name].append(grade)
...
>>> student_grades
{'alex': [89, 94], 'bob': [95], 'charles': [81]}

In the example above, you iterate over the list and check whether each name is already present in the dictionary.
If it isn't, you add the name to the dictionary with an empty list, and then you append the actual grade to that student's list of grades.

The previous approach works, but there is a cleaner approach for such cases using defaultdict:

>>> from collections import defaultdict
>>> student_grades = defaultdict(list)
>>> for name, grade in grades:
...     student_grades[name].append(grade)
...

In this approach, a defaultdict is created that uses list() with no arguments as its default factory. list() returns an empty list, so defaultdict calls list() if the name does not exist yet and then appends the grade.

With defaultdict, you handle the default value for every key at once and need not worry about defaults at the level of individual keys. It also produces much cleaner application code.

How to Count Hashable Objects?

Suppose you have a long string of words with no punctuation and you are asked to count the number of times each word appears. In this case, you can use collections.Counter, which uses 0 as the default value for any missing element and makes it easier and cleaner to count the occurrences of different objects:

>>> from collections import Counter
>>> words = "if I am there but if \
... he was not there then I was not".split()
>>> counts = Counter(words)
>>> counts
Counter({'if': 2, 'I': 2, 'there': 2, 'was': 2, 'not': 2, 'am': 1, 'but': 1, 'he': 1, 'then': 1})

When the list is passed to Counter, it stores each word together with the number of occurrences of that word in the list.

If you want to know the two most common words in a list of strings like the one above, you can use .most_common(), which simply returns the n most frequent inputs by count:

>>> counts.most_common(2)
[('if', 2), ('I', 2)]

How to Access Common String Groups?

If you want to check whether ‘A’ > ‘a’ or not, you have to do it using the ASCII chart. The answer will be false, since the ASCII value of A is 65 while that of a is 97, which is clearly greater.
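The comparison can be verified directly with the built-in ord(), which returns the code point Python uses when ordering single characters. This is only a small illustrative check.

```python
# Print the code points at both ends of the uppercase and lowercase ranges.
for ch in ("A", "Z", "a", "z"):
    print(ch, ord(ch))

# Every uppercase ASCII letter sorts before every lowercase one.
print("A" > "a")
```

Because all uppercase letters have smaller code points than all lowercase letters, the last line prints False.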
However, remembering the ASCII codes for lowercase and uppercase characters is difficult, and this method is a bit clumsy. You can instead use the much more convenient constants that are part of the string module. An example that checks whether all the characters in a string are uppercase:

>>> import string
>>> def check_if_upper(word):
...     for letter in word:
...         if letter not in string.ascii_uppercase:
...             return False
...     return True
...
>>> check_if_upper('Thanks Alex')
False
>>> check_if_upper('ROFL')
True

The function check_if_upper iterates over the letters in word and checks whether each letter is part of string.ascii_uppercase, which is set to the literal ‘ABCDEFGHIJKLMNOPQRSTUVWXYZ’.

There are a number of string constants that are frequently used for referencing string values and that are easy to read and use. Some of them are as follows:

string.ascii_letters
string.ascii_uppercase
string.ascii_lowercase
string.digits
string.hexdigits
string.octdigits
string.punctuation
string.printable
string.whitespace

Conclusion

Clearing an interview with confidence and panache is a skill. You might be a good programmer, but that is only a small part of the picture. You might fail to clear a few interviews, but following a good process will certainly help you in the long run. Being enthusiastic is an important factor that will have a huge impact on your interview results. In addition to that, there is practice, and practice always helps. Brush up on all the common interview concepts and then move on to practicing different interview questions. Interviewers also tend to help during interviews if you communicate and interact well.
Ask questions, and always talk through both a brute-force and an optimized solution.

Let us now sum up what we have learned in this article:

To use enumerate() to iterate over both indices and values.
To debug problematic code with breakpoint().
To format strings effectively with f-strings.
To sort lists with custom arguments.
To use generators instead of list comprehensions to save memory.
To define default values when looking up dictionary keys.
To count hashable objects with the collections.Counter class.

Hopefully you have learned about some of Python's most powerful built-in functions, data structures, and standard library packages, which will help you write better, faster and cleaner code. There is still a lot more to learn about the language; join our Python certification course to gain more skills and knowledge.
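Two of the items in the summary, dictionary defaults and counting, can be recapped in one short sketch. It reuses the sample grades and sentence from the article; the variable names are only illustrative.

```python
from collections import Counter, defaultdict

grades = [("alex", 89), ("bob", 95), ("charles", 81), ("alex", 94)]

# defaultdict(list) creates an empty list the first time a name is seen.
student_grades = defaultdict(list)
for name, grade in grades:
    student_grades[name].append(grade)

words = "if I am there but if he was not there then I was not".split()

# Counter tallies each word and reports 0 for words it has never seen.
counts = Counter(words)

print(dict(student_grades))
print(counts["was"], counts["missing"])
print(counts.most_common(2))
```

Neither structure needs an explicit membership check before the first write or read, which is what keeps the loops so short.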

How to Install Docker on Windows, Mac, & Linux: A Step-By-Step Guide

Docker is intended to benefit both developers and system administrators, which makes it a component of a number of DevOps (developers + operations) toolchains. Developers can concentrate on writing code without worrying about the system it will eventually run on. They can also take advantage of one of the thousands of programs already designed to run in a Docker container as part of their application. For the operations team, Docker offers flexibility and, thanks to its smaller footprint and lower overhead, potentially decreases the number of systems required.

Let's now dive into the installation steps for Docker on different platforms.

Install Docker on Windows

The community version of Docker for Microsoft Windows is Docker Desktop for Windows. Download it from Docker Hub.

System Requirements

The software and hardware requirements needed to run Client Hyper-V on Windows 10 effectively are:

Software Requirements:
Windows 10 64-bit: Pro, Enterprise or Education
The Hyper-V and Containers Windows features must be enabled

Hardware Requirements:
64-bit processor with Second Level Address Translation (SLAT)
Hardware-level virtualization support for Client Hyper-V must be enabled in the BIOS settings
Minimum 4 GB RAM

Microsoft Hyper-V is required to run Docker Desktop. The Docker Desktop Windows installer enables Hyper-V and restarts your computer if needed. Once Hyper-V is enabled, VirtualBox no longer works; however, all VirtualBox VM images are kept. Docker VMs created with VirtualBox (including the default one generated during the installation of the Toolbox) no longer start, and Docker Desktop cannot use these VMs side by side. You can still use Docker to manage remote VMs.

What is Included in the Installation?

The installation of Docker Desktop consists of Docker Engine, Docker CLI, Docker Compose, Docker Machine, and Kitematic.
Docker Desktop containers and images are shared among all user accounts on the machine where they are installed, because all Windows accounts build and run containers using the same VM. Nested virtualization scenarios, such as running Docker Desktop on a VMware or Parallels instance, might work. See Running Docker Desktop in nested virtualization scenarios for more information.

Installation Steps

To install Docker Desktop on Windows, double-click Docker Desktop Installer.exe to run the installer. If you have not already downloaded the installer (Docker Desktop Installer.exe), you can get it from Docker Hub. It typically downloads to your Downloads folder, or you can run it from the recent downloads bar at the bottom of your web browser.

Follow the installation wizard's instructions to accept the license, authorize the installer, and proceed with the installation. If prompted, authorize the Docker Desktop Installer with your system password during the installation. Privileged access is needed to install the networking components and the links to the Docker apps, and to manage the Hyper-V VMs.

Click Finish in the setup window and launch the Docker Desktop application.

Start Docker Desktop

After installation, Docker Desktop does not start automatically. Search for Docker and select Docker Desktop in the search results.

When the whale icon in the status bar stays steady, Docker Desktop is up and running and can be accessed from any terminal window.

After the Docker Desktop app is installed, you also get a pop-up message with the next steps, as well as a link to the documentation.

When initialization is complete, click the whale icon in the Notifications area and select About Docker to check that you have the latest version.

Install Docker on Mac

The very first step is to download Docker Desktop for Mac.
Get the download link: Download from Docker Hub

System Requirements

Docker Desktop for Mac starts only when all of these requirements are met:

Mac hardware must be a 2010 or newer model, with Intel's hardware support for memory management unit (MMU) virtualization, including Extended Page Tables (EPT) and Unrestricted Mode. You can check whether your machine has this support by running the following command in a terminal: sysctl kern.hv_support
macOS Sierra 10.12 and newer versions of macOS are supported. Upgrading to the newest version of macOS is recommended.
VirtualBox prior to version 4.3.30 (which is incompatible with Docker Desktop for Mac) must not be installed. It's alright if you have a newer VirtualBox version installed.

Installation Steps

Double-click Docker.dmg to open the installer, then drag Moby the whale to the Applications folder.

In the Applications folder, double-click Docker.app to launch Docker. In the example below, the Applications folder is shown in grid view mode.

After starting Docker.app, you are asked to authorize it with your system password. Privileged access is required to install the networking components and the links to the Docker apps.

The whale in the top status bar indicates that Docker is running and accessible from a terminal.

If you have just installed the app, you will also get a success message with the next steps and a link to the documentation. To dismiss this pop-up, click the whale in the status bar.

To get to Preferences and other options, click the whale (the whale menu).

To check that you have the latest version, select About Docker.

Notes:

Getting started provides an overview of Docker Desktop for Mac, basic Docker command examples, how to get help or give feedback, and links to all topics in the Docker Desktop for Mac guide.
Troubleshooting describes common problems, workarounds, how to run and submit diagnostics, and how to submit issues.

Install Docker on Linux

Let's use an Ubuntu instance as the example for installing Docker.
If you don't already have one, you can use Oracle VirtualBox to set up a virtual Linux instance. A simple Ubuntu server installed on Oracle VirtualBox is shown in the following screenshot. An OS user named demo has been defined with full root access to the system.

Step 1: Before installing Docker, we must first make sure that the correct version of the Linux kernel is running. Docker is only designed to run on Linux kernel version 3.8 or higher. We can check this with the command below.

uname: This command returns the system information for the Linux system, including the kernel name, kernel release, and kernel version.

uname -a

a: Ensures that all of the system information is returned.

Step 2: Update the OS with the latest packages, which are downloaded from the internet onto the Linux system, via the following command.

apt-get options:
sudo: The sudo command makes sure the command runs with root access.
update: The update option refreshes the package lists on the Linux system so that the latest packages are available.

sudo apt-get update

Step 3: The next step is to install the certificates that are needed later to work with the Docker site and download the required Docker packages. The following command can be used.

sudo apt-get install apt-transport-https ca-certificates

Step 4: The next step is to add the new GPG key. This key ensures that all data is encrypted when the required packages for Docker are downloaded.

sudo apt-key adv --keyserver hkp://ha.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D

This command downloads the key with the ID 58118E89F3A912897C070ADBF76221572C52609D from the keyserver hkp://ha.pool.sks-keyservers.net:80 and adds it to the adv keychain.
Please note that this specific key is needed to download the necessary Docker packages.

Step 5: Next, depending on the version of Ubuntu you have, you need to add the appropriate site to the docker.list file of the apt package manager, so that it can detect and download the Docker packages from the Docker site.

Precise 12.04 (LTS): deb https://apt.dockerproject.org/repo ubuntu-precise main
Trusty 14.04 (LTS): deb https://apt.dockerproject.org/repo ubuntu-trusty main
Wily 15.10: deb https://apt.dockerproject.org/repo ubuntu-wily main
Xenial 16.04 (LTS): deb https://apt.dockerproject.org/repo ubuntu-xenial main

echo "deb https://apt.dockerproject.org/repo ubuntu-trusty main" | sudo tee /etc/apt/sources.list.d/docker.list

Step 6: The next step is to update the package lists on the Ubuntu system with the apt-get update command.

sudo apt-get update

Step 7: To make sure that the package manager points to the correct repository, issue the apt-cache command.

apt-cache policy docker-engine

Step 8: Run the apt-get update command again to make sure that all packages on the local system are up to date.

Step 9: For Ubuntu Trusty, Wily and Xenial, the linux-image-extra-* kernel packages are required; they allow the aufs storage driver to be used. This driver is used by the newer versions of Docker.

The following command can be used:

sudo apt-get install linux-image-extra-$(uname -r) linux-image-extra-virtual

Step 10: Installing Docker is the final step, and this can be done with the following command:

sudo apt-get install -y docker-engine

Here, apt-get uses the install option to download and install the docker-engine package from the Docker site. docker-engine is the official package from Docker, Inc. for Ubuntu-based systems.

The running Docker version can be checked with the command below:

docker version
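The kernel check from Step 1 can also be scripted. The sketch below is one possible way to do it in Python and is not part of the official Docker instructions; the helper names parse_kernel_release and kernel_is_supported are hypothetical, and the 3.8 minimum is taken from Step 1 above.

```python
import platform
import re

MIN_KERNEL = (3, 8)  # minimum Linux kernel version quoted in Step 1


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '4.15.0-112-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_is_supported(release, minimum=MIN_KERNEL):
    """Return True when the release meets the minimum (major, minor) version."""
    return parse_kernel_release(release) >= minimum


# platform.release() returns the same value that `uname -r` prints on Linux;
# the check is skipped on other operating systems, where it is not meaningful.
if platform.system() == "Linux":
    release = platform.release()
    print(release, kernel_is_supported(release))
```

Comparing (major, minor) tuples avoids the trap of comparing version strings directly, where '3.10' would sort before '3.8'.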

11 Top Features of Docker That You Must Know

Docker is an open platform for developing, shipping and running applications in containers on a common operating system. It enables you to separate applications from infrastructure so that software can be delivered quickly. With Docker, infrastructure can be managed in the same way as applications. Docker's methodologies for shipping, testing, and deploying code quickly can significantly reduce the delay between writing code and running it in production.

Features of Docker:

Docker provides various features, some of which are listed and discussed below.

Faster and easier configuration
Application isolation
Increase in productivity
Swarm
Services
Routing Mesh
Security Management
Rapid scaling of Systems
Better Software Delivery
Software-defined networking
Has the Ability to Reduce the Size

1. Faster and Easier configuration: This is one of the key features of Docker, and it helps you configure the system faster and more easily. Due to this feature, code can be deployed in less time and with less effort. Because Docker can be used in a wide variety of environments, the infrastructure is not tied to the environment of the application.

2. Application isolation: Docker provides containers that are used to run applications in an isolated environment. Since each container is independent, Docker can execute any kind of application.

3. Increase in productivity: Docker helps increase productivity by easing technical configuration and deploying applications rapidly. Moreover, it not only provides an isolated environment in which to execute applications, but also reduces the resources needed.

4. Swarm: Swarm is a clustering and scheduling tool for Docker containers. At the front end, it uses the Docker API, which lets us use various tools to control it. It is a self-organizing group of engines that enables pluggable backends.

5. Services: A service is a list of tasks that specifies the state of a container inside a cluster.
Each task in the service represents one instance of a container that should be running, while Swarm schedules the tasks across the nodes.

7. Security Management: Docker lets you store secrets in the swarm and choose which services get access to which secrets. It also adds a few important commands to the engine, such as secret inspect and secret create.

8. Rapid scaling of Systems: Containers require less computing hardware and get more work done. They allow data centre operators to pack more workload onto less hardware; sharing hardware in this way results in lower costs.

9. Better Software Delivery: Software delivery with the help of containers is said to be more efficient. Containers are portable and self-contained, and include an isolated disk volume. This isolated volume goes along with the container as it is developed and deployed to various environments.

10. Software-defined networking: Docker supports software-defined networking. Without touching a single router, operators can use the Docker CLI and Engine to define isolated networks for containers. Operators and developers can design systems with complex network topologies and define the networks in configuration files. Since the application's containers can run in an isolated virtual network with controlled ingress and egress paths, this acts as a security benefit as well.

11. Has the Ability to Reduce the Size: Since containers provide a smaller OS footprint, Docker is able to reduce the size of the development environment.

Who is Docker for?

Docker as a tool benefits both developers and system administrators, and hence is part of various DevOps (Developers + Operations) toolchains. It helps developers focus on writing code without worrying about the system it will run on. Moreover, they can use one of the thousands of programs that are already designed to run in a Docker container as part of their applications and get a head start.
As for Operations, Docker provides flexibility and reduces the number of systems needed, thanks to its lower overhead and small footprint.

To Sum Up…

We have discussed the top 11 Docker features that help it stand out from the crowd and give it its huge popularity. Docker has revolutionized development in the software industry, creating vast economies of scale. Hence, containers and Docker hold the potential to open up new opportunities for your enterprise.

8 Key Challenges Of Implementing DevOps And Overcoming Them

As more companies adopt DevOps to improve their workflow and productivity, concerns about its implementation keep recurring. Answers to questions such as ‘Where and how do I start with my DevOps adoption?’, ‘What are the challenges that I might face?’ and ‘How do I go about resolving those challenges?’ are very commonly sought. Making such a revolutionary change from the traditional Waterfall approach to DevOps is not an easy process. The following are some of the major challenges that organisations face while implementing DevOps.

Change in Culture: Workplace culture undergoes the greatest transformation while implementing DevOps. It is also one of the most difficult areas to transform, as it is a long-term process that requires a lot of patience and endurance. To make the process a bit easier, enterprises should try to maintain a positive and transparent atmosphere in the workplace.

Switching from Legacy Infrastructure to Microservices: In order to reduce stability issues, organisations now use infrastructure as code along with microservices for quicker development and faster innovation. Moreover, organisations need to update their hardware and software systems regularly in line with the latest trends, so that new systems can co-exist with the existing ones.

Issues with the standards and metrics: Dev and Ops departments have different goals and working systems, and hence different toolsets as well. Sitting together and integrating the tools can become very tedious. Under such circumstances, it is advisable for the teams to agree on a common set of metrics.

Tool Turbulence: Switching to DevOps practices might make people dependent on the various tools that are available to solve even the smallest of their problems.
Due to this, organisations might become addicted to tools that provide short-term benefits over those that provide long-term benefits. Some of the tools are open source or SaaS-based and can easily be adopted without any authorization. To make things easier, you can provide teams with a library of tools from which they can choose their preferred ones. This will also help leaders stay up to date with the activities of the employees.

Resistance to Change: You might come across people in your company who are not supportive of moving away from the legacy systems. They are the ones who have become comfortable with their way of working and are not willing to leave their comfort zones. Hence, it is very important that you don't give in to such resistance but instead bear with the discomfort of change.

Challenges during the process: Adopting DevOps can prove challenging for workers who follow guidelines blindly and stick to the rules, or for companies that follow specific guidelines for software development, because DevOps doesn't have a fixed framework that states the procedures employees can follow to reach their desired goals. Teams can decide on their own course of action without a rigid structure, which gives them more opportunities and more scope for innovation.

Test Automation: Test automation is just as important as CI/CD deployments. It has been commonly observed that companies tend to neglect test automation and focus more on CI/CD deployments. For DevOps to be a success, continuous testing is key.

Cost and Budget: It is very important to keep in mind that open source does not necessarily mean free of cost. Moreover, factor integration and operational complexity into your overall costs.

In a Nutshell:

As Heraclitus, the Greek philosopher, said, change is the only constant. It might be hard in the beginning and messy during the process, but it is always glorious in the end.
As IT culture evolves, DevOps helps you bridge the boundaries between business, development and operations. Tackling these challenges at the root will make the transition smoother for you.