As a Data Science enthusiast, you might already know that a majority of business decisions these days are data-driven. However, it is essential to understand how to parse through all the data and the different types of big data. One of the most important types of data analysis in this field is Regression Analysis. Regression Analysis is a predictive modeling technique rooted in statistics. The term “regression” in this context was first coined by Sir Francis Galton, a cousin of Charles Darwin. The earliest form of regression, the method of least squares, was developed by Adrien-Marie Legendre and Carl Friedrich Gauss. Before getting into the what and how of Regression in Data Science, let us first understand why regression analysis is essential.

## Why is Regression Analysis important?

Regression Analysis is a statistical technique for evaluating the relationship between two or more variables.

Regression Analysis helps enterprises understand what their data points represent and use them wisely, in coordination with other business analytics techniques, to make better decisions.

Regression Analysis helps an individual to understand how the typical value of the dependent variable changes when one of the independent variables is varied, while the other independent variables remain unchanged. Therefore, this powerful statistical tool is used by Business Analysts and other data professionals for removing unwanted variables and choosing only the important ones.

The benefit of regression analysis is that it lets businesses crunch their data to make better decisions. A greater understanding of the variables can impact the success of a business in the coming weeks, months, and years.

### Regression in Data Science and Data Analytics

The regression method of forecasting, as the name implies, is used for forecasting and for finding the causal relationship between variables. From a business point of view, the regression method of forecasting can be helpful for an individual working with data in the following ways:

- Predicting sales in the near and long term.
- Understanding demand and supply.
- Understanding inventory levels.
- Reviewing and understanding how variables impact all these factors.

Businesses can also use regression methods to answer questions such as:

- Why did customer service calls drop in the past months?
- What will sales look like in the next six months?
- Which marketing promotion method should we choose?
- Should we expand the business or create and market a new product?

The ultimate benefit of regression analysis is to determine which independent variables have the most effect on a dependent variable. It also helps to determine which factors can be ignored and those that should be emphasized.

Let us now understand what regression analysis is and its associated variables.


## What is regression analysis?

According to the renowned American mathematician John Tukey, “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” This is precisely what regression analysis strives to achieve.

Regression analysis is a set of statistical processes that investigate the relationship between a dependent (or target) variable and one or more independent (or predictor) variables. It helps assess the strength of the relationship between the variables and can also model the future relationship between them.

Regression analysis is widely used for prediction and forecasting, where it overlaps with Machine Learning. It is also used for time series modeling and for finding cause-and-effect relationships between variables. For example, the relationship between reckless driving and the number of road accidents by a driver can be analyzed using regression.

Let us now understand regression analysis with examples.

### Meaning of Regression

Let us understand the concept of regression with an example.

Consider a situation where you conduct a case study on several college students. We want to find out whether students with a high CGPA also get a high GRE score.

Our first job is to collect the details of the GRE scores and CGPAs of all the students of a college in a tabular form. The GRE scores and the CGPAs are listed in the 1st and 2nd columns, respectively.

To understand the relationship between CGPA and GRE score, we need to draw a scatter plot.

Here, we can see a linear relationship between CGPA and GRE score in the scatter plot. This indicates that if the CGPA increases, the GRE scores also increase. Thus, it would also mean that a student with a high CGPA is likely to have a greater chance of getting a high GRE score.

However, suppose a question arises like “If the CGPA of a student is 8.51, what will be the GRE score of the student?” To answer it, we need to find the relationship between these two variables. This is where Regression plays its role.

In a regression algorithm, we usually have one dependent variable and one or more independent variables, and we try to regress the dependent variable "Y" (in this case, GRE score) on the independent variable "X" (in this case, CGPA). In layman's terms, we are trying to understand how the value of "Y" changes with respect to a change in "X".
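To make the CGPA question concrete, here is a minimal sketch in Python. The data points are hypothetical and made up purely for illustration, and `fit_line` is an illustrative helper name, not something from the case study itself. It fits a least-squares line to the pairs and uses it to estimate the GRE score for a CGPA of 8.51.

```python
# Hypothetical (CGPA, GRE score) pairs; made up purely for illustration.
cgpa = [6.8, 7.2, 7.9, 8.1, 8.6, 9.0, 9.4]
gre = [298, 303, 310, 314, 319, 325, 331]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(cgpa, gre)
predicted_gre = intercept + slope * 8.51
print(f"Predicted GRE score for CGPA 8.51: {predicted_gre:.1f}")
```

With real data, the prediction is only as trustworthy as the linear trend in the scatter plot; the sketch simply shows how "Y" is estimated from "X" once the line is known.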

Let us now understand the concept of dependent and independent variables.

### Dependent and Independent variables

In data science, variables refer to the properties or characteristics of certain events or objects.

There are mainly two types of variables in regression analysis, which are as follows:

- Independent variables – These variables are manipulated or altered by researchers, and their effects are later measured and compared. They are also referred to as predictor variables because they are used to predict or forecast the values of dependent variables in a regression model.
- Dependent variables – These variables measure the effect of the independent variables on the testing units. As the name suggests, their values depend on the independent variables. They are also referred to as predicted variables because their values are predicted from the independent or predictor variables.

When an individual is looking for a relationship between two variables, they are trying to determine what factors make the dependent variable change. For example, consider a scenario where a student's score is a dependent variable. It could depend on many independent factors like the amount of study the student did, how much sleep they had the night before the test, or even how hungry they were during the test.

In data models, independent variables can have different names such as “regressors”, “explanatory variable”, “input variable”, “controlled variable”, etc. On the other hand, dependent variables are called “regressand,” “response variable”, “measured variable,” “observed variable,” “responding variable,” “explained variable,” “outcome variable,” “experimental variable,” or “output variable.”

Below are a few examples to understand the usage and significance of dependent and independent variables in a wider sense:

- Suppose you want to estimate the cost of living of a person using a regression model. In that case, you need to take independent variables as factors such as salary, age, marital status, etc. The cost of living of a person is highly dependent on these factors. Thus, it is designated as the dependent variable.
- Another scenario is in the case of a student's poor performance in an examination. The independent variable could be factors, for example, poor memory, inattentiveness in class, irregular attendance, etc. Since these factors will affect the student's score, the dependent variable, in this case, is the student's score.
- Suppose you want to measure the effect of different quantities of nutrient intake on the growth of a newborn child. In that case, you need to consider the amount of nutrient intake as the independent variable. In contrast, the dependent variable will be the growth of the child, which can be calculated by factors such as height, weight, etc.

Before turning to the regression line, let us first see how regression differs from classification.

## Difference between Regression and Classification

Regression and Classification both come under supervised learning methods, which means that they use labelled training datasets to train their models and make future predictions. Thus, these two methods are often grouped under the same category in machine learning.

However, the key difference between them is the output variable. In regression, the output tends to be numerical or continuous, whereas, in classification, the output is categorical or discrete in nature.

Regression and Classification also use different ways to evaluate predictions:

- Regression predictions can be evaluated using root mean squared error, whereas classification predictions cannot.
- Classification predictions can be evaluated using accuracy, whereas regression predictions cannot.

Finally, some algorithms, such as decision trees and neural networks, can be used for both regression and classification with small alterations. Other algorithms are tied to a single problem type, for example, linear regression for regression predictive modeling and logistic regression for classification predictive modeling.
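The two evaluation measures mentioned above can be sketched in a few lines of Python. The prices and labels below are hypothetical values chosen only to illustrate the calculation, and the helper names `rmse` and `accuracy` are illustrative.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error for numeric (regression) predictions."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def accuracy(actual, predicted):
    """Fraction of categorical (classification) predictions that match."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical regression output: actual vs. predicted prices (in $1000s).
prices_actual = [200.0, 310.0, 150.0, 420.0]
prices_predicted = [210.0, 300.0, 160.0, 400.0]

# Hypothetical classification output: actual vs. predicted labels.
labels_actual = ["spam", "ham", "ham", "spam", "ham"]
labels_predicted = ["spam", "ham", "spam", "spam", "ham"]

print(f"RMSE: {rmse(prices_actual, prices_predicted):.2f}")
print(f"Accuracy: {accuracy(labels_actual, labels_predicted):.2f}")
```

RMSE is expressed in the same units as the predicted quantity, while accuracy is a proportion between 0 and 1, which is why each measure only makes sense for its own problem type.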

## What is a Regression Line?

In the field of statistics, a regression line is a line that best describes the behaviour of a dataset, such that the overall distance from the line to the points (variable values) plotted on a graph is the smallest. In layman's words, it is a line that best fits the trend of a given set of data.

Regression lines are mainly used for forecasting. The significance of the line is that it describes the interrelation of a dependent variable “Y” with one or more independent variables “X”. The line is chosen so that it minimizes the squared deviations of the predictions from the observed values.

If we take two variables, X and Y, there will be two regression lines:

- Regression line of Y on X: This gives the most probable Y values from the given values of X.
- Regression line of X on Y: This gives the most probable values of X from the given values of Y.

The correlation between the variables X and Y is reflected in the angle between the two regression lines. The degree of correlation is higher if the regression lines are nearer to each other. In contrast, the degree of correlation is lower if the regression lines are farther from each other.

If the two regression lines coincide, i.e. only a single line exists, the correlation is either perfectly positive or perfectly negative. However, if the variables are independent, then the correlation is zero, and the lines of regression will be at right angles.
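The link between the two regression lines and the correlation can be checked numerically. The sketch below uses hypothetical data and computes the slope of Y on X, the slope of X on Y, and the Pearson correlation coefficient r. The product of the two slopes equals r squared, so the lines coincide when r is +1 or -1, and both slopes are zero (one line horizontal, the other vertical) when r is 0.

```python
import math

# Hypothetical paired observations, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b_yx = sxy / sxx  # slope of the regression line of Y on X
b_xy = sxy / syy  # slope of the regression line of X on Y
r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient

# The product of the two slopes equals r squared.
print(f"b_yx * b_xy = {b_yx * b_xy:.4f}, r^2 = {r * r:.4f}")
```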

Regression lines are widely used in the financial sector and business procedures. Financial Analysts use linear regression techniques to predict prices of stocks and commodities and to perform valuations, whereas businesses employ regressions for forecasting sales, inventories, and many other variables essential for business strategy and planning.

## What is the Regression Equation?

In statistics, the Regression Equation is the algebraic expression of the regression lines. In simple terms, it is used to predict the values of the dependent variables from the given values of independent variables.

Consider the two regression lines, Y on X and X on Y. There is one regression equation for each regression line:

### Regression Equation of Y on X:

This equation depicts the variations in the dependent variable Y from the given changes in the independent variable X. The expression is as follows:

$$Y_e = a + bX$$

Where,

- $Y_e$ is the estimated value of the dependent variable,
- X is the independent variable,
- a and b are the two unknown constants that determine the position of the line.

The parameter “a” is the intercept: it indicates the distance of the line above or below the origin, i.e. the level of the fitted line. The parameter "b" is the slope: it indicates the change in the value of the dependent variable Y for one unit of change in the independent variable X.

The parameters "a" and "b" can be calculated using the least squares method. According to this method, the line is drawn through the plotted points in such a way that the sum of the squares of the vertical deviations of the observed Y from the calculated values of Y is the least. In other words, the best-fitted line is obtained when $\sum (Y - Y_e)^2$ is at its minimum.

To calculate the values of parameters “a” and “b”, we need to simultaneously solve the following two normal equations, where N is the number of observations:

$$\sum Y = Na + b \sum X$$

$$\sum XY = a \sum X + b \sum X^2$$
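As a rough illustration, the two normal equations for Y on X can be solved directly in Python. The function name `solve_normal_equations` and the data are hypothetical. The data lies exactly on Y = 1 + 2X, so the fit should recover a = 1 and b = 2.

```python
def solve_normal_equations(x, y):
    """Solve  sum(Y) = N*a + b*sum(X)  and  sum(XY) = a*sum(X) + b*sum(X^2)."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)

    # Eliminating a from the two equations gives b; a then follows
    # from the first equation.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Hypothetical data lying exactly on Y = 1 + 2X.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
a, b = solve_normal_equations(x, y)
print(f"a = {a}, b = {b}")  # prints "a = 1.0, b = 2.0"
```

The same function gives the regression equation of X on Y if the two arguments are swapped.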

### Regression Equation of X on Y:

This equation depicts the variations in the dependent variable X from the given changes in the independent variable Y. The expression is as follows:

$$X_e = a + bY$$

Where,

- $X_e$ is the estimated value of the dependent variable,
- Y is the independent variable,
- a and b are the two unknown constants that determine the position of the line.

Again, in this equation, the parameter “a” indicates the distance of the line above or below the origin, i.e. the level of the fitted line, whereas the parameter "b" indicates the slope, i.e. the change in the value of the dependent variable X for a unit of change in the independent variable Y.

To calculate the values of parameters “a” and “b” in this equation, we need to simultaneously solve the following two normal equations:

$$\sum X = Na + b \sum Y$$

$$\sum XY = a \sum Y + b \sum Y^2$$

Please note that the regression lines can be completely determined only if we obtain the constant values “a” and “b”.

## How does Linear Regression work?

Linear Regression is a Machine Learning algorithm that maps numeric inputs to numeric outputs by fitting a line to the data points. It is an approach to modeling the relationship between an output variable and one or more input variables, which allows the model to predict outputs for new inputs.

Let us understand the working of a Linear Regression model using an example.

Consider a scenario where a group of tech enthusiasts has created a start-up named Orange Inc. Orange has been booming since 2016. Meanwhile, you are a wealthy investor, and you want to know whether you should invest your money in Orange in the next year or not.

Let us assume that you do not want to risk a lot of money, so you buy a few shares. Firstly, you study the stock prices of Orange since 2016, and you see the following figure:

It is indicative that Orange is growing at an amazing rate: its stock price has gone from 100 dollars to 500 dollars in only three years. Since you want your investment to grow along with the company, you want to invest in Orange in the year 2021. You assume that the stock price will land somewhere above 500 dollars since the trend will likely not go through a sudden change.

Based on the information available on the stock prices of the last couple of years, you were able to predict what the stock price is going to be like in 2021.

You just built a model in your head to predict the value of Y for a value of X that is not in your data. However, this mental method is not accurate, because you were not able to specify exactly what the stock price will be in the year 2021. You just have an idea that it will probably be above 500 dollars.

This is where Regression plays its role. The task of Regression is to find the line that best fits the data points on the plot so that we can calculate where the stock price is likely to be in the year 2021.

Let us examine the Regression line (in red) and its significance. Extending the fitted line, we find that the stock price of Orange is likely to be a little higher than 600 dollars by the year 2021.

This example is quite oversimplified, so let us examine the process and how we got the red line on the next plot.

### Training the Regressor

The example mentioned above is an example of Univariate Linear Regression since we are trying to understand how one independent variable, X, relates to one dependent variable, Y.

Any regression line on a plot is based on the formula:

$$f(X) = MX + B$$

Where,

- M is the slope of the line,
- B is the y-intercept that allows the vertical movement of the line,
- And X is the function’s input variable.

In the field of Machine Learning, the formula is as follows:

$$h(X) = W_0 + W_1 X$$

Where,

- W0 and W1 are the weights,
- X is the input variable,
- h(X) is the label or the output variable.

Regression works by finding the weights W0 and W1 that lead to the best-fitting line for the input variable X. The best-fitted line is obtained in terms of the lowest cost.

Now, let us understand what cost means here.

### The cost function

Depending upon the Machine Learning application, the cost could take different forms. However, in a generalized view, cost mainly refers to the loss or error that the regression model yields in its distance from the original training dataset.

In a Regression model, the cost function is the Squared Error Cost:

$$J(W_0, W_1) = \frac{1}{2n} \sum_{i=1}^{n} \left(h(X_i) - T_i\right)^2$$

Where,

- J(W0, W1) is the total cost of the model with weights W0 and W1,
- h(Xi) is the model’s prediction of the dependent variable Y at feature X with index i,
- Ti is the actual y-value at index i,
- and n refers to the total number of data points in the dataset.

The cost function measures the distance between the y-value the model predicted and the actual y-value in the data set for each data point. It then squares these distances, sums them, and divides the result by the number of data points, giving the average cost. The 2 in the term ‘(1/2n)’ is merely there to make differentiating the cost function easier, since it cancels the 2 produced by differentiating the square.
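The Squared Error Cost can be written as a short Python function. This is a minimal sketch: the function name `squared_error_cost` and the training data are hypothetical, with the data lying exactly on T = 1 + 2X so that the results are easy to verify by hand.

```python
def squared_error_cost(w0, w1, xs, ts):
    """J(W0, W1) = (1 / 2n) * sum((h(Xi) - Ti)^2), with h(X) = W0 + W1 * X."""
    n = len(xs)
    total = sum((w0 + w1 * x - t) ** 2 for x, t in zip(xs, ts))
    return total / (2 * n)


# Hypothetical training data lying exactly on T = 1 + 2X.
xs = [1.0, 2.0, 3.0, 4.0]
ts = [3.0, 5.0, 7.0, 9.0]

print(squared_error_cost(1.0, 2.0, xs, ts))  # perfect fit: prints 0.0
print(squared_error_cost(0.0, 2.0, xs, ts))  # every prediction is 1 too low: prints 0.5
```

The lower the cost, the closer the line defined by W0 and W1 sits to the training data.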

### Training the dataset

Training a regression model uses a Learning Algorithm to find the weights W0 and W1 that will minimize the cost and plug them into the straight-line function to obtain the best-fitted line. The pseudo-code for the algorithm is as follows:

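    Repeat until convergence {
        temp0 := W0 - a * (d/dW0) J(W0, W1)
        temp1 := W1 - a * (d/dW1) J(W0, W1)
        W0 := temp0
        W1 := temp1
    }

Here, a is the learning rate that controls the size of each step, and (d/dW0) and (d/dW1) refer to the partial derivatives of J(W0, W1) with respect to W0 and W1, respectively. Both temporary values are computed before either weight is overwritten, so the two weights are updated simultaneously.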

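Working out the partial derivatives of the Squared Error Cost gives:

- $\frac{\partial J}{\partial W_0} = \frac{1}{n} \sum_{i=1}^{n} (W_0 + W_1 X_i - T_i)$
- $\frac{\partial J}{\partial W_1} = \frac{1}{n} \sum_{i=1}^{n} (W_0 + W_1 X_i - T_i) X_i$

For a single data point, these reduce to W0 + W1.X - T and (W0 + W1.X - T).X, respectively.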

Running the Gradient Descent learning algorithm until convergence results in a model with minimum cost. The weights that lead to the minimum cost are taken as the final values for the line function h(X) = W0 + W1X.
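The update loop above can be sketched in Python as follows. Several choices here are assumptions made for illustration: the learning rate of 0.05, the fixed count of 5000 iterations used in place of a convergence check, the starting weights of zero, and the hypothetical data lying exactly on T = 1 + 2X.

```python
def gradient_descent(xs, ts, learning_rate=0.05, iterations=5000):
    """Fit h(X) = W0 + W1 * X by batch gradient descent on the squared error cost."""
    n = len(xs)
    w0, w1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction error for every data point under the current weights.
        errors = [w0 + w1 * x - t for x, t in zip(xs, ts)]
        # Partial derivatives of J with respect to W0 and W1.
        grad_w0 = sum(errors) / n
        grad_w1 = sum(e * x for e, x in zip(errors, xs)) / n
        # Simultaneous update: both gradients were computed before
        # either weight changes.
        w0, w1 = w0 - learning_rate * grad_w0, w1 - learning_rate * grad_w1
    return w0, w1


# Hypothetical training data lying exactly on T = 1 + 2X.
xs = [1.0, 2.0, 3.0, 4.0]
ts = [3.0, 5.0, 7.0, 9.0]
w0, w1 = gradient_descent(xs, ts)
print(f"W0 = {w0:.3f}, W1 = {w1:.3f}")
```

On this data the weights settle very close to W0 = 1 and W1 = 2. A learning rate that is too large can make the updates diverge, so in practice it usually needs tuning for the scale of the data.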

### Goodness-of-Fit in a Regression Model

Linear regression finds the equation that minimizes the distance between the fitted line and all the data points. Determining how well the model fits the data is crucial in a linear model.

The general idea is that if the deviations between the observed values and the predicted values of the linear model are small and unbiased, the model fits the data well.

In technical terms, the “goodness-of-fit” of a statistical model describes how well it fits a set of observations, typically by summarizing the differences between the observed values and the values expected under the model. Such measures can be used in statistical hypothesis testing.
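One widely used goodness-of-fit measure for linear models is R squared (the coefficient of determination). The text above does not single out a specific measure, so treating R squared as the example here is an assumption. The sketch below uses hypothetical observed and predicted values, and `r_squared` is an illustrative helper name.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical observed values and two sets of predictions.
observed = [3.0, 5.0, 7.0, 9.0]
good_fit = [3.1, 4.9, 7.2, 8.8]  # small deviations from the observed values
poor_fit = [6.0, 6.0, 6.0, 6.0]  # always predicts the mean of the observations

print(f"{r_squared(observed, good_fit):.3f}")  # prints 0.995
print(f"{r_squared(observed, poor_fit):.3f}")  # prints 0.000
```

A value close to 1 means the predictions explain most of the variation in the observed values, while a value of 0 means the model does no better than always predicting the mean.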

## How do businesses use Regression Analysis?

Regression Analysis is a statistical technique used to evaluate the relationship between a dependent variable and one or more independent variables. Organizations use regression analysis to understand the significance of their data points and use analytical techniques to make better decisions.

Business Analysts and Data Professionals use this statistical tool to remove unwanted variables and select the significant ones. There are numerous ways that businesses use regression analysis. Let us discuss some of them below.

### 1. Decision-making

Businesses need to make better decisions to run smoothly and efficiently, and it is also necessary to understand the effects of the decision taken. They collect data on various factors such as sales, investments, expenditures, etc. and analyze them for further improvements.

Organizations use Regression Analysis to make sense of the data and gather meaningful insights. Business analysts and data professionals use this method to make strategic business decisions.

### 2. Optimization of business

The main role of regression analysis is to convert the collected data into actionable insights. Organizations are moving away from old-school techniques like guesswork and untested assumptions. They are now focusing on adopting data-driven decision-making techniques, which improve work performance in an organization.

This analysis helps management make practical and smart decisions. A huge volume of data can be interpreted and understood to gain useful insights.

### 3. Predictive Analysis

Businesses make use of regression analysis to find patterns and trends. Business Analysts build predictions about future trends using historical data.

Regression methods can also go beyond predicting the impact on immediate revenue. Using this method, you can forecast the number of customers willing to buy a service and use that data to estimate the workforce needed to run that service.

Many insurance companies use regression analysis to estimate the credit health of their policyholders and the probable number of claims in a certain period.

Predictive Analysis helps businesses to:

- Minimize costs
- Minimize the number of required tools
- Provide fast and efficient results
- Detect fraud
- Manage risk
- Optimize marketing campaigns

### 4. Correcting errors

Regression Analysis is not only used for predicting trends, but it is also useful to identify errors in judgements.

Let us consider a situation where the executive of an organization wants to increase the working hours of its employees and make them work extra time to increase profits. In such a case, regression analysis examines all the variables, and it may conclude that an increase in working hours beyond the existing schedule will also increase operating expenses such as utilities and accounting expenditures, thus leading to an overall decrease in profit.

Regression Analysis provides quantitative support for better decision-making and helps organizations minimize mistakes.

### 5. New Insights

Organizations generate a large amount of cluttered data that can provide valuable insights. However, this vast data is useless without proper analysis.

Regression analysis helps find relationships between variables by uncovering patterns that were not previously considered.

For example, analyzing data from sales systems and purchase accounts can reveal market patterns such as increased demand on certain days of the week or at certain times of the year. Using this information, you can maintain optimal stock and personnel before a demand spike arises.

The guesswork gets eliminated by data-driven decisions. It allows companies to improve their business performance by concentrating on the significant areas with the highest impact on operations and revenue.

## Use cases of Regression Analysis

### Pharmaceutical companies

Pharmaceutical organizations use regression analysis to analyze the quantitative stability data for the retest period or estimate shelf life. In this method, we find the nature of the relationship between an attribute and time. We determine whether the data should be transformed for linear regression analysis or non-linear regression analysis using the analyzed data.

### Finance

Simple linear regression is commonly fitted with the Ordinary Least Squares (OLS) method. This method provides a general rule for placing the line of best fit among the data points. This tool is used for forecasting and financial analysis. You can also use it with the Capital Asset Pricing Model (CAPM), which depicts the relationship between the risk of investing and the expected return.

### Credit Card

Credit card companies use regression analysis to analyze various factors such as customer's risk of credit default, prediction of credit balance, expected consumer behaviour, and so on. With the help of the analyzed information, the companies apply specific EMI options and minimize the default among risky customers.

## When Should I Use Regression Analysis?

Regression Analysis is mainly used to describe the relationships between a set of independent variables and the dependent variable. It generates a regression equation where the coefficients represent the relationship between each independent variable and the dependent variable.

### Analyze a wide variety of relationships

You can use regression analysis in data analytics to do many things, for example:

- Model multiple independent variables.
- Include continuous and categorical variables.
- Use polynomial terms for curve fitting.
- Evaluate interaction terms to examine whether the effect of one independent variable depends on the value of another variable.

Regression Analysis can untangle very critical problems where the variables are entwined. Consider yourself to be a researcher studying any of the following:

- What impact do socio-economic status and race have on educational achievement?
- Do education and IQ affect earnings?
- Do exercise habits and diet affect weight?
- Do drinking coffee and smoking cigarettes reduce the mortality rate?
- Does a particular exercise have an impact on bone density?

These research questions create a huge amount of data that entwines numerous independent and dependent variables and raises questions about their influence on each other. It is an important task to untangle this web of related variables and find out which variables are statistically significant and the role of each of these variables. To answer all these questions and rescue us in