What is Gradient Descent For Machine Learning

In our day-to-day lives, we are optimizing variables based on our personal decisions and we don’t even recognize the process consciously. We are constantly using optimization techniques all day long, for example, while going to work, choosing a shorter route in order to minimize traffic woes, figuring out and managing a quick walk around the campus during a snack break, or scheduling a cab in advance to reach the airport on time.

Optimization is the ultimate goal, whether you are dealing with actual events in real-life or creating a technology-based product. Optimization is at the heart of most of the statistical and machine learning techniques which are widely used in data science. To gain more knowledge and skills on data science and machine learning, join the  certification course now.

Optimization for Machine Learning

Accuracy is the word we are most concerned with when dealing with problems related to machine learning and artificial intelligence. Errors cannot be tolerated in real-world problems, nor should accuracy be compromised.


Let us consider the case of self-driving cars. The model fitted in the car detects any obstacles that come in the way and takes appropriate actions, such as slowing down or applying the brakes. Keep in mind that there is no human in the car to operate or override the actions taken by the self-driving car. In such a scenario, suppose the model is not accurate. It will not be able to detect other cars or pedestrians and may end up crashing, putting several lives at risk.

This is where we need optimization algorithms to evaluate our model and judge whether the model is performing according to our needs or not. The evaluation can be made easy by calculating the cost function (which we will look into in a while in this article in detail). It is basically a mapping function that tells us about the difference between the desired output and what our model is computing. We can accordingly correct the model and avoid any kind of undesired activities.

Optimization may be defined as the process by which an optimum is achieved. It is all about designing an optimal output for your problem with the resources available. However, optimization in machine learning is slightly different. In most cases, we are aware of the data and its shape and size, which also helps us know the areas we need to improve. But in machine learning we do not know what new data may look like; this is where optimization fits perfectly. Optimization techniques are performed on the training data, and then the validation data set is used to check the model's performance.

There are many advanced applications of optimization which are widely used in airway routing, market basket analysis, face recognition and so on. Machine learning algorithms such as linear regression, KNN and neural networks depend heavily on optimization techniques. Here, we are going to look into one such popular optimization technique called Gradient Descent.

What is Gradient Descent?

Gradient descent is an optimization algorithm which is mainly used to find the minimum of a function. In machine learning, gradient descent is used to update parameters in a model. Parameters can vary according to the algorithms, such as coefficients in Linear Regression and weights in Neural Networks.

Let us relate gradient descent with a real-life analogy for better understanding. Think of a valley you would like to descend when you are blind-folded. Any sane human will take a step and look for the slope of the valley, whether it goes up or down. Once you are sure of the downward slope you will follow that and repeat the step again and again until you have descended completely (or reached the minima).

Gradient Descent in Machine Learning:- Valley, Slope

Similarly, let us consider another analogy. Suppose you have a ball and you place it on an inclined plane (at position A). As per the laws of motion, it will start rolling until it travels to a gentle plane where it will be stationary (at position B as shown in the figure below).

Gradient Descent in Machine Learning:- Ball placed on an inclined plane

This is exactly what happens in gradient descent. The inclined and/or irregular surface is the cost function when it is plotted, and the role of gradient descent is to provide the direction and the speed (learning rate) of the movement in order to attain the minimum of the function, i.e. the point where the cost is lowest.

The graphical representation of Gradient Descent in Machine Learning

How does Gradient Descent work?

The primary goal of machine learning algorithms is always to build a model, which is basically a hypothesis that can be used to estimate Y based on X. Let us consider an example of a model based on certain housing data which comprises the sale price of the house, the size of the house, etc. Suppose we want to predict the price of a house based on its size. It is clearly a regression problem where, given some inputs, we would like to predict a continuous output.

The hypothesis is usually presented as

$$h_\theta(x) = \theta_0 + \theta_1 x$$

where the theta values are the parameters.

Let us look into some examples and visualize the hypothesis:

$\theta_0 = 1.5,\quad \theta_1 = 0$

This yields h(x) = 1.5 + 0x. 0x means no slope, and y will always be the constant 1.5. This looks like:

bar graph of hypothesis with no slope

Now let us consider,

$\theta_0 = 1,\quad \theta_1 = 0.5$

Bar Graph of Hypothesis with slope

Where, h(x) = 1 + 0.5x
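To make the hypothesis concrete, here is a minimal Python sketch (not part of the original article; the function name is chosen purely for illustration) that evaluates h(x) = θ0 + θ1x for the two parameter settings above:

# a minimal sketch of the linear hypothesis h(x) = theta0 + theta1 * x
def hypothesis(theta0, theta1, x):
    return theta0 + theta1 * x

for x in [0, 1, 2, 3]:
    print(x, hypothesis(1.5, 0, x), hypothesis(1, 0.5, x))
# with theta0 = 1.5, theta1 = 0 the output is always 1.5 (no slope);
# with theta0 = 1, theta1 = 0.5 it increases by 0.5 per unit of x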

Cost Function

The objective in the case of gradient descent is to find a line of best fit for some given inputs, or X values, and any number of Y values, or outputs. A cost function is defined as “a function that maps an event or values of one or more variables onto a real number intuitively representing some “cost” associated with the event.”

With a known set of inputs and their corresponding outputs, a machine learning model attempts to make predictions according to the new set of inputs.

Machine Learning Process

The error would be the difference between the actual output and the predicted output.

This relates to the idea of a Cost function or Loss function.

A Cost Function/Loss Function tells us “how good” our model is at making predictions for a given set of parameters. The cost function has a curve and a gradient, the slope of this curve helps us to update our parameters and make an accurate model.

Minimizing the Cost Function

It is always the primary goal of any Machine Learning Algorithm to minimize the Cost Function. Minimizing cost functions will also result in a lower error between the predicted values and the actual values which also denotes that the algorithm has performed well in learning. 

How do we actually minimize any function?

A simple cost function has the form Y = X². In a Cartesian coordinate system, this is the equation of a parabola and can be graphically represented as:

Parabola

Now in order to minimize the function mentioned above, firstly we need to find the value of X which will produce the lowest value of Y (in this case it is the red dot). With lower dimensions (like 2D in this case) it becomes easier to locate the minima but it is not the same while dealing with higher dimensions. For such cases, we need to use the Gradient Descent algorithm to locate the minima.
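As a quick sanity check of what gradient descent will do numerically, the minimum of this particular parabola can also be found analytically:

$$Y = X^2, \qquad \frac{dY}{dX} = 2X = 0 \;\Rightarrow\; X = 0$$

A single gradient descent step on this function is $X_{new} = X - \alpha \cdot 2X = (1 - 2\alpha)X$, which shrinks X towards 0 as long as 0 < α < 1.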

Now a function is required which will minimize the parameters over a dataset. The most commonly used function is the mean squared error. It measures the difference between the estimated value (the prediction) and the true value (from the dataset).

$$MSE = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$

Mean Squared Error

It turns out we can adjust the equation a little, dividing by 2, to make the calculation down the track a little simpler.

Now a question may arise, Why do we take the squared differences and simply not the absolute differences? Because the squared differences make it easier to derive a regression line. Indeed, to find that line we need to compute the first derivative of the Cost function, and it is much harder to compute the derivative of absolute values than squared values. Also, the squared differences increase the error distance, thus, making the bad predictions more pronounced than the good ones.
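To see why the squared error is easier to work with, compare the derivatives of the two choices for a single prediction (a standard observation, not something specific to this article):

$$\frac{\partial}{\partial \theta}\left(h_\theta(x) - y\right)^2 = 2\left(h_\theta(x) - y\right)\frac{\partial h_\theta(x)}{\partial \theta}$$

$$\frac{\partial}{\partial \theta}\left|h_\theta(x) - y\right| = \mathrm{sign}\left(h_\theta(x) - y\right)\frac{\partial h_\theta(x)}{\partial \theta}$$

The squared error gives a smooth gradient whose size grows with the error, while the absolute error is not differentiable at $h_\theta(x) = y$ and its gradient carries no information about how large the error is.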

The equation looks like -

$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Mean Squared Error

Let us apply this cost function to the following data:

Data set: the points (1, 1), (2, 2) and (3, 3)

Here we will calculate some of the theta values and then plot the cost function by hand. Since this function passes through (0, 0), we will look only at a single value of theta. Also, let us refer to the cost function as J(ϴ) from now on.

When the value of ϴ is 1, for J(1), we get a 0. You will notice the value of J(1) gives a straight line which fits the data perfectly. Now let us try with ϴ = 0.5

J(0.5)
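To see where this value comes from, here is the calculation written out, assuming the same three data points (1, 1), (2, 2) and (3, 3) that are used in the Python example further below:

$$J(0.5) = \frac{1}{2 \cdot 3}\left[(0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2\right] = \frac{0.25 + 1 + 2.25}{6} = \frac{3.5}{6} \approx 0.58$$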

The MSE function gives us a value of 0.58. Let’s plot both our values so far:

J(1) = 0

J(0.5) = 0.58

With J(1) and J(0.5)

Let us go ahead and calculate some more values of J(ϴ).

Additional values of J(ϴ)

Now if we join the dots carefully, we will get -

Visualizing the cost function J(ϴ)

As we can see, the cost function is at a minimum when theta = 1, which means the initial data is a straight line with a slope or gradient of 1 as shown by the orange line in the above figure.

Using a trial and error method, we minimized J(ϴ). We did all of these by trying out a lot of values and with the help of visualizations. Gradient Descent does the same thing in a much better way, by changing the theta values or parameters until it descends to the minimum value.

You may refer to the Python code below to compute the cost function:

import matplotlib.pyplot as plt
import numpy as np

# original data set
X = [1, 2, 3]
y = [1, 2, 3]

# slope of best_fit_1 is 0.5
# slope of best_fit_2 is 1.0
# slope of best_fit_3 is 1.5
hyps = [0.5, 1.0, 1.5]

# multiply the original X values by theta
# to produce hypothesis values for each X
def multiply_matrix(mat, theta):
    mutated = []
    for i in range(len(mat)):
        mutated.append(mat[i] * theta)
    return mutated

# calculate cost by looping over each sample:
# take the difference between hyp(x) and y,
# square it, sum the results and scale by 1/(2m)
def calc_cost(m, X, y):
    total = 0
    for i in range(m):
        squared_error = (y[i] - X[i]) ** 2
        total += squared_error
    return total * (1 / (2 * m))

# calculate cost for each hypothesis
for i in range(len(hyps)):
    hyp_values = multiply_matrix(X, hyps[i])
    print("Cost for ", hyps[i], " is ", calc_cost(len(X), y, hyp_values))
Cost for 0.5 is 0.5833333333333333
Cost for 1.0 is 0.0
Cost for 1.5 is 0.5833333333333333
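As a side note, the same cost can be computed without explicit loops by using NumPy vectorization; below is a minimal sketch under the same data and hypotheses (names are illustrative, not from the original article):

import numpy as np

X = np.array([1, 2, 3])
y = np.array([1, 2, 3])

def cost(theta):
    # J(theta) = 1/(2m) * sum((theta * x - y)^2)
    m = len(X)
    return np.sum((theta * X - y) ** 2) / (2 * m)

for theta in [0.5, 1.0, 1.5]:
    print("Cost for", theta, "is", cost(theta))

This prints the same three values as the loop-based version above.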

Learning Rate

Let us now start by initializing theta0 and theta1 to any two values, say 0 for both, and go from there. The algorithm is as follows:

Repeat until convergence:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad \text{(for } j = 0 \text{ and } j = 1\text{)}$$

Gradient Descent


where α (alpha) is the learning rate, which determines how rapidly we move towards the minimum. If the value of α is too large, we can overshoot the minimum.

Big Learning Rate vs Small Learning Rate

The derivative which refers to the slope of the function is calculated. Here we calculate the partial derivative of the cost function. It helps us to know the direction (sign) in which the coefficient values should move so that they attain a lower cost on the following iteration. 

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

Partial derivative of the cost function which we need to calculate

Once we know the direction from the derivative, we can update the coefficient values. Now you need to specify a learning rate parameter which will control how much the coefficients can change on each update.

coefficient = coefficient - (alpha * delta)

This process is repeated until the cost of the coefficients is 0.0 or close enough to zero.

This turns out to be:

Repeat until convergence:

$$\theta_0 := \theta_0 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}$$

Image from Andrew Ng's machine learning course

Which gives us linear regression!

Linear Regression
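The two update rules above translate almost line-for-line into code. Below is a minimal NumPy sketch (not the article's original code; variable names are chosen for illustration) that fits θ0 and θ1 on the same toy data set used earlier:

import numpy as np

# toy data set where y = x, so the ideal fit is theta0 = 0, theta1 = 1
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def batch_gradient_descent(x, y, alpha=0.1, iterations=1000):
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y
        # simultaneous update of both parameters, as in the equations above
        grad0 = (1 / m) * np.sum(error)
        grad1 = (1 / m) * np.sum(error * x)
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

print(batch_gradient_descent(x, y))   # approaches (0.0, 1.0)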

Types of Gradient Descent Algorithms

Gradient descent variants' trajectory towards the minimum

1. Batch Gradient Descent: In this type of gradient descent, all the training examples are processed for each iteration of gradient descent. It gets computationally expensive if the number of training examples is large. This is when batch gradient descent is not preferred, rather a stochastic gradient descent or mini-batch gradient descent is used.

Algorithm for batch gradient descent:

Let hθ(x) be the hypothesis for linear regression. Then, the cost function is given by:

Let Σ represent the sum over all training examples from i = 1 to m.

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Repeat {

$$\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

For every j = 0 … n

}

Where $x_j^{(i)}$ represents the jth feature of the ith training example. So if m is very large, every single update requires summing over all m examples, which makes each step towards the global minimum very slow.

2. Stochastic Gradient Descent: The word stochastic relates to a system or process that involves random probability. Therefore, in Stochastic Gradient Descent (SGD) samples are selected at random for each iteration instead of using the entire data set. When the number of training examples is too large, it becomes computationally expensive to use batch gradient descent; Stochastic Gradient Descent, however, uses only a single sample, i.e., a batch size of one, to perform each iteration. The data set is randomly shuffled and a sample is selected for performing the iteration. The parameters are updated after each iteration, in which only a single example has been processed. Thus, it is faster than batch gradient descent.

Stochastic Gradient Descent in Machine Learning

Algorithm for stochastic gradient descent:

  1. Firstly shuffle the data set randomly in order to train the parameters evenly for each type of data.
  2. As mentioned above, it takes into consideration one example per iteration.

Hence,
Let $(x^{(i)}, y^{(i)})$ be a training example. The cost of a single example and the overall training cost are:

$$cost\left(\theta, (x^{(i)}, y^{(i)})\right) = \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

$$J_{train}(\theta) = \frac{1}{m}\sum_{i=1}^{m} cost\left(\theta, (x^{(i)}, y^{(i)})\right)$$

Repeat {
For i = 1 to m {

$$\theta_j := \theta_j - \alpha\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

        For every j = 0 … n
              }
}
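A rough Python sketch of the same linear-regression problem trained with stochastic updates, one randomly chosen example at a time (again an illustrative sketch, not the article's code):

import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def sgd(x, y, alpha=0.05, epochs=200):
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(epochs):
        # shuffle the data set at the start of every epoch
        for i in rng.permutation(m):
            error = (theta0 + theta1 * x[i]) - y[i]
            # update the parameters using this single example only
            theta0 -= alpha * error
            theta1 -= alpha * error * x[i]
    return theta0, theta1

print(sgd(x, y))   # approaches (0.0, 1.0); with noisy data it would keep fluctuating around the minimum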

3. Mini-batch gradient descent: This type of gradient descent is often faster than both batch gradient descent and stochastic gradient descent. Even if the number of training examples is large, the data is processed in batches of b examples at a time. Also, fewer updates are needed per epoch than with stochastic gradient descent, even while working with larger training samples.

Mini Batch gradient descent graph in Machine Learning

Algorithm for mini-batch gradient descent:

Let b be the number of examples in one batch, where b < m. Now, assume b = 10 and m = 100.
The batch size can be adjusted. It is generally kept as a power of 2, because some hardware, such as GPUs, achieves better run times with batch sizes that are powers of 2.

Repeat {

For i = 1, 11, 21, ….., 91 {

Let Σ denote the summation from k = i to i + 9.

$$\theta_j := \theta_j - \alpha\frac{1}{10}\sum_{k=i}^{i+9}\left(h_\theta(x^{(k)}) - y^{(k)}\right)x_j^{(k)}$$

  For every j = 0 … n
  }
}
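And a minimal mini-batch sketch of the same idea, here with a batch size of b = 2 on a tiny data set rather than the b = 10, m = 100 of the description above (illustrative only):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0, 4.0])

def mini_batch_gd(x, y, b=2, alpha=0.05, epochs=500):
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(epochs):
        for start in range(0, m, b):
            xb, yb = x[start:start + b], y[start:start + b]
            error = (theta0 + theta1 * xb) - yb
            # average the gradient over the b examples in this batch
            theta0 -= alpha * np.mean(error)
            theta1 -= alpha * np.mean(error * xb)
    return theta0, theta1

print(mini_batch_gd(x, y))   # approaches (0.0, 1.0)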

Convergence trends in different variants of Gradient Descent

For Batch Gradient Descent, the algorithm traces a straight line towards the minimum. If the cost function is convex, then it converges to a global minimum and if the cost function is not convex, then it converges to a local minimum. The learning rate is typically held constant over here.

Convergence trends in different variants of Gradient Descent in Machine Learning

For stochastic gradient descent and mini-batch gradient descent, the algorithm keeps fluctuating around the global minimum instead of settling on it. In order to converge, the learning rate needs to be decreased slowly.

Challenges in executing Gradient Descent

There are many cases where gradient descent fails to perform well. There are mainly three areas where problems arise:

  1. Data challenges
  2. Gradient challenges
  3. Implementation challenges

Data Challenges

  • The arrangement of data sometimes leads to challenges. If the data is arranged in a way that poses a non-convex optimization problem, it becomes difficult to perform optimization using gradient descent. Gradient descent works best for problems with a well-defined convex optimization objective.
  • While optimizing a non-convex problem, you will come across several minimum points. The lowest among all the points is called the global minimum, and the other points are called local minima. You will have to make sure you reach the global minimum and avoid getting stuck in local minima.
  • There is also the saddle point problem. This is a situation where the gradient is zero but the point is not an optimum. It cannot always be avoided and is still an active area of research.

Gradient Challenges

  • While using gradient descent, if the execution is not proper, it can lead to problems like vanishing or exploding gradients. These happen when the gradient is too small or too large, respectively, which results in the algorithm failing to converge.

Implementation Challenges

  • Insufficient memory can cause the training of a network to fail. Many neural network practitioners do not pay attention to this, but it is very important to look at the resource utilization of the network.
  • Another important thing is to keep track of floating-point precision and hardware/software prerequisites.

Variants of Gradient Descent algorithms

Let us look at some of the most commonly used gradient descent algorithms and how they are implemented.

Vanilla Gradient Descent

One of the simplest forms of the gradient descent technique is vanilla gradient descent. Here, vanilla means pure / without any adulteration. Its main feature is that small steps are taken in the direction of the minima by taking the gradient of the cost function.

The pseudocode for the same is mentioned below.

update = learning_rate * gradient_of_parameters
parameters = parameters - update

As you can see, the parameters are updated by taking the gradient of the parameters and multiplying it by the learning rate, which suggests how quickly we should move towards the minimum. The learning rate is a hyper-parameter, and its value should be chosen carefully.

Vanilla Gradient Descent Graph in Machine Learning

Gradient Descent with Momentum

In this case, we adjust the algorithm so that it takes the prior step into account before taking the next one.

The pseudocode for the same is mentioned below.

update = learning_rate * gradient
velocity = previous_update * momentum
parameter = parameter + velocity - update

Here, our update is the same as that of vanilla gradient descent. But we are introducing a new term called velocity, which considers the previous update and a constant which is called momentum.

Gradient Descent with Momentum Update (Source)

ADAGRAD

ADAGRAD (Adaptive Gradient Algorithm) uses an adaptive technique to update the learning rate. In this algorithm, the learning rate is changed based on how the gradient has been changing over all the previous iterations.

The pseudocode for the same is mentioned below.

grad_component = previous_grad_component + (gradient * gradient)
rate_change = square_root(grad_component) + epsilon
adapted_learning_rate = learning_rate / rate_change
update = adapted_learning_rate * gradient
parameter = parameter - update

In the above code, epsilon is a small constant added to avoid division by zero while the accumulated gradient is still very small.

ADAM

ADAM is another adaptive technique which is built out of ADAGRAD and further reduces its downside. In simple words you can consider it to be ADAGRAD + momentum.

The pseudocode for the same is mentioned below.

first_moment = (beta1 * previous_first_moment) + ((1 - beta1) * gradient)
second_moment = (beta2 * previous_second_moment) + ((1 - beta2) * gradient * gradient)
adapted_learning_rate = learning_rate / (square_root(second_moment) + epsilon)
update = adapted_learning_rate * first_moment
parameter = parameter - update

Here beta1 and beta2 are decay constants that control how quickly the moving averages of the gradient and of the squared gradient forget older values, keeping the changes in the gradient estimate and the learning rate in check.
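To see how these update rules differ in practice, here is a small illustrative Python sketch (not from the original article) that applies the vanilla, momentum and Adam-style updates to the one-dimensional function f(x) = x², whose gradient is 2x. The hyper-parameter values are common defaults chosen for the example, not prescriptions:

def grad(x):
    # gradient of f(x) = x**2
    return 2 * x

def vanilla(x, lr=0.1, steps=100):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def momentum(x, lr=0.1, mom=0.9, steps=100):
    velocity = 0.0
    for _ in range(steps):
        velocity = mom * velocity - lr * grad(x)
        x += velocity
    return x

def adam(x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # moving average of the gradient
        v = beta2 * v + (1 - beta2) * g * g    # moving average of the squared gradient
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (v_hat ** 0.5 + eps)
    return x

print(vanilla(5.0), momentum(5.0), adam(5.0))   # all three end up close to the minimum at x = 0

Note that this sketch includes the bias-correction terms from the published Adam method, which the simplified pseudocode above leaves out.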

Tips for Gradient Descent

In this section you will learn about some tips and tricks for getting the most out of the gradient descent algorithm for machine learning.

  • Plot Cost versus Time: It is suggested to collect and plot the cost values calculated by the algorithm for each iteration. It helps you keep track of the descent. For a well-performing gradient descent the cost always decreases in each iteration. If you see there is no decrease, reduce the learning rate.
  • Learning Rate: The learning rate value is a small real value such as 0.1, 0.001 or 0.0001. Keep trying different values to check which works best for your algorithm.
  • Rescale Inputs: Try to achieve a range such as [0, 1] or [-1, 1] by rescaling all the input variables (see the short sketch after this list). The algorithm reaches the minimum cost faster if the shape of the cost function is not distorted or skewed.
  • Few Passes: Stochastic gradient descent often does not need more than 1-to-10 passes through the training dataset to converge on good or good enough coefficients.
  • Plot Mean Cost: The updates for each training dataset instance can result in a noisy plot of cost over time when using stochastic gradient descent. Try to take the average over 10, 100, or 1000 updates. This will give you a better idea of the learning trend for the algorithm.
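As a quick illustration of the rescaling tip, here is a minimal min-max scaling sketch; the feature values are made up for the example:

import numpy as np

# hypothetical raw feature, e.g. house sizes in square feet
sizes = np.array([850.0, 1200.0, 1500.0, 2300.0])

# min-max rescaling to the range [0, 1]
scaled = (sizes - sizes.min()) / (sizes.max() - sizes.min())
print(scaled)   # [0.         0.24137931 0.44827586 1.        ]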

Implementation of Gradient Descent in Python

Now that we have gone through all the elements related to gradient descent, let us implement gradient descent in Python. A simple gradient descent algorithm is as follows:

  1. Obtain a function f(x) that you want to minimize
  2. Initialize a value x from which you want to start the descent or optimization from
  3. Specify a learning rate which will determine how much of a step to descend by or how quickly you want to converge to the minimum value
  4. Find the derivative of that value x (the descent)
  5. Now proceed to descend by that derivative multiplied by the learning rate
  6. Update the value of x with the new value descended to
  7. Check your stop condition in order to see whether to stop
  8. If condition satisfies, stop. If not, proceed to step 4 with the new x value and keep repeating the algorithm

Let us create an arbitrary loss function and try to find a local minimum value for that function by implementing a simple representation of gradient descent using Python.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

We will use gradient descent to find a local minimum of this function: x³ - 3x² + 5

#creating the function and plotting it

function = lambda x: (x ** 3)-(3*(x ** 2))+5

#Get 500 evenly spaced numbers between -1 and 3 (arbitrarily chosen to ensure a steep curve)
x = np.linspace(-1,3,500)

#Plot the curve
plt.plot(x, function(x))
plt.show()

plotting the data set

Here, we can see that our minimum value should be around x = 2.0.
Let us now use gradient descent to find the exact value.

def deriv(x):
    '''
    Description: This function takes in a value of x and returns its derivative based on the
    initial function we specified.

    Arguments:
    x - a numerical value of x

    Returns:
    x_deriv - a numerical value of the derivative of x
    '''
    x_deriv = 3 * (x ** 2) - (6 * x)
    return x_deriv


def step(x_new, x_prev, precision, l_r):
    '''
    Description: This function takes in an initial or previous value for x, updates it based on
    steps taken via the learning rate and outputs the minimum value of x that reaches the precision satisfaction.

    Arguments:
    x_new - a starting value of x that will get updated based on the learning rate
    x_prev - the previous value of x that is getting updated to the new one
    precision - a precision that determines the stop of the stepwise descent
    l_r - the learning rate (size of each descent step)

    Output:
    1. Prints out the latest new value of x which equates to the minimum we are looking for
    2. Prints out the number of x values which equates to the number of gradient descent steps
    3. Plots a first graph of the function with the gradient descent path
    4. Plots a second graph of the function with a zoomed in gradient descent path in the important area
    '''
    # create empty lists where the updated values of x and y will be appended during each iteration
    x_list, y_list = [x_new], [function(x_new)]

    # keep looping until your desired precision is reached
    while abs(x_new - x_prev) > precision:
        # change the value of x
        x_prev = x_new

        # get the negative derivative of the old value of x
        d_x = - deriv(x_prev)

        # get your new value of x by adding the product of the derivative and the learning rate to the previous value
        x_new = x_prev + (l_r * d_x)

        # append the new value of x to a list of all x-s for later visualization of the path
        x_list.append(x_new)

        # append the new value of y to a list of all y-s for later visualization of the path
        y_list.append(function(x_new))

    print("Local minimum occurs at: " + str(x_new))
    print("Number of steps: " + str(len(x_list)))

    plt.subplot(1, 2, 2)
    plt.scatter(x_list, y_list, c="g")
    plt.plot(x_list, y_list, c="g")
    plt.plot(x, function(x), c="r")
    plt.title("Gradient descent")
    plt.show()

    plt.subplot(1, 2, 1)
    plt.scatter(x_list, y_list, c="g")
    plt.plot(x_list, y_list, c="g")
    plt.plot(x, function(x), c="r")
    plt.xlim([1.0, 2.1])
    plt.title("Zoomed in Gradient descent to Key Area")
    plt.show()

#Implement gradient descent (all the arguments are arbitrarily chosen)
step(0.5, 0, 0.001, 0.05)

Local minimum occurs at: 1.9980265135950486
Number of steps: 25 

Gradient Descent Machine Learning Graph

Zoomed in Gradient Descent to Key Area in Machine Learning

Summary

In this article, you have learned about gradient descent for machine learning, and we have tried to cover most of the important topics. To learn more about machine learning algorithms in depth, click here. Let us summarize what we have covered in this article.

  • Optimization is the heart and soul of machine learning.
  • Gradient descent is a simple optimization technique which can be used with other machine learning algorithms.
  • Batch gradient descent refers to calculating the derivative from all training data before calculating an update.
  • Stochastic gradient descent refers to calculating the derivative from each training data instance and calculating the update immediately.

If you are inspired by the opportunities provided by Data Science, enrol in our  Data Science and Machine Learning Courses for more lucrative career options in this landscape.


Priyankur Sarkar

Data Science Enthusiast

Priyankur Sarkar loves to play with data and get insightful results out of it, then turn those data insights and results in business growth. He is an electronics engineer with a versatile experience as an individual contributor and leading teams, and has actively worked towards building Machine Learning capabilities for organizations.

Join the Discussion

Your email address will not be published. Required fields are marked *

Suggested Blogs

What Is Data Science(with Examples), It's Lifecycle and Who exactly is a Data Scientist

Oh yes, Science is everywhere. A while ago, when children embarked on the journey of learning everyday science in school, the statement that always had a mention was “Science is everywhere”. The situation is more or less the same even in present times. Science has now added a few feathers to its cap. Yes, the general masses sing the mantra “Data Science” is everywhere. What does it mean when I say Data Science is everywhere? Let us take a look at the Science of Data. What are those aspects that make this Science unique from everyday Science?The Big Data Age as you may call it has in it Data as the object of study.Data Science for a person who has set up a firm could be a money spinnerData Science for an architect working at an IT consulting company could be a bread earnerData Science could be the knack behind the answers that come out from the juggler’s hatData Science could be a machine imported from the future, which deals with the Math and Statistics involved in your lifeData science is a platter full of data inference, algorithm development, and technology. This helps the users find recipes to solve analytically complex problems.With data as the core, we have raw information that streams in and is stored in enterprise data warehouses acting as the condiments to your complex problems. To extract the best from the data generated, Data Science calls upon Data Mining. At the end of the tunnel, Data Science is about unleashing different ways to use data and generate value for various organizations.Let us dig deeper into the tunnel and see how various domains make use of Data Science.Example 1Think of a day without Data Science, Google would not have generated results the way it does today.Example 2Suppose you manage an eatery that churns out the best for different taste buds. To model a product in the pipeline, you are keen on knowing what the requirements of your customers are. Now, you know they like more cheese on the pizza than jalapeno toppings. That is the existing data that you have along with their browsing history, purchase history, age and income. Now, add more variety to this existing data. With the vast amount of data that is generated, your strategies to bank upon the customers’ requirements can be more effective. One customer will recommend your product to another outside the circle; this will further bring more business to the organization.Consider this image to understand how an analysis of the customers’ requirements helps:Example 3Data Science plays its role in predictive analytics too.I have an organization that is into building devices that will send a trigger if a natural calamity is soon to occur. Data from ships, aircraft, and satellites can be accumulated and analyzed to build models that will not only help with weather forecasting but also predict the occurrence of natural calamities. The model device that I build will send triggers and save lives too.Consider the image shown below to understand how predictive analytics works:Example 4A lot many of us who are active on social media would have come across this situation while posting images that show you indulging in all fun and frolic with your friends. 
You might miss tagging your friends in the images you post but the tag suggestion feature available on most platforms will remind you of the tagging that is pending.The automatic tag suggestion feature uses the face recognition algorithm.Lifecycle of Data ScienceCapsulizing the main phases of the Data Science Lifecycle will help us understand how the Data Science process works. The various phases in the Data Science Lifecycle are:DiscoveryData PreparationModel PlanningModel BuildingOperationalizingCommunicating ResultsPhase 1Discovery marks the first phase of the lifecycle. When you set sail with your new endeavor,it is important to catch hold of the various requirements and priorities. The ideation involved in this phase needs to have all the specifications along with an outline of the required budget. You need to have an inquisitive mind to make the assessments – in terms of resources, if you have the required manpower, technology, infrastructure and above all time to support your project. In this phase, you need to have a business problem laid out and build an initial hypotheses (IH) to test your plan. Phase 2Data preparation is done in this phase. An analytical sandbox is used in this to perform analytics for the entire duration of the project. While you explore, preprocess and condition data, modeling follows suit. To get the data into the sandbox, you will perform ETLT (extract, transform, load and transform).We make use of R for data cleaning, transformation, and visualization and further spot the outliers and establish a relationship between the variables. Once the data is prepared after cleaning, you can play your cards with exploratory analytics.Phase 3In this phase of Model planning, you determine the methods and techniques to pick on the relationships between variables. These relationships set the base for the algorithms that will be implemented in the next phase.  Exploratory Data Analytics (EDA) is applied in this phase using various statistical formulas and visualization tools.Subsequently, we will look into the various models that are required to work out with the Data Science process.RR is the most commonly used tool. The tool comes with a complete set of modeling capabilities. This proves a good environment for building interpretive models.SQL Analysis Services SQL Analysis services has the ability to perform in-database analytics using basic predictive models and common data mining functions.SAS/ACCESS  SAS/ACCESS helps you access data from Hadoop. This can be used for creating repeatable and reusable model flow diagrams.You have now got an overview of the nature of your data and have zeroed in on the algorithms to be used. In the next stage, the algorithm is applied to further build up a model.Phase 4This is the Model building phase as you may call it. Here, you will develop datasets for training and testing purposes. You need to understand whether your existing tools will suffice for running the models that you build or if a more robust environment (like fast and parallel processing) is required. The various tools for model building are SAS Enterprise Miner, WEKA, SPCS Modeler, Matlab, Alpine Miner and Statistica.Phase 5In the Operationalize phase, you deliver final reports, briefings, code and technical documents. Moreover, a pilot project may also be implemented in a real-time production environment on a small scale. 
This helps users get a clear picture of the performance and other related constraints before full deployment.Phase 6The Communicate results phase is the conclusion. Here, we evaluate if you have been able to meet your goal the way you had planned in the initial phase. It is in this phase that the key findings pop their heads out. You communicate to the stakeholders in this phase. This phase brings you the result of your project whether it is a success or a failure.Why Do We Need Data Science?Data Science to be precise is an amalgamation of Infrastructure, Software, Statistics and the various data sources.To really understand big data, it would help us if we bridge back to the historical background. Gartner’s definition circa 2001, which is still the go-to definition says,Big data is data that contains greater variety arriving in increasing volumes and with ever-higher velocity. This is known as the three Vs.When we break the definition into simple terms, all that it means is, big data is humongous. This involves the multiplication of complex data sets with the addition of new data sources. When the data sets are in such high volumes, our traditional data processing software fails to manage them. It is just like how you cannot expect your humble typewriter to do the job of a computer. You cannot expect a typewriter to even do the ctrl c + ctrl v job for you. The amount of data that comes with the solutions to all your business problems is massive. To help you with the processing of this data, you have Data Science playing the key role.The concept of big data itself may sound relatively new; however, the origins of large data sets can be traced back to the 1960s and the '70s. This is when the world of data was just getting started. The world witnessed the set up of the first data centers and the development of the relational database.Around 2005, Facebook, YouTube, and other online services started gaining immense popularity. The more people indulged in the use of these platforms, the more data they generated. The processing of this data involved a lot of Data Science. The masses had to store the amassed data and analyse it at a later point. As a platform that answers to the storage and analysis of the amassed data, Hadoop was developed. Hadoop is an open-source framework that helps in the storage and analysis of big data sets. And as we say, the rest will follow suit; we had NoSQL gaining popularity during this time.With the advent of big data, the need for its storage also grew. The storage of data became a major issue for enterprise industries until 2010. We have had Hadoop, Spark and other frameworks mitigating the challenge to a very large extent. Though the volume of big data is skyrocketing, the focus remains on the processing of the data, all thanks to these efficient frameworks. And, Data Science once again hogs the limelight.Can we say it is only the users leading to huge amounts of data? No, we cannot. It is not only humans generating the data but also the work they indulge in.Delving into the iota of the Internet of Things (IoT) will get us some clarity on the question that we just raised. As we have more objects and devices connected to the Internet, data gathers not just by use but also by the pattern of your usage and the performance of the various products.The Three Vs of Big DataData Science helps in the extraction of knowledge from the accumulated data. 
While big data has come far with the accumulation of users' data, its usefulness is only just beginning. Following are the three properties that define Big Data:
Volume
Velocity
Variety
Volume
The amount of data is a crucial factor here. Big data stands as a pillar when you have to process a multitude of low-density, unstructured data. The data may contain values of unknown worth – such as clickstreams on a webpage or a mobile app, or Twitter data feeds. The scale differs from user to user: for some, it might be tens of terabytes of data; for others, hundreds of petabytes. Consider the different social media platforms – Facebook records 2 billion users, YouTube has 1 billion users, Twitter has 350 million users and Instagram a whopping 700 million users. Billions of images, posts and tweets are exchanged on these platforms. Imagine the sheer volume of data the users contribute. Mind-boggling, is it not? This insanely large amount of data is generated every minute and every hour.
Velocity
Velocity is the fast rate at which data is received and acted upon. Usually, data is written to disk; data with the highest velocity streams directly into memory. With the advancement in technology, we now have more Internet-connected devices across industries. The data generated by these devices in real time or near real time may call for real-time evaluation and action. Sticking to our social media example, Facebook accounts for 900 million photo uploads, Twitter handles 500 million tweets, Google serves as the go-to solution for 3.5 billion searches, and YouTube sees 0.4 million hours of video uploads – all of this on a daily basis. The bundled amount of data is stifling.
Variety
The data generated by users comes in different types, and the different types form different varieties of data. Dating back, we had traditional data types that were structured and organized in a relational database. Texts, tweets, videos and uploaded photos, along with voicemails, emails, ECG readings and audio recordings, form the different varieties of unstructured and semi-structured data that we find on the Internet.
Who is a Data Scientist?
A curious brain and impressive training are all that you need to become a Data Scientist. It is not as easy as it may sound. Deep thinking and deep learning with intense intellectual curiosity are common traits found in data scientists. The more questions you ask, the more discoveries you come up with and the more augmented your learning experience is, the easier it gets for you to tread the path of Data Science. A factor that differentiates a data scientist from a normal bread earner is the obsession with creativity and ingenuity. A normal bread earner goes seeking money, whereas the motivator for a data scientist is the ability to solve analytical problems with a pinch of curiosity and creativity. Data scientists are always on a treasure hunt – hunting for the best from the trove. If you think you need a degree in the Sciences or a PhD in Math to become a legitimate data scientist, mind you, you are carrying a misconception. A natural propensity in these areas will definitely add to your profile, but you can be an expert data scientist without a degree in these areas too. 
Data Science becomes a cinch with heaps of knowledge in programming and business acumen. Data Science is a discipline gaining colossal prominence of late. Educational institutions are yet to come up with comprehensive Data Science degree programs. A data scientist can never claim to have undergone all the required schooling; learning the right skills, guided by self-determination, is a never-ending process for a data scientist. As Data Science is multidisciplinary, many people find it confusing to differentiate between a Data Scientist and a Data Analyst. Data Analytics is one of the components of Data Science. Analytics helps in understanding the data structure of an organization, and the output achieved is further used to solve problems and ring in business insights.
The Basic Differences between a Data Scientist and a Data Analyst
Scientists and Analysts are not exactly synonymous, nor are the roles mutually exclusive, but they do differ a lot. Let us take a look at some of the basic differences:
Criteria | Data Scientist | Data Analyst
Goal | An inquisitive nature and strong business acumen help Data Scientists arrive at solutions | Data Analysts perform data analysis and sourcing
Tasks | Adept at data insight mining, preparation, and analysis to extract information | Gather, arrange, process and model both structured and unstructured data
Substantive expertise | Required | Not required
Non-technical skills | Required | Not required
What Skills Are Required To Become a Data Scientist?
Data scientists blend the best of several skills. The fundamental skills required to become a Data Scientist are as follows:
Proficiency in Mathematics
Technology know-how and the knack to hack
Business acumen
Proficiency in Mathematics
A Data Scientist needs to be equipped with a quantitative lens. You can be a Data Scientist if you have the ability to view data quantitatively. Before a data product is finally built, it calls for a tremendous amount of data insight mining. There are portions of data that include textures, dimensions and correlations, and a mathematical perspective always helps in working towards an end product. If you have the knack for Math, finding solutions using data becomes a cakewalk laden with heuristics and quantitative techniques. The path to finding solutions to major business problems is a tedious one; it involves building analytical models, and Data Scientists need to identify the underlying nuts and bolts to build those models successfully. Data Science carries with it the misconception that it is all about statistics. Statistics is crucial, but it is not the only kind of mathematics that counts. Statistics has two offshoots – the classical and the Bayesian. When people talk about stats, they are usually referring to classical stats; Data Scientists need to draw on both types to arrive at solutions. Moreover, there is a mix of inferential techniques and machine learning algorithms, and this mix leans on the knowledge of linear algebra. Several popular methods in Data Science rely on matrix math, which has very little to do with classical stats.
Technology know-how and the knack to hack
On a lighter note, let us put a disclaimer – you are not being asked to learn hacking to come crashing down on computers. As a hacker, you need to be gelled with the amalgam of creativity and ingenuity. 
You are expected to use the right technical skills to build models and thereby find solutions to complex analytical problems.Why does the world of Data Science vouch on your hacking ability? The answer finds its element in the use of technology by Data Scientists. Mindset, training and the right technology when put together can squeeze out the best from mammoth data sets. Solving complex algorithms requires more sophisticated tools than just Excel. Data scientists need to have the nitty-gritty ability to code. They should be able to prototype quick solutions, as well as integrate with complex data systems. SQL, Python, R, and SAS are the core languages associated with Data Science. A knowhow of Java, Scala, Julia, and other languages also helps. However, the knowledge of language fundamentals does not suffice the quest to extract the best from enormous data sets. A hacker needs to be creative to sail through technical waters and make the codes reach the shore.Business AcumenA strong business acumen is a must-have in the portfolio of any Data Scientist. You need to make tactical moves and fetch that from the data, which no one else can. To be able to translate your observation and make it a shared knowledge calls for a lot of responsibility that can face no fallacy.With the right business acumen, a Data Scientist finds it easy to present a story or the narration of a problem or a solution.To be able to put your ideas and the solutions you arrive at across the table, you need to have business acumen along with the prowess for tech and algorithms.Data, Math, and tech will not help always. You need to have a strong business influence that can further be influenced by a strong business acumen.Companies Using Data ScienceTo address the issues associated with the management of complex and expanding work environments, IT organizations make use of data to identify new value sources. The identification helps them exploit future opportunities and to further expand their operations. What makes the difference here is the knowledge you extract from the repository of data. The biggest and the best companies use analytics to efficiently come up with the best business models.Following are a few top companies that use Data Science to expand their services and increase their productivity.GoogleAmazonProcter & GambleNetflixGoogle Google has always topped the list on a hiring spree for top-notch data scientists. A force of data scientists, artificial intelligence and machine learning by far drives Google. Moreover, when you are here, you get the best when you give the best of your data expertise.AmazonAmazon, the global e-commerce and cloud computing giant hire data scientists on a big scale. To bank upon the customers’ mindsets, enhance the geographical outreach of both the cloud domain and e-commerce domain among other business-driven goals, they make use of Data Science. Data Scientists play a crucial role in steering Data Science.Procter & Gamble and NetflixBig Data is a major component of Data Science.It has answers to a range of business problems – from customer experience to analytics.Netflix and Procter & Gamble join the race of product development by using big data to anticipate customer demand. They make use of predictive analytics, an offshoot of Data Science to build models for services in their pipeline. This modelling is an attribute that contributes to their commercial success. 
The significant addition to the commercial success of P&G is that it uses data and analytics from test markets, social media, and early store rollouts. Following this strategy, it further plans, produces, and launches the final products. And, the finale often garners an overwhelming response for them.The Final Component of the Big Data StoryWhen speed multiplied with storage capabilities, thus evolved the final component of the Big Data story – the generation and collection of the data. If we still had massive room-sized calculators working as computers, we may not have come across the humongous amount of data that we see today. With the advancement in technology, we called upon ubiquitous devices. With the increase in the number of devices, we have more data being generated. We are generating data at our own pace from our own space owing to the devices that we make use of from our comfort zones. Here I tweet, there you post, while a video is being uploaded on some platform by someone from some corner of the room you are seated in.The more you inform people about what you are doing in your life, the more data you end up writing. I am happy and I share a quote on Facebook expressing my feelings; I am contributing to more data. This is how enormous amount of data is generated. The Internet-connected devices that we use support in writing data. Anything that you engage with in this digital world, the websites you browse, the apps you open on your cell phone, all the data pertaining to these can be logged in a database miles away from you.Writing data and storing it is not an arduous task anymore. At times, companies just push the value of the data to the backburner. At some point of time, this data will be fetched and cooked when they see the need for it.There are different ways to cash upon the billions of data points. Data Science puts the data into categories to get a clear picture. On a Final NoteIf you are an organization looking out to expand your horizons, being data-driven will take you miles. The application of an amalgam of Infrastructure, Software and Statistics, and the various data sources is the secret formula to successfully arrive at key business solutions. The future belongs to Data Science. Today, it is data that we see all around us. This new age sounds the bugle for more opportunities in the field of Data Science. Very soon, the world will need around one million Data Scientists.If you are keen on donning the hat of a Data Scientist, be your own architect when it comes to solving analytical problems. You need to be a highly motivated problem solver to overcome the toughest analytical challenges.
Essential Skills to Become a Data Scientist

The demand for Data Science professionals is now at an all-time high. There are companies in virtually every industry looking to extract the most value from the heaps of information generated on a daily basis.With the trend for Data Science catching up like never before, organizations are making complete use of their internal data assets to further examine the integration of hundreds of third-party data sources. What is crucial here is the role of the data scientists.Not very long back, the teams playing the key role of working on the data always found their places in the back rooms of multifold IT organizations. The teams though sitting on the backseat would help in steering the various corporate systems with the required data that acted as the fuel to keep the activities running. The critical database tasks performed by the teams responsible allowed corporate executives to report on operations activities and deliver financial results.When you take up a career in Data Science, your previous experience or skills do not matter. As a matter of fact, you would need a whole new range of skills to pursue a career in Data Science. Below are the skills required to become a top dog in Data Science.What should Data Scientists knowData scientists are expected to have knowledge and expertise in the following domains:The areas arch over dozens of languages, frameworks, and technologies that data scientists need to learn. Data scientists should always have the curiosity to amass more knowledge in their domain so that they stay relevant in this dynamic field.The world of Data Science demands certain important attributes and skills, according to IT leaders, industry analysts, data scientists, and others.How to become a Data Scientist?A majority of Data scientists already have a Master’s degree. If Master’s degree does not quench their thirst for more degrees, some even go on to acquire PhD degrees. Mind you, there are exceptions too. It isn’t mandatory that you should be an expert in a particular subject to become a Data Scientist. You could become one even with a qualification in Computer Science, Physical Sciences, Natural Sciences, Statistics or even Social Sciences. However, a degree in Mathematics and Statistics is always an added benefit for enhanced understanding of the concepts.Qualifying with a degree is not the end of the requirements. Brush up your skills by taking online lessons in a special skill set of your choice — get certified on how to use Hadoop, Big Data or R. You can also choose to enroll yourself for a Postgraduate degree in the field of Data Science, Mathematics or any other related field.Remember, learning does not end with earning a degree or certification. You need to practice what you learned — blog and share your knowledge, build an app and explore other avenues and applications of data.The Data Scientists of the modern world have a major role to play in businesses across the globe. They have the ability to extract useful insights from vast amounts of raw data using sophisticated techniques. The business acumen of the Data Scientists help a big deal in predicting what lies ahead for enterprises. 
The models that the Data Scientists create also bring out measures to mitigate potential threats if any.Take up organizational challenges with ABCDE skillsetAs a Data Scientist, you may have to face challenges while working on projects and finding solutions to problems.A = AnalyticsIf you are a Data Scientist, you are expected not just to study the data and identify the right tools and techniques; you need to have your answers ready to all the questions that come across while you are strategizing on working on a solution with or without a business model.B = Business AcumenOrganizations vouch for candidates with strong business acumen. As a Data Scientist, you are expected to showcase your skills in a way that will make the organization stand one step ahead of the competition. Undertaking a project and working on it is not the end of the path scaled by you. You need to understand and be able to make others understand how your business models influence business outcomes and how the outcomes will prove beneficial to the organization.C = CodingAnd a Data Scientist is expected to be adept at coding too. You may encounter technical issues where you need to sit and work on codes. If you know how to code, it will make you further versatile in confidently assisting your team.D = DomainThe world does not expect Data Scientists to be perfect with knowledge of all domains. However, it is always assumed that a Data Scientist has know-how of various industrial operations. Reading helps as a plus point. You can gain knowledge in various domains by reading the resources online.E = ExplainTo be a successful Data Scientist, you should be able to explain the problem you are faced with to figure out a solution to the problem and share it with the relevant stakeholders. You need to create a difference in the way you explain without leaving any communication gaps.The Important Skills for a Data ScientistLet us now understand the important skills to become an expert Data Scientist – all the skills that go in, to become one. The skills are as follows:Critical thinkingCodingMathML, DL, AICommunication1. Critical thinkingData scientists need to keep their brains racing with critical thinking. They should be able to apply the objective analysis of facts when faced with a complex problem. Upon reaching a logical analysis, a data scientist should formulate opinions or render judgments.Data scientists are counted upon for their understanding of complex business problems and the risks involved with decision-making. Before they plunge into the process of analysis and decision-making, data scientists are required to come up with a 'model' or 'abstract' on what is critical to coming up with the solution to a problem. Data scientists should be able to determine the factors that are extraneous and can be ignored while churning out a solution to a complex business problem.According to Jeffry Nimeroff, CIO at Zeta Global, which provides a cloud-based marketing platform – A data scientist needs to have experience but also have the ability to suspend belief...Before arriving at a solution, it is very important for a Data Scientist to be very clear on what is being expected and if the expected solution can be arrived at. It is only with experience that your intuition works stronger. Experience brings in benefits.If you are a novice and a problem is posed in front of you; all that the one who put the problem in front of you would get is a wide-eyed expression, perhaps. 
Instead, if you have hands-on experience of working with complex problems no matter what, you will step back, look behind at your experience, draw some inference from multiple points of view and try assessing the problem that is put forth.In simple steps, critical thinking involves the following steps:a. Describe the problem posed in front of you.b. Analyse the arguments involved – The IFs and BUTs.c. Evaluate the significance of the decisions being made and the successes or failures thereafter.2. CodingHandling a complex task might at times call for the execution of a chain of programming tasks. So, if you are a data scientist, you should know how to go about writing code. It does not stop at just writing the code; the code should be executable and should be crucial in helping you find a solution to a complex business problem.In the present scenario, Data Scientists are more inclined towards learning and becoming an expert with Python as the language of choice. There is a substantial crowd following R as well. Scala, Clojure, Java and Octave are a few other languages that find prominence too.Consider the following aspects to be a successful Data Scientist that can dab with programming skills –a) You need to deal with humongous volumes of data.b) Working with real-time data should be like a cakewalk for you.c) You need to hop around cloud computing and work your way with statistical models like the ones shown below:Different Statistical ModelsRegressionOptimizationClusteringDecision treesRandom forestsData scientists are expected to understand and have the ability to code in a bundle of languages – Python, C++ or Java.Gaining the knack to code helps Data Scientists; however, this is not the end requirement. A Data Scientist can always be surrounded by people who code.3. MathIf you have never liked Mathematics as a subject or are not proficient in Mathematics, Data Science is probably not the right career choice for you.You might own an organization or you might even be representing it; the fact is while you engage with your clients, you might have to look into many disparate issues. To deal with the issues that lay in front of you, you will be required to develop complex financial or operational models. To finally be able to build a worthy model, you will end up pulling chunks from large volumes of data. This is where Mathematics helps you.If you have the expertise in Mathematics, building statistical models is easier. Statistical models further help in developing or switching over to key business strategies. With skills in both Mathematics and Statistics, you can get moving in the world of Data Science. Spell the mantra of Mathematics and Statistics onto your lamp of Data Science, lo and behold you can be the genie giving way to the best solutions to the most complex problems.4. Machine learning, Deep Learning, AIData Science overlaps with the fields of Machine Learning, Deep Learning and AI.There is an increase in the way we work with computers, we now have enhanced connectivity; a large amount of data is being collected and industries make use of this data and are moving extremely fast.AI and deep learning may not show up in the requirements of job postings; yet, if you have AI and deep learning skills, you end up eating the big pie.A data scientist needs to be hawk-eyed and alert to the changes in the curve while research is in progress to come up with the best methodology to a problem. Coming up with a model might not be the end. 
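The statistical models listed a little earlier (regression, clustering, decision trees, random forests) can all be tried quickly in code. Below is a minimal, hedged sketch in Python using scikit-learn; the dataset is synthetic and the model choices and hyperparameters are illustrative assumptions, not a prescribed workflow.

```python
# A minimal sketch: comparing a few of the model families mentioned above
# (linear regression, a decision tree and a random forest) on synthetic data.
# The dataset and hyperparameters are invented purely for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 3))                               # three synthetic features
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] ** 2 + rng.normal(0, 2.0, 500)    # noisy synthetic target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)                                     # train on the training split
    mse = mean_squared_error(y_test, model.predict(X_test))          # evaluate on held-out data
    print(f"{name}: test MSE = {mse:.2f}")
```

Which model comes out ahead depends entirely on the data at hand, which is precisely the judgment call described next.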
A Data Scientist must be clear as to when to apply which practice to solve a problem without making it more complex. Data scientists need to understand the depth of problems before finding solutions. A data scientist need not go elsewhere to study the problems; everything that is needed to bring out the best solution is there in the data fetched. A data scientist should be aware of the computational costs involved in building an environment and of the following system boundary conditions:
a. Interpretability
b. Latency
c. Bandwidth
Studying a customer can act as a major plus point for both a data scientist and an organization, as it helps in understanding what technology to apply. No matter how far automated tools advance and how readily available open source becomes, statistical skills are considered much-needed add-ons for a data scientist. Understanding statistics is not an easy job; a data scientist needs to be competent enough to comprehend the assumptions made by the various tools and software. Experts have put forth a few important requisites for data scientists to make the best use of their models. Data scientists need to be handy with proper data interpretation techniques and ought to understand:
a. the various functional interfaces to the machine learning algorithms
b. the statistics within the methods
If you are a data scientist, try dabbing your profile with the colours of computer science skills. You must be proficient at the keyboard and have a sound knowledge of software engineering fundamentals.
5. Communication
Communication and technology form a cycle of operations in which people, applications, systems, and data are integrated, and Data Science does not stand apart from this. As a Data Scientist, you should be able to communicate with various stakeholders, and data plays a key role in that communication. Communication in Data Science ropes in the 'storytelling' ability. This helps you translate a solution you have arrived at into the action or intervention that you have put in the pipeline. As a Data Scientist, you should be adept at knitting together the data you have extracted and communicating it clearly to your stakeholders.
What does a data scientist communicate to the stakeholders?
The benefits of the data
The technology and the computational costs involved in extracting and making use of the data
The challenges posed in the form of data quality, privacy, and confidentiality
A Data Scientist also needs to keep an eye on the wider horizon for better prospects; the organization can be shown a map highlighting other areas of interest that can prove beneficial. If you are a Data Scientist with different feathers in your cap, one being that of a good communicator, you should be able to convert complex technical information into a simple and compact form before you present it to the various stakeholders. The information should highlight the challenges, the details of the data, the criteria for success and the anticipated results. If you want to excel in the field of Data Science, you must have an inquisitive bent of mind. The more questions you ask, the more information you gather and the easier it becomes to come up with paramount business models.
6. Data architecture
Let us draw some inference from the construction of a building and the role of an architect. 
Architects have the most knowledge of how the different blocks of a building go together and how the different pillars of a block make a strong support system. Just as architects manage and coordinate the entire construction process, so do Data Scientists while building business models. A Data Scientist needs to understand everything that happens to the data, from its inception to when it becomes a model and further until a decision is made based on that model. Not understanding the data architecture can have a tremendous impact on the assumptions made in the process and the decisions arrived at. If a Data Scientist is not familiar with the data architecture, the organization may end up taking wrong decisions, leading to unexpected and unfavourable results. Even a slight change within the architecture might make the situation worse for all the stakeholders involved.
7. Risk analysis, process improvement, systems engineering
A Data Scientist with sharp business acumen should have the ability to analyse business risks, suggest improvements where needed and facilitate further changes in various business processes. As a Data Scientist, you should understand how systems engineering works. If you want to be a Data Scientist and have sharp risk analysis, process improvement and systems engineering skills, you can set yourself up for a smooth sail in this vast sea of Data Science. And remember, you will no longer be a Data Scientist if you stop following scientific theories; after all, Data Science in itself is a major breakthrough in the field of Science. It is always recommended to analyse all the risks that may confront a business before embarking on a journey of model development. This helps in mitigating the risks that the organization may otherwise encounter later. For a smooth business flow, a Data Scientist should also be inclined to probe into the strategies of the various stakeholders and the problems encountered by customers. A Data Scientist should be able to get a picture of the prevailing risks, of the various systems that can have a whopping impact on the data, and of whether a model can lead to positive fruition in the form of customer satisfaction.
8. Problem-solving and strong business acumen
Data scientists are not very different from the rest of us when it comes to problem-solving; problem-solving traits are inherent in every human being. What makes a data scientist stand apart is very good problem-solving skill. We come across complex problems even in everyday situations; where we differ in solving them is in the perspectives we apply. Understanding and analysing a problem before actually solving it by pulling out all the tools in practice is what Data Scientists are good at. The approach a Data Scientist takes to solve a problem reaps more success than failure; with their approach, they bring critical thinking to the forefront. Finding a Data Scientist with skill sets at variance is a problem faced by most employers.
Technical Skills for a Data Scientist
When employers are on the hunt for the best, they look out for specialization in languages, libraries and tech tools. If a candidate comes in with experience, it helps in boosting the profile. Let us see some very important technical skills:
Python
R
SQL
Hadoop/Apache Spark
Java/SAS
Tableau
Let us briefly understand how these languages are in demand.
Python
Python is one of the most in-demand languages. It has gained immense popularity as an open-source language. 
It is widely used both by beginners and experts. Data Scientists need to have Python as one of the primary languages in their kit.RR is altogether a new programming language for statisticians. Anyone with a mathematical bent of mind can learn it. Nevertheless, if you do not appreciate the nuances of Mathematics then it’s difficult to understand R. This never means that you cannot learn it, but without having that mathematical creativity, you cannot harness the power of it.SQLStructured Query Language or SQL is also highly in demand. The language helps in interacting with relational databases. Though it is not of much prominence yet, with a know-how in SQL you can gain a stand in the job market.Hadoop & SparkBoth Hadoop and Spark are open source tools from Apache for big data.Apache Hadoop is an open source software platform. Apache Hadoop helps when you have large data sets on computer clusters built from commodity hardware and you find it difficult to store and process the data sets.Apache Spark is a lightning-fast cluster computing and data processing engine designed for fast computation. It comes with a bunch of development APIs. It supports data workers with efficient execution of streaming, machine learning or SQL workloads.Java & SASWe also have Java and SAS joining the league of languages. These are in-demand languages by large players. Employers offer whopping packages to candidates with expertise in Java and SAS.TableauTableau joins the list as an analytics platform and visualization tool. The tool is powerful and user-friendly. The public version of the tool is available for free. If you wish to keep your data private, you have to consider the costs involved too.Easy tips for a Data ScientistLet us see the in-demand skill set for a Data Scientist in brief.a. A Data Scientist should have the acumen to handle data processing and go about setting models that will help various business processes.b. A Data Scientist should understand the depth of a business problem and the structure of the data that will be used in the process of solving it.c. A Data Scientist should always be ready with an explanation on how the created business models work; even the minute details count.A majority of the crowd out there is good at Maths, Statistics, Engineering or other related subjects. However, when interviewed, they may not show the required traits and when recruited may fail to shine in their performance levels. Sometimes the recruitment process to hire a Data Scientist gets so tedious that employers end up searching with lanterns even in broad daylight. Further, the graphical representation below shows some smart tips for smart Data Scientists.Smart tips for a Data ScientistWhat employers seek the most from Data Scientists?Let us now throw some light into what employers seek the most from Data Scientists:a. A strong sense of analysisb. Machine learning is at the core of what is sought from Data Scientists.c. A Data Scientist should infer and refer to data that has been in practice and will be in practice.d. Data Scientists are expected to be adept at Machine Learning and create models predicting the performance on the basis of demand.e. And, a big NOD to a combo skill set of statistics, Computer Science and Mathematics.Following screenshot shows the requirements of a topnotch employer from a Data Scientist. 
The requirements were posted on a jobs’ listing website.Let us do a sneak peek into the same job-listing website and see the skills in demand for a Data Scientist.ExampleRecommendations for a Data ScientistWhat are some general recommendations for Data Scientists in the present scenario? Let us walk you through a few.Exhibit your demonstration skills with data analysis and aim to become learned at Machine Learning.Focus on your communication skills. You would have a tough time in your career if you cannot show what you have and cannot communicate what you know. Experts have recommended reading Made to Stick for far-reaching impact of the ideas that you generate.Gain proficiency in deep learning. You must be familiar with the usage, interest, and popularity of deep learning framework.If you are wearing the hat of a Python expert, you must also have the know-how of common python data science libraries – numpy, pandas, matplotlib, and scikit-learn.ConclusionData Science is all about contributing more data to the technologically advanced world. Make your online presence a worthy one; learn while you earn.Start by browsing through online portals. If you are a professional, make your mark on LinkedIn. Securing a job through LinkedIn is now easier than scouring through job sites.Demonstrate all the skills that you are good at on the social portals you are associated with. Suppose you write an article on LinkedIn, do not refrain from sharing the link to the article on your Facebook account.Most important of all – when faced with a complex situation, understand why and what led to the problem. A deeper understanding of a problem will help you come up with the best model. The more you empathize with a situation, the more will be your success count. And in no time, you can become that extraordinary whiz in Data Science.Wishing you immense success if you happen to choose or have already chosen Data Science as the path for your career.All the best for your career endeavour!
What is Bias-Variance Tradeoff in Machine Learning

What is Machine Learning? Machine Learning is a multidisciplinary field of study, which gives computers the ability to solve complex problems, which otherwise would be nearly impossible to be hand-coded by a human being. Machine Learning is a scientific field of study which involves the use of algorithms and statistics to perform a given task by relying on inference from data instead of explicit instructions. Machine Learning Process:The process of Machine Learning can be broken down into several parts, most of which is based around “Data”. The following steps show the Machine Learning Process. 1. Gathering Data from various sources: Since Machine Learning is basically the inference drawn from data before any algorithm can be used, data needs to be collected from some source. Data collected can be of any form, viz. Video data, Image data, Audio data, Text data, Statistical data, etc. 2. Cleaning data to have homogeneity: The data that is collected from various sources does not always come in the desired form. More importantly, data contains various irregularities like Missing data and Outliers.These irregularities may cause the Machine Learning Model(s) to perform poorly. Hence, the removal or processing of irregularities is necessary to promote data homogeneity. This step is also known as data pre-processing. 3. Model Building & Selecting the right Machine Learning Model: After the data has been correctly pre-processed, various Machine Learning Algorithms (or Models) are applied on the data to train the model to predict on unseen data, as well as to extract various insights from the data. After various models are “trained” to the data, the best performing model(s) that suit the application and the performance criteria are selected.4. Getting Insights from the model’s results: Once the model is selected, further data is used to validate the performance and accuracy of the model and get insights as to how the model performs under various conditions. 5. Data Visualization: This is the final step, where the model is used to predict unseen and real-world data. However, these predictions are not directly understandable to the user, and hence, data Visualization or converting the results into understandable visual graphs is necessary. At this stage, the model can be deployed to solve real-world problems.How is Machine Learning different from Curve Fitting? To get the similarities out of the way, both, Machine Learning and Curve Fitting rely on data to infer a model which, ideally, fits the data perfectly. The difference comes in the availability of the data. Curve Fitting is carried out with data, all of which is already available to the user. Hence, there is no question of the model to encounter unseen data.However, in Machine Learning, only a part of the data is available to the user at the time of training (fitting) the model, and then the model has to perform equally well on data that it has never encountered before. Which is, in other words, the generalization of the model over a given data, such that it is able to correctly predict when it is deployed.A high-level introduction to Bias and Variance through illustrative and applied examples Let’s initiate the idea of Bias and Variance with a case study. Let’s assume a simple dataset of predicting the price of a house based on its carpet area. Here, the x-axis represents the carpet area of the house, and the y-axis represents the price of the property. 
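Since the original article's plots are not reproduced here, the following is a minimal Python sketch of how such a carpet-area-versus-price dataset could be generated and visualized. The dataset is entirely synthetic and every number in it is an assumption made purely for illustration.

```python
# A hedged, illustrative sketch: a synthetic "carpet area vs price" dataset.
# All values are invented for illustration and do not come from the article.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
carpet_area = rng.uniform(400, 2500, size=120)             # carpet area in sq. ft. (synthetic)
price = 50 + 0.12 * carpet_area + rng.normal(0, 40, 120)   # price with added noise (synthetic)

plt.scatter(carpet_area, price, s=12)
plt.xlabel("Carpet area (sq. ft.)")
plt.ylabel("Price")
plt.title("Synthetic house-price data")
plt.show()
```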
When the data is plotted on a 2D graph, the goal is to build a model that predicts the price of a house given its carpet area. This is a rather easy problem to solve and can easily be achieved by fitting a curve to the given data points. But, for the time being, let's concentrate on solving it using Machine Learning. In order to keep this example simple and concentrate on Bias and Variance, a few assumptions are made:
Adequate data is present in order to come up with a working model capable of making relatively accurate predictions.
The data is homogeneous in nature and hence no major pre-processing steps are involved.
There are no missing values or outliers, and hence they do not interfere with the outcome in any way.
The y-axis data-points are independent of the order of the sequence of the x-axis data-points.
With the above assumptions, the data is processed to train the model using the following steps:
1. Shuffling the data: Since the y-axis data-points are independent of the order of the sequence of the x-axis data-points, the dataset is shuffled in a pseudo-random manner. This is done to avoid unnecessary patterns from being learned by the model. During the shuffling, it is imperative to keep each x-y pair intact; mixing them up would change the dataset itself and the model would learn inaccurate patterns.
2. Data Splitting: The dataset is split into three categories: Training Set (60%), Validation Set (20%), and Testing Set (20%). These three sets serve different purposes:
Training Set - This part of the dataset is used to train the model. It is also known as the Development Set.
Validation Set - This is separate from the Training Set and is only used for model selection. The model does not train or learn from this part of the dataset.
Testing Set - This part of the dataset is used for performance evaluation and is completely independent of the Training or Validation Sets. Similar to the Validation Set, the model does not train on this part of the dataset.
3. Model Selection: Several Machine Learning models are applied to the Training Set and their Training and Validation losses are determined, which then helps determine the most appropriate model for the given dataset. During this step, we assume that a polynomial equation fits the data correctly. The general equation is:
y = a0 + a1*x + a2*x^2 + … + an*x^n
The process of "training", mathematically, is nothing more than figuring out the appropriate values for the parameters a0, a1, …, an, which is done automatically by the model using the Training Set. The developer does, however, have control over how high the degree of the polynomial can be. Such settings that can be tuned by the developer are called hyperparameters, and they play a key role in deciding how well the model learns and how well the learned parameters generalize. Suppose we train two such models on the training data: a linear model that ends up with a training error of 3.6, and a higher-degree polynomial model that ends up with a training error of 1.7. Looking only at these errors, it can be concluded that the polynomial model performs significantly better than the linear model (the lower the error, the better the performance). However, when we use the same trained models on the Testing Set, the models perform very differently. 
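As a hedged sketch of the shuffle, split and fit steps just described, the snippet below continues with the synthetic carpet_area and price arrays from the earlier sketch. The 60/20/20 split and the degree-10 polynomial are assumptions made for illustration, and the exact error values will differ from the 3.6, 1.7 and 929.12 figures quoted in this article.

```python
# Illustrative only: shuffle, split 60/20/20, then fit a linear and a
# high-degree polynomial model and compare their training and test errors.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Reuse the synthetic data from the previous sketch; rescale the feature to
# thousands of sq. ft. so the high-degree polynomial stays numerically tame.
X = (carpet_area / 1000).reshape(-1, 1)
y = price

# 60% train, 20% validation, 20% test (train_test_split shuffles by default)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.6, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

linear_model = LinearRegression()
poly_model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())

for name, model in [("linear", linear_model), ("degree-10 polynomial", poly_model)]:
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")
```

The typical pattern is the one the article describes: the flexible polynomial looks better on the training split but can do much worse on the held-out test split.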
On the Testing Set, the picture changes: the Polynomial model predicts the outputs far less accurately than the Linear model. In terms of error, the total error for the Linear model stays at 3.6, while for the Polynomial model it is a whopping 929.12. Such a big difference in errors between the Training and Testing Sets clearly signifies that something is wrong with the Polynomial model. This drastic change in error is due to a phenomenon called the Bias-Variance Tradeoff.
What is "Error" in Machine Learning?
Error in Machine Learning is the difference between the expected output and the predicted output of the model. It is a measure of how well the model performs over a given set of data. There are several ways to calculate error in Machine Learning. The error is commonly represented through a loss (or cost) function; one of the most widely used loss functions is the Mean Squared Error (MSE), given by the following equation:
MSE = (1/n) * Σ (y_i − ŷ_i)², summed over the n data points, where y_i is the expected output and ŷ_i is the model's prediction.
The necessity of minimization of errors: As is obvious from the example above, the higher the error, the worse the model performs. Hence, the error of a model's predictions can be considered a performance measure: the lower the error, the better the model performs. In addition to that, a model judges its own performance and trains itself based on the error between its own output and the expected output. The primary target of the model is to minimize the error so as to find the parameters that fit the data best.
Total Error: The error mentioned above is the Total Error and consists of three components: Bias, Variance and Irreducible Error.
Total Error = Bias + Variance + Irreducible Error
Even for an ideal model, it is impossible to get rid of all types of error. The "irreducible" error is caused by the presence of noise in the data and hence cannot be removed. However, the Bias and Variance errors can be reduced to a minimum, and hence the total error can be reduced significantly.
Why is the splitting of data important?
Ideally, the complete dataset is not used to train the model. The dataset is split into three sets: Training, Validation and Testing Sets. Each of these serves a specific role in the development of a model which performs well under most conditions.
Training Set (60-80%): The largest portion of the dataset is used for training the Machine Learning model. The model extracts the features and learns to recognize the patterns in the dataset. The quality and quantity of the training set determine how well the model is going to perform.
Testing Set (15-25%): The main goal of every Machine Learning engineer is to develop a model which generalizes best over a given dataset. This is achieved by training the model(s) on a portion of the dataset and testing its performance by applying the trained model on another portion of the same/similar dataset that has not been used during training (the Testing Set). This is important since the model might perform very well on the training set but poorly on unseen data, as was the case in the example given above. 
Testing set is primarily used for model performance evaluation.Validation Set (15-25%): In addition to the above, because of the presence of more than one Machine Learning Algorithm (model), it is often not recommended to test the performance of multiple models on the same dataset and then choose the best one. This process is called Model Selection, and for this, a separate part of the training set is used, which is also known as Validation Set. A validation set behaves similar to a testing set but is primarily used in model selection and not in performance evaluation.Bias and Variance - A Technical Introduction What is Bias?Bias is used to allow the Machine Learning Model to learn in a simplified manner. Ideally, the simplest model that is able to learn the entire dataset and predict correctly on it is the best model. Hence, bias is introduced into the model in the view of achieving the simplest model possible.Parameter based learning algorithms usually have high bias and hence are faster to train and easier to understand. However, too much bias causes the model to be oversimplified and hence underfits the data. Hence these models are less flexible and often fail when they are applied on complex problems.Mathematically, it is the difference between the model’s average prediction and the expected value.What is Variance?Variance in data is the variability of the model in a case where different Training Data is used. This would significantly change the estimation of the target function. Statistically, for a given random variable, Variance is the expectation of squared deviation from its mean. In other words, the higher the variance of the model, the more complex the model is and it is able to learn more complex functions. However, if the model is too complex for the given dataset, where a simpler solution is possible, a model with high Variance causes the model to overfit. When the model performs well on the Training Set and fails to perform on the Testing Set, the model is said to have Variance.Characteristics of a biased model A biased model will have the following characteristics:Underfitting: A model with high bias is simpler than it should be and hence tends to underfit the data. In other words, the model fails to learn and acquire the intricate patterns of the dataset. Low Training Accuracy: A biased model will not fit the Training Dataset properly and hence will have low training accuracy (or high training loss). Inability to solve complex problems: A Biased model is too simple and hence is often incapable of learning complex features and solving relatively complex problems.Characteristics of a model with Variance A model with high Variance will have the following characteristics:Overfitting: A model with high Variance will have a tendency to be overly complex. This causes the overfitting of the model.Low Testing Accuracy: A model with high Variance will have very high training accuracy (or very low training loss), but it will have a low testing accuracy (or a low testing loss). Overcomplicating simpler problems: A model with high variance tends to be overly complex and ends up fitting a much more complex curve to a relatively simpler data. The model is thus capable of solving complex problems but incapable of solving simple problems efficiently.What is Bias-Variance Tradeoff? From the understanding of bias and variance individually thus far, it can be concluded that the two are complementary to each other. 
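One hedged way to see this complementarity concretely is to refit the same model on many freshly simulated noisy training sets and measure how its predictions behave at fixed test points. The sketch below uses a synthetic sine-shaped dataset and illustrative polynomial degrees; bias² is the squared gap between the average prediction and the true function, and variance is the spread of predictions across refits.

```python
# Illustrative sketch: empirically estimating bias^2 and variance of a model
# by repeatedly refitting it on freshly simulated noisy training labels.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)          # the "true" function (assumed for illustration)

x_train = rng.uniform(0, 1, 100)
x_test = np.linspace(0, 1, 50)

def fit_predict(degree, seed):
    r = np.random.default_rng(seed)
    y_noisy = true_f(x_train) + r.normal(0, 0.3, x_train.size)   # a fresh noisy training set
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train.reshape(-1, 1), y_noisy)
    return model.predict(x_test.reshape(-1, 1))

for degree in (1, 4, 15):                  # simple, moderate and overly complex models
    preds = np.array([fit_predict(degree, s) for s in range(200)])
    avg_pred = preds.mean(axis=0)
    bias_sq = np.mean((avg_pred - true_f(x_test)) ** 2)   # (average prediction - truth)^2
    variance = preds.var(axis=0).mean()                    # spread of predictions across refits
    print(f"degree {degree}: bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
```

In a run like this, the simple degree-1 model tends to show high bias and low variance, while the degree-15 model tends to show low bias and high variance.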
In other words, if the bias of a model is decreased, the variance of the model automatically increases. The vice-versa is also true: if the variance of a model decreases, the bias starts to increase. Hence, it can be concluded that it is nearly impossible to have a model with no bias and no variance, since decreasing one increases the other. This phenomenon is known as the Bias-Variance Tradeoff.
A graphical introduction to Bias-Variance Tradeoff
In order to get a clear idea about the Bias-Variance Tradeoff, let us consider a bulls-eye diagram. The central red portion of the target can be considered the location where the model correctly predicts the values; as we move away from the central red circle, the error in the prediction starts to increase. Each of the several hits on the target is achieved by a repetition of the model-building process, so each hit represents an individual realization of the model. Seen this way, bias and variance together influence the predictions of the model under different circumstances.
Another way of looking at the Bias-Variance Tradeoff graphically is to plot error, bias, and variance against the complexity of the model. Since bias is high for a simpler model and decreases with an increase in model complexity, the curve representing bias falls as the model complexity increases. Variance, on the other hand, is low for simpler models and high for more complex models, so the curve representing variance rises as the model complexity increases. On either extreme, the generalization error is quite high: both high bias and high variance lead to a higher error rate. The optimal complexity of the model lies in the middle, around the point where the bias and variance curves intersect; this region produces the least error and is preferred. Also, as discussed earlier, the model underfits in high-bias situations and overfits in high-variance situations.
Mathematical Expression of Bias-Variance Tradeoff
Let y denote the expected (true) values and f̂(x) the model's prediction for an input vector x. The relationship between the true values and the inputs can be taken as y = f(x) + e, where e is a normally distributed error term, e ~ N(0, σ²). The expected prediction error at a point x can then be decomposed as:
Error(x) = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²] + σ² = Bias² + Variance + Irreducible Error
The third term in the above equation, the irreducible error, represents the noise in the data and cannot be fundamentally reduced by any model. If, hypothetically, infinite data were available, it would be possible to tune the model to push the bias and variance terms towards zero, but this is not possible in practice. Hence, there is always a tradeoff between the minimization of bias and variance.
Detection of Bias and Variance of a model
In model building, it is imperative to be able to detect whether the model is suffering from high bias or high variance. 
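Before moving to the detection checklist, here is a hedged sketch of the error-versus-complexity picture described above: training and validation error are computed for increasing polynomial degree on a synthetic dataset. All choices (the sine-shaped data, the noise level, the range of degrees) are assumptions made for illustration.

```python
# Illustrative sketch: training vs validation error as model complexity grows.
# Low degrees tend to show high bias (both errors high); very high degrees tend
# to show high variance (training error low, validation error rising).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)   # synthetic noisy target

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=0)

for degree in range(1, 16):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train.reshape(-1, 1), y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train.reshape(-1, 1)))
    val_err = mean_squared_error(y_val, model.predict(x_val.reshape(-1, 1)))
    print(f"degree {degree:2d}: train MSE = {train_err:.3f}, validation MSE = {val_err:.3f}")
```

Reading the printed errors from low degree to high degree reproduces the left-to-right shape of the error-versus-complexity graph described above.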
The methods to detect high bias and high variance are given below:
Detection of High Bias:
The model suffers from a very high training error.
The validation error is similar in magnitude to the training error.
The model is underfitting.
Detection of High Variance:
The model suffers from a very low training error.
The validation error is very high when compared to the training error.
The model is overfitting.
Graphically, this can be seen by plotting the change in error rate against model complexity for both the training and validation errors. The left portion of such a graph corresponds to high bias: the training error is quite high along with the validation error, and the model complexity is quite low. The right portion of the graph corresponds to high variance: the training error is very low, yet the validation error is very high and keeps increasing with increasing model complexity.
A systematic approach to solving a Bias-Variance problem by Dr. Andrew Ng:
Dr. Andrew Ng proposed a simple-to-follow, step-by-step approach to detect and solve high bias and high variance errors in a model.
Detection and solution to the High Bias problem - if the training error is high:
Train longer: High bias usually means a less complex model, and hence it requires more training iterations to learn the relevant patterns. Longer training therefore sometimes resolves the error.
Train a more complex model: As mentioned above, high bias is the result of a less-than-optimal complexity in the model. Hence, to avoid high bias, the existing model can be swapped out for a more complex one.
Obtain more features: It is often possible that the existing dataset lacks the essential features required for effective pattern recognition. To remedy this, more features can be collected for the existing data, or feature engineering can be performed on existing features to extract more non-linear features.
Decrease regularization: Regularization is a process that decreases model complexity by regularizing the inputs at different stages, promoting generalization and preventing overfitting in the process. Decreasing regularization allows the model to learn the training dataset better.
New model architecture: If all of the above-mentioned methods fail to deliver satisfactory results, it is suggested to try out other new model architectures.
Detection and solution to the High Variance problem - if the validation error is high:
Obtain more data: High variance is often caused by a lack of training data. The model complexity and the quantity of training data need to be balanced: a model of higher complexity requires a larger quantity of training data. Hence, if the model is suffering from high variance, more data can reduce the variance.
Decrease the number of features: If the dataset contains too many features for each data-point, the model often starts to suffer from high variance and begins to overfit. Hence, decreasing the number of features is recommended.
Increase regularization: As mentioned above, regularization is a process to decrease model complexity. 
Hence, if the model is suffering from high variance (which is caused by an overly complex model), then an increase in regularization can decrease the complexity and help the model generalize better.
New model architecture: Similar to the solution for a model suffering from high bias, if all of the above-mentioned methods fail to deliver satisfactory results, it is suggested to try out other, newer model architectures.
Conclusion
To summarize, Bias and Variance play a major role in the training process of a model. It is desirable to reduce each of them as far as possible. However, it should be kept in mind that pushing one of them down beyond a certain point tends to push the other one up. This phenomenon is called the Bias-Variance Tradeoff and is a key consideration during model building. 