Ashish is a techology consultant with 13+ years of experience and specializes in Data Science, the Python ecosystem and Django, DevOps and automation. He specializes in the design and delivery of key, impactful programs.

- Home
- Blog
- Data Science
- Gradient Descent in Machine Learning: What & How Does It Work

HomeBlogData ScienceGradient Descent in Machine Learning: What & How Does It Work

Share

Published

07th Sep, 2023

Views

Read TimeRead it in

17 Mins

Machine learning Series

Filter

In this article

In our day-to-day lives, we optimize variables based on our personal decisions and don’t even recognize the process consciously. We are constantly using optimization techniques all day long, for example, while going to work, choosing a shorter route to minimize traffic woes, figuring out and managing a quick walk around the campus during a snack break, or scheduling a cab in advance to reach the airport on time. To learn more about Linear Discriminant Analysis in machine learning, click here.

Optimization is the goal, whether you are dealing with actual events in real-life or creating a technology-based product. Optimization is at the heart of most of the statistical and Machine Learning techniques which are widely used in data science. Gain more knowledge and skills in data science and machine learning by signing up for the complete Machine Learning course with Python.

Gradient Descent is an iterative process that finds the minima of a function. Gradient descent is an optimization algorithm mainly used to find the minimum of a function. In machine learning, gradient descent is used to update parameters in a model. Parameters can vary according to the algorithms, such as *coefficients* in Linear Regression and weights in Neural Networks.

Let’s take an example of a Simple Linear regression problem where our aim is to predict the dependent variable(y) when only one independent variable is given. for the above linear regression model, the equation of the line would be as follows.

In the above equation, y is the dependent variable,

- x is the independent variable,
- m is the slope of the line and
- c is the intercept on the y-axis by the line.

For this explanation, we will take the loss function as the sum of squared error. and the equation of the loss function will be as follows.

Here we are required to optimize the value of ‘m’ and ‘c’ in order to minimize the Loss function. as y_predicted is the output given by the Linear Regression Equation, therefore Loss at any given point can be given by

In order to find out the negative of the slope, we proceed by finding the partial derivatives with respect to both ‘m’ and ‘c’

partial derivatives w.r.t m and c

*When 2 or more than 2 partial derivatives are done on the same equation w.r.t to 2 or more than 2 different variables, it is known as Gradient.*

After performing partial derivatives w.r.t to ‘m’ and ‘c’ we obtain 2 equations as given above; when some value of ‘m’ and ‘c’ is given and is summed across all the data points, we obtain the negative side of the slope.

The next step is to assume a Learning Rate, which is generally denoted by ‘α’ (alpha). In most cases, the Learning Rate is set very close to 0 e.g., 0.001 or 0.005.

*A small learning rate will result in too many steps by the gradient decent algorithm and if a large value of ‘α’ is selected, it may result in the model to never converging at the minima.*

Next is to determine Step Size based on our Learning Rate. Step Size can be defined as

finding the next points using the learning rate

This will give us 2 points which will represent the updated value of ‘m’ and ‘c’.

We Iterate over the steps of finding the negative of the slope and then update the value of ‘m’ and ‘c’ until we reach or converge on our minima.

There are three popular types of gradient descent that mainly differ in the amount of data they use:

In this type of gradient descent, all the training examples are processed for each iteration of gradient descent. It gets computationally expensive if the number of training examples is large. This is when batch gradient descent is not preferred. Rather a stochastic gradient descent or mini-batch gradient descent is used.

**Algorithm for batch gradient descent:**

Let hθ(x) be the hypothesis for linear regression. Then, the cost function is given by:

Let Σ represents the sum of all training examples from i=1 to m.

For every j =0 …n

}

Where xj(i) represents the jth feature of the ith training example, so if m is very large, then the derivative term fails to converge at the global minimum.

The word stochastic is related to a system or process linked with a random probability. Therefore, in Stochastic Gradient Descent (SGD), samples are selected at random for each iteration instead of selecting the entire data set. When the number of training examples is too large, it becomes computationally expensive to use batch gradient descent. However, Stochastic Gradient Descent uses only a single sample, i.e., a batch size of one, to perform each iteration. The sample is randomly shuffled and selected for performing the iteration. The parameters are updated even after one iteration, where only one has been processed. Thus, it gets faster than batch gradient descent.

**Algorithm for stochastic gradient descent:**** **

- Firstly shuffle the data set randomly in order to train the parameters evenly for each type of data.
- As mentioned above, it takes into consideration one example per iteration.

Hence,

Let (x(i),y(i)) be the training example

Repeat {

For i=1 to m{

For every j =0 …n

}

}

This type of gradient descent is faster than both batch gradient descent and stochastic gradient descent. Even if the number of training examples is large, it processes it in batches in one go. Also, the number of iterations is lesser despite working with larger training samples.

**Algorithm for mini-batch gradient descent:**

Let us consider b be the number of examples in one batch, where b<m. Now, assume b=10 and m=100.

The batch size can be adjusted. It is generally kept as a power of 2. The reason behind it is because some hardware such as GPUs achieve better run time with common batch sizes such as a power of 2.

Repeat { For i=1,11, 21,…..,91

Let Σ be the summation from i to i+9 represented by k.

**Convergence trends in different variants of Gradient Descent**

For Batch Gradient Descent, the algorithm traces a straight line toward the minimum. If the cost function is convex, then it converges to a global minimum, and if the cost function is not convex, then it converges to a local minimum. The learning rate is typically held constant over here.

For stochastic gradient descent and mini-batch gradient descent, the algorithm keeps on fluctuating around the global minimum instead of converging. In order to converge, the learning rate needs to be changed slowly.

You might recall the following formula for the slope of a line, which is y = mx + b, where *m* represents the slope and *b* is the intercept on the y-axis.

You may also recall plotting a scatterplot in statistics and finding the line of best fit, which required calculating the error between the actual output and the predicted output (y-hat) using the mean squared error formula. The gradient descent algorithm behaves similarly, but it is based on a convex function, such as the one below:

The starting point is just an arbitrary point for us to evaluate the performance. From that starting point, we will find the derivative (or slope), and from there, we can use a tangent line to observe the steepness of the slope. The slope will inform the updates to the parameters i.e., the weights and bias. The slope at the starting point will be steeper, but as new parameters are generated, the steepness should gradually reduce until it reaches the lowest point on the curve, known as the point of convergence.

Similar to finding the line of best fit in linear regression, the goal of gradient descent is to minimize the cost function or the error between predicted and actual y. To do this, requires two data points a direction and a learning rate. These factors determine the partial derivative calculations of future iterations, allowing it to arrive gradually at the local or global minimum (i.e. point of convergence). More detail on these components can be found below:

It is also referred to as step size or alpha. The learning rate is the size of the steps that are taken to reach the minimum. This is typically a small value, and it is evaluated and updated based on the behavior of the cost function. High learning rates result in larger steps but risk overshooting the minimum. Conversely, a low learning rate has small step sizes. While it has the advantage of more precision, the number of iterations compromises overall efficiency as this takes more time and computations to reach the minimum.

It measures the difference, or error, between actual y and predicted y at its current position. This improves the machine learning model’s efficacy by providing feedback to the model so that it can adjust the parameters to minimize the error and find the local or global minimum. It continuously iterates, moving along the direction of the steepest descent (or the negative gradient) until the cost function is close to or at zero. At this point, the model will stop learning. Additionally, while the terms cost function and loss function, are considered synonymous, there is a slight difference between them. It’s worth noting that a loss function refers to the error of one training example, while a cost function calculates the average error across an entire training set. To solve the gradient descent algorithm problems by building your skill set, you could enroll in the Best Data Science Courses Online.

Gradient descent is highly used in supervised learning to minimize the error function and find the optimal values for the parameters. Various extensions have been designed for the gradient descent algorithms. Some of them are discussed below:

This method is used to accelerate the gradient descent algorithm by taking into consideration the exponentially weighted average of the gradients. Using averages makes the algorithm converge towards the minima faster, as the gradients towards the uncommon directions are canceled out. The pseudocode for the momentum method is given below.

V = 0

for each iteration i:

compute dW V = β V + (1 - β) dW W = W - α V

**V** and **dW** are analogous to velocity and acceleration, respectively. α is the learning rate, and β is analogous to momentum normally kept at 0.9. Physics interpretation is that the velocity of a ball rolling downhill builds up momentum according to the direction of slope(gradient) of the hill and therefore helps in better arrival of the ball at a minimum value (in our case — at a minimum loss).

RMSprop was proposed by the University of Toronto’s Geoffrey Hinton. The intuition is to apply an exponentially weighted average method to the second moment of the gradients (dW2). The pseudocode for this is as follows:

S = 0

for each iteration i

compute dW S = β S + (1 - β) dW2 W = W - α dW⁄√S + ε

Adam optimization algorithm incorporates the momentum method and RMSprop, along with bias correction. The pseudocode for this approach is as follows:

V = 0 S = 0

for each iteration i

compute dW V = β1 S + (1 - β1) dW S = β2 S + (1 - β2) dW2 V = V⁄{1 - β1i} S = S⁄{1 - β2i} W = W - α V⁄√S + ε

Kingma and Ba, the proposers of Adam, recommended the following values for the hyperparameters.

α = 0.001 β1 = 0.9 β2 = 0.999 ε = 10-8

There are many cases where gradient descent fails to perform well. There are mainly three reasons why this would happen:

- Data challenges
- Gradient challenges
- Implementation challenges

- The arrangement of data sometimes leads to challenges. If it is arranged so that it poses a non-convex optimization problem, then it becomes difficult to perform optimization using gradient descent. Gradient descent works for problems that are arranged with a well-defined convex optimization problem.
- During the optimization of a convex optimization problem, you will come across several minimal points. The lowest among all the points is called the global minimum, and the other points are called the local minima. You will have to make sure you go to the global minimum and avoid local minima.
- There is also a saddle point problem. This is a situation where the gradient is zero but is not an optimal point. It cannot be avoided and is still an active part of the research.

- While using gradient descent in machine learning, if the execution is not proper, it leads to certain problems like vanishing gradient. This happens when the gradient is too small or too large, resulting in no convergence.

- Smaller memory results in the failure of network. A lot of neural network practitioners do not pay attention, but it is very important to look at the resource utilization by the network.
- Another important thing to look at is to keep track of things like floating point considerations and hardware/software prerequisites.

cur_x = 3 # The algorithm starts at x=3 rate = 0.01 # Learning rate precision = 0.000001 #This tells us when to stop the algorithm previous_step_size = 1 # max_iters = 10000 # maximum number of iterations iters = 0 #iteration counter df = lambda x: 2*(x+5) #Gradient of our function

Stop loop when difference between x values from 2 consecutive iterations is less than 0.000001 or when number of iterations exceeds 10,000

while previous_step_size > precision and iters < max_iters:

prev_x = cur_x #Store current x value in prev_x cur_x = cur_x - rate * df(prev_x) #Grad descent previous_step_size = abs(cur_x - prev_x) #Change in x iters = iters+1 #iteration count print("Iteration",iters,"\nX value is",cur_x) #Print iterations print("The local minimum occurs at", cur_x)

**Output **: From the output below, we can observe the x values for the first 10 iterations- which can be cross checked with our calculation above. The algorithm runs for 595 iterations before it terminates. The code and solution is embedded below for reference.

This section lists some tips and tricks for getting the most out of the gradient descent algorithm in machine learning.

Collect and plot the cost values calculated by the algorithm for each iteration. The expectation for a well-performing gradient descent run is a decrease in cost for each iteration. If it does not decrease, try reducing your learning rate.

The learning rate value is a small real value such as 0.1, 0.001 or 0.0001. Try different values for your problem and see which works best.

The algorithm will reach the minimum cost faster if the shape of the cost function is not skewed and distorted. You can achieve this by rescaling all the input variables (X) to the same range, such as [0, 1] or [-1, 1].

Stochastic gradient descent often does not need more than 1-to-10 passes through the training dataset to converge on good or good enough coefficients.

The updates for each training dataset instance can result in a noisy plot of cost over time when using stochastic gradient descent. Taking the average over 10, 100, or 1000 updates can give you a better idea of the learning trend for the algorithm.

This brings us to the end of this article, where we have learned about What is Gradient Descent in machine learning and how it works. Its various types of algorithms, challenges we face with gradient descent, gradient descent implementation in python

Code Implementation, examples, and tips to increase its accuracy. To learn more about machine learning algorithms in-depth. Let us summarize all that we have covered in this article.

Machine learning provides a wide range of career opportunities for Data Scientists. Check out KnowledgeHut’s Advanced Machine Learning with Python course that includes a hands-on approach to data science problems and opens gates for more lucrative career options in this landscape.

Suppose you are at the top of a mountain and want to reach the base camp, which is all the way down at the lowest point of the mountain. Also, due to the bad weather, the visibility is low, and you cannot see the path at all. How would you reach the base camp?

One of the ways is to use your feet to know where the land tends to descend. This will give an idea in what direction; the steep is low, and you should take your first step. If you follow the descending path until you encounter a plain area or an ascending path, it is very likely you will reach the base camp.

But what if there is a slight rise in the ground when you are going downhill? You would immediately stop if you reached the base camp (global minima), but in reality, you are still stuck at the mountain at global a local minima.

Backpropagation is an algorithm that efficiently calculates the gradient of the cost function with respect to the current weight in the neural network. Gradient descent is an optimizer that is used to update weights in the neural network. Both work simultaneously to minimize the cost function.

There is a misconception among people; they say that the model was trained with a backpropagation algorithm, but they forgot to specify the optimizer they used. Instead, one should say, “used backpropagation as gradient computing technique with SGD optimizer”

Gradient descent is an optimization algorithm used to minimize some functions by iteratively moving in the direction of the steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model.

Gradients point in the direction of the steepest increases in the loss function. Gradient descent reduces loss by moving in the direction of the negative gradient as quickly as possible.

Mini Batch gradient descent is the gradient descent that works faster than both batch gradient descent and stochastic gradient descent.

Name | Date | Fee | Know more |
---|

Course Advisor