Series List
Filter

Rated 4.5/5
based on 34 customer reviews

- by Priyankur Sarkar
- 22nd Aug, 2019
- Last updated on 17th Sep, 2019
- 28 mins read

In our day-to-day lives, we are optimizing variables based on our personal decisions and we don’t even recognize the process consciously. We are constantly using optimization techniques all day long, for example, while going to work, choosing a shorter route in order to minimize traffic woes, figuring out and managing a quick walk around the campus during a snack break, or scheduling a cab in advance to reach the airport on time.

Optimization is the ultimate goal, whether you are dealing with actual events in real-life or creating a technology-based product. Optimization is at the heart of most of the statistical and machine learning techniques which are widely used in data science. To gain more knowledge and skills on data science and machine learning, join the certification course now.

Accuracy is the word with which we are most concerned, while we are dealing with problems related to machine learning and artificial intelligence. Any rate of errors cannot be tolerated while dealing with real-world problems and neither should they be compromised.

Let us consider a case of self-driving cars. The model fitted in the car detects any obstacles that come in the way and takes appropriate actions, which can be slowing down the speed or pulling on the brakes and so on. Now we need to keep this in mind that there is no human in the car to operate or withdraw the actions taken by the self-driving car. In such a scenario, suppose the model is not accurate. It will not be able to detect other cars or any pedestrians and end up crashing leading to several lives at risk.

This is where we need optimization algorithms to evaluate our model and judge whether the model is performing according to our needs or not. The evaluation can be made easy by calculating the cost function (which we will look into in a while in this article in detail). It is basically a mapping function that tells us about the difference between the desired output and what our model is computing. We can accordingly correct the model and avoid any kind of undesired activities.

Optimization may be defined as the process by which an optimum is achieved. It is all about designing an optimal output for your problems with the use of resources available. However, optimization in machine learning is slightly different. In most of the cases, we are aware of the data, the shape and size, which also helps us know the areas we need to improve. But in machine learning we do not know how the new data may look like, this is where optimization acts perfectly. Optimization techniques are performed on the training data and then the validation data set is used to check its performance.

There are a lot of advanced applications of optimization which are widely used in airway routing, market basket analysis, face recognition and so on. Machine learning algorithms such as linear regression, KNN, neural networks completely depend on optimization techniques. Here, we are going to look into one such popular optimization technique called *Gradient Descent**.*

Gradient descent is an optimization algorithm which is mainly used to find the minimum of a function. In machine learning, gradient descent is used to update parameters in a model. Parameters can vary according to the algorithms, such as *coefficients* in Linear Regression and weights in Neural Networks.

Let us relate gradient descent with a real-life analogy for better understanding. Think of a valley you would like to descend when you are blind-folded. Any sane human will take a step and look for the slope of the valley, whether it goes up or down. Once you are sure of the downward slope you will follow that and repeat the step again and again until you have descended completely (or reached the minima).

Similarly, let us consider another analogy. Suppose you have a ball and you place it on an inclined plane (at position A). As per laws, it will start rolling until it travels to a gentle plane where it will be stationary (at position B as shown in the figure below).

This is exactly what happens in gradient descent. The inclined and/or irregular is the cost function when it is plotted and the role of gradient descent is to provide direction and the velocity (learning rate) of the movement in order to attain the minima of the function i.e where the cost is minimum.

The primary goal of machine learning algorithms is always to build a model, which is basically a hypothesis which can be used to find an estimation for Y based on X. Let us consider an example of a model based on certain housing data which comprises of the sale price of the house, the size of the house etc. Suppose we want to predict the pricing of the house based on its size. It is clearly a regression problem where given some inputs, we would like to predict a continuous output.

The hypothesis is usually presented as

where the theta values are the *parameters*.

Let us look into some examples and visualize the hypothesis:

This yields h(x) = 1.5 + 0x. 0x means no slope, and y will always be the constant 1.5. This looks like:

Now let us consider,

Where, h(x) = 1 + 0.5x

The objective in the case of gradient descent is to find a line of best fit for some given *inputs*, or X values, and any number of Y values, or *outputs*. A cost function is defined as *“a function that maps an event or values of one or more variables onto a real number intuitively representing some “cost” associated with the event.”*

With a known set of inputs and their corresponding outputs, a machine learning model attempts to make predictions according to the new set of inputs.

The Error would be the difference between the two predictions.

This relates to the idea of a **Cost function or Loss function**.

A **Cost Function/Loss Function** tells us “how good” our model is at making predictions for a given set of parameters. The cost function has a curve and a gradient, the slope of this curve helps us to update our parameters and make an accurate model.

It is always the primary goal of any Machine Learning Algorithm to minimize the Cost Function. Minimizing cost functions will also result in a lower error between the predicted values and the actual values which also denotes that the algorithm has performed well in learning.

Generally, the cost function is in the form of **Y = X²**. In a Cartesian coordinate system, this represents an equation for a parabola which can be graphically represented as :

Now in order to minimize the function mentioned above, firstly we need to find the value of X which will produce the lowest value of Y (in this case it is the red dot). With lower dimensions (like 2D in this case) it becomes easier to locate the minima but it is not the same while dealing with higher dimensions. For such cases, we need to use the Gradient Descent algorithm to locate the minima.

Now a function is required which will minimize the parameters over a dataset. The most common function which is often used is the mean squared error. It measures the difference between the estimated value (the prediction) and the estimator (the dataset).

It turns out we can adjust the equation a little to make the calculation down the track a little more simple.

Now a question may arise, ** Why do we take the squared differences and simply not the absolute differences?** Because the squared differences make it easier to derive a regression line. Indeed, to find that line we need to compute the first derivative of the Cost function, and it is much harder to compute the derivative of absolute values than squared values. Also, the squared differences increase the error distance, thus, making the bad predictions more pronounced than the good ones.

The equation looks like -

Let us apply this cost function to the following data:

Here we will calculate some of the theta values and then plot the cost function by hand. Since this function passes through (0, 0), we will look only at a single value of theta. Also, let us refer to the cost function as J(ϴ) from now on.

When the value of ϴ is 1, for J(1), we get a 0. You will notice the value of J(1) gives a straight line which fits the data perfectly. Now let us try with ϴ = 0.5

The MSE function gives us a value of 0.58. Let’s plot both our values so far:

J(1) = 0

J(0.5) = 0.58

Let us go ahead and calculate some more values of J(ϴ).

Now if we join the dots carefully, we will get -

As we can see, the cost function is at a minimum when theta = 1, which means the initial data is a straight line with a slope or gradient of 1 as shown by the orange line in the above figure.

Using a trial and error method, we minimized J(ϴ). We did all of these by trying out a lot of values and with the help of visualizations. **Gradient Descent** does the same thing in a much better way, by changing the theta values or parameters until it descends to the minimum value.

You may refer below for the Python code to find out cost function:

import matplotlib.pyplot as plt import numpy as np # original data set X = [1, 2, 3] y = [1, 2, 3] # slope of best_fit_1 is 0.5 # slope of best_fit_2 is 1.0 # slope of best_fit_3 is 1.5 hyps = [0.5, 1.0, 1.5] # multiply the original X values by the theta # to produce hypothesis values for each X def multiply_matrix(mat, theta): mutated = [] for i in range(len(mat)): mutated.append(mat[i] * theta) return mutated # calculate cost by looping each sample # subtract hyp(x) from y # square the result # sum them all together def calc_cost(m, X, y): total = 0 for i in range(m): squared_error = (y[i] - X[i]) ** 2 total += squared_error return total * (1 / (2*m)) # calculate cost for each hypothesis for i in range(len(hyps)): hyp_values = multiply_matrix(X, hyps[i]) print("Cost for ", hyps[i], " is ", calc_cost(len(X), y, hyp_values))

Cost for 0.5 is 0.5833333333333333 Cost for 1.0 is 0.0 Cost for 1.5 is 0.5833333333333333

Let us now start by initializing theta0 and theta1 to any two values, say 0 for both, and go from there. The algorithm is as follows:

where α, alpha, is the **learning rate**, or how rapidly do we want to move towards the minimum. We can always overshoot if the value of α is too large.

The derivative which refers to the slope of the function is calculated. Here we calculate the partial derivative of the cost function. It helps us to know the direction (sign) in which the coefficient values should move so that they attain a lower cost on the following iteration.

Once we know the direction from the derivative, we can update the coefficient values. Now you need to specify a learning rate parameter which will control how much the coefficients can change on each update.

coefficient = coefficient – (alpha * delta)

This particular process is repeated as long as the cost of the coefficients is 0.0 or close enough to zero.

This turns out to be:

Which gives us linear regression!

**1. Batch Gradient Descent:** In this type of gradient descent, all the training examples are processed for each iteration of gradient descent. It gets computationally expensive if the number of training examples is large. This is when batch gradient descent is not preferred, rather a stochastic gradient descent or mini-batch gradient descent is used.

**Algorithm for batch gradient descent:**

Let h_{θ}(x) be the hypothesis for linear regression. Then, the cost function is given by:

Let Σ represents the sum of all training examples from i=1 to m.

Repeat {

For every j =0 …n

}

Where x_{j}^{(i)} represents the j^{th} feature of the i^{th} training example. So if m is very large, then the derivative term fails to converge at the global minimum.

**2. Stochastic Gradient Descent:** The word stochastic is related to a system or a process that is linked with a random probability. Therefore, in Stochastic Gradient Descent (SGD) samples are selected at random for each iteration instead of selecting the entire data set. When the number of training examples is too large, it becomes computationally expensive to use batch gradient descent, however, Stochastic Gradient Descent uses only a single sample, i.e., a batch size of one, to perform each iteration. The sample is randomly shuffled and selected for performing the iteration. The parameters are updated even after one iteration where only one has been processed. Thus, it gets faster than batch gradient descent.

**Algorithm for stochastic gradient descent:**

- Firstly shuffle the data set randomly in order to train the parameters evenly for each type of data.
- As mentioned above, it takes into consideration one example per iteration.

Hence,

Let (x^{(i)},y^{(i)}) be the training example

Repeat {

For i=1 to m{

For every j =0 …n

}

}

**3. Mini Batch gradient descent:** This type of gradient descent is considered to be faster than both batch gradient descent and stochastic gradient descent. Even if the number of training examples is large, it processes it in batches in one go. Also, the number of iterations are lesser in spite of working with larger training samples.

**Algorithm for mini-batch gradient descent:**

Let us consider b be the number of examples in one batch, where b<m. Now, assume b=10 and m=100.

The batch size can be adjusted. It is generally kept as a power of 2. The reason behind it is because some hardware such as GPUs achieve better run time with common batch sizes such as a power of 2.

Repeat {

For i=1,11, 21,…..,91

Let Σ be the summation from i to i+9 represented by k.

For every j =0 …n

}

For Batch Gradient Descent, the algorithm traces a straight line towards the minimum. If the cost function is convex, then it converges to a global minimum and if the cost function is not convex, then it converges to a local minimum. The learning rate is typically held constant over here.

For stochastic gradient descent and mini-batch gradient descent, the algorithm keeps on fluctuating around the global minimum instead of converging. In order to converge, the learning rate needs to be changed slowly.

There are many cases where gradient descent fails to perform well. There are mainly three reasons when this would happen:

- Data challenges
- Gradient challenges
- Implementation challenges

- The arrangement of data sometimes leads to challenges. If it is arranged in such a way that it poses a non-convex optimization problem then it becomes difficult to perform optimization using gradient descent. Gradient descent works for problems which are arranged with a well-defined convex optimization problem.
- During the optimization of a convex optimization problem, you will come across several minimal points. The lowest among all the points is called the global minimum, and other points are called the local minima. You will have to make sure you go to the global minimum and avoid local minima.
- There is also a saddle point problem. This is a situation where the gradient is zero but is not an optimal point. It cannot be avoided and is still an active part of the research.

While using gradient descent, if the execution is not proper, it leads to certain problems like vanishing gradient. This happens when the gradient is either too small or too large which results in no convergence.

- Smaller memory results in the failure of network. A lot of neural network practitioners do not pay attention but it is very important to look at the resource utilization by the network.
- Another important thing to look at is to keep track of things like floating point considerations and hardware/software prerequisites.

Let us look at some of the most commonly used gradient descent algorithms and how they are implemented.

One of the simplest forms of gradient descent technique is the Vanilla Gradient Descent. Here, vanilla means pure / without any adulteration. In this algorithm, the main feature is that small steps are taken in the direction of minima by taking the gradient of cost function.

The pseudocode for the same is mentioned below.

update = learning_rate * gradient_of_parameters parameters = parameters - update

If you see here, the parameters are updated by taking the gradient of the parameters and then the learning rate is multiplied which suggest how quickly we should go towards the minimum. Learning rate is a hyper-parameter and while choosing its value you should be careful.

In this case, we adjust the algorithm in such a manner that we are aware about the prior step before taking the next step.

The pseudocode for the same is mentioned below.

update = learning_rate * gradient velocity = previous_update * momentum parameter = parameter + velocity - update

Here, our update is the same as that of vanilla gradient descent. But we are introducing a new term called velocity, which considers the previous update and a constant which is called momentum.

ADAGRAD (Adaptive Gradient Algorithm) mainly uses an adaptive technique to learn rate updation. In this algorithm, we try to change the algorithm on the basis of how the gradient has been changing for all the previous iterations.

The pseudocode for the same is mentioned below.

grad_component = previous_grad_component + (gradient * gradient) rate_change = square_root(grad_component) + epsilon adapted_learning_rate = learning_rate * rate_change update = adapted_learning_rate * gradient parameter = parameter - update

In the above code, epsilon is a constant which is used to keep the rate of change of learning rate in check.

ADAM is another adaptive technique which is built out of ADAGRAD and further reduces its downside. In simple words you can consider it to be ADAGRAD + momentum.

The pseudocode for the same is mentioned below.

adapted_gradient = previous_gradient + ((gradient - previous_gradient) * (1 - beta1)) gradient_component = (gradient_change - previous_learning_rate) adapted_learning_rate = previous_learning_rate + (gradient_component * (1 - beta2)) update = adapted_learning_rate * adapted_gradient parameter = parameter - update

Here beta1 and beta2 are constants to keep changes in gradient and learning rate in check

In this section you will learn about some tips and tricks for getting the most out of the gradient descent algorithm for machine learning.

**Plot Cost versus Time:**It is suggested to collect and plot the cost values calculated by the algorithm for each iteration. It helps you keep track of the descent. For a well-performing gradient descent the cost always decreases in each iteration. If you see there is no decrease, reduce the learning rate.**Learning Rate:**The learning rate value is a small real value such as 0.1, 0.001 or 0.0001. Keep trying different values to check which works best for your algorithm.**Rescale Inputs:**Try to achieve a range such as [0, 1] or [-1, 1] by rescaling all the input variables. The algorithm reaches the minimum cost faster if the shape of the cost function is not distorted or skewed.**Few Passes:**Stochastic gradient descent often does not need more than 1-to-10 passes through the training dataset to converge on good or good enough coefficients.**Plot Mean Cost:**The updates for each training dataset instance can result in a noisy plot of cost over time when using stochastic gradient descent. Try to take the average over 10, 100, or 1000 updates. This will give you a better idea of the learning trend for the algorithm.

Now that we have gone through all the elements related to gradient descent, let us implement gradient descent in Python. A simple gradient Descent Algorithm is as follows:

- Obtain a function in order to minimize f(x)
- Initialize a value x from which you want to start the descent or optimization from
- Specify a learning rate which will determine how much of a step to descend by or how quickly you want to converge to the minimum value
- Find the derivative of that value x (the descent)
- Now proceed to descend by the derivative of that value and then multiply it by the learning rate
- Update the value of x with the new value descended to
- Check your stop condition in order to see whether to stop
- If condition satisfies, stop. If not, proceed to step 4 with the new x value and keep repeating the algorithm

Let us create an arbitrary loss function and try to find a local minimum value for that function by implementing a simple representation of gradient descent using Python.

import numpy as np import matplotlib.pyplot as plt %matplotlib inline

We will find the gradient descent of this function: x3 - 3x2 + 5

#creating the function and plotting it function = lambda x: (x ** 3)-(3*(x ** 2))+5 #Get 1000 evenly spaced numbers between -1 and 3 (arbitrarily chosen to ensure steep curve) x = np.linspace(-1,3,500) #Plot the curve plt.plot(x, function(x)) plt.show()

Here, we can see that our minimum value should be around 2.0

Let us now use the gradient descent to find the exact value

def deriv(x): ''' Description: This function takes in a value of x and returns its derivative based on the initial function we specified. Arguments: x - a numerical value of x Returns: x_deriv - a numerical value of the derivative of x ''' x_deriv = 3* (x**2) - (6 * (x)) return x_deriv def step(x_new, x_prev, precision, l_r): ''' Description: This function takes in an initial or previous value for x, updates it based on steps taken via the learning rate and outputs the minimum value of x that reaches the precision satisfaction. Arguments: x_new - a starting value of x that will get updated based on the learning rate x_prev - the previous value of x that is getting updated to the new one precision - a precision that determines the stop of the stepwise descent l_r - the learning rate (size of each descent step) Output: 1. Prints out the latest new value of x which equates to the minimum we are looking for 2. Prints out the number of x values which equates to the number of gradient descent steps 3. Plots a first graph of the function with the gradient descent path 4. Plots a second graph of the function with a zoomed in gradient descent path in the important area ''' # create empty lists where the updated values of x and y wil be appended during each iteration x_list, y_list = [x_new], [function(x_new)] # keep looping until your desired precision while abs(x_new - x_prev) > precision: # change the value of x x_prev = x_new # get the derivation of the old value of x d_x = - deriv(x_prev) # get your new value of x by adding the previous, the multiplication of the derivative and the learning rate x_new = x_prev + (l_r * d_x) # append the new value of x to a list of all x-s for later visualization of path x_list.append(x_new) # append the new value of y to a list of all y-s for later visualization of path y_list.append(function(x_new)) print ("Local minimum occurs at: "+ str(x_new)) print ("Number of steps: " + str(len(x_list))) plt.subplot(1,2,2) plt.scatter(x_list,y_list,c="g") plt.plot(x_list,y_list,c="g") plt.plot(x,function(x), c="r") plt.title("Gradient descent") plt.show() plt.subplot(1,2,1) plt.scatter(x_list,y_list,c="g") plt.plot(x_list,y_list,c="g") plt.plot(x,function(x), c="r") plt.xlim([1.0,2.1]) plt.title("Zoomed in Gradient descent to Key Area") plt.show()

#Implement gradient descent (all the arguments are arbitrarily chosen) step(0.5, 0, 0.001, 0.05)

Local minimum occurs at: 1.9980265135950486

Number of steps: 25

In this article, you have learned about gradient descent for machine learning. Here we tried to cover most of the topics. To learn more about machine learning algorithms in-depth, click here. Let us summarize all that we have covered in this article.

- Optimization is the heart and soul of machine learning.
- Gradient descent is a simple optimization technique which can be used with other machine learning algorithms.
- Batch gradient descent refers to calculating the derivative from all training data before calculating an update.
- Stochastic gradient descent refers to calculating the derivative from each training data instance and calculating the update immediately.

If you are inspired by the opportunities provided by Data Science, enrol in our Data Science and Machine Learning Courses for more lucrative career options in this landscape.

Rated 4.5/5
based on 12 customer reviews

7942

- by Priyankur Sarkar
- 22 Oct 2019
- 10 mins read

Ensemble learning is a strategy in which a group o... Read More

Rated 4.5/5
based on 12 customer reviews

16648

- by Priyankur Sarkar
- 14 Oct 2019
- 8 mins read

In today’s world, innovations happen on a daily ... Read More

Rated 4.0/5
based on 67 customer reviews

27729

- by Priyankur Sarkar
- 09 Oct 2019

While many classifiers exist that can classify lin... Read More

Subscribe to our newsletter.

## Join the Discussion

Your email address will not be published. Required fields are marked *