

What is Gradient Descent in Machine Learning?

In our day-to-day lives we optimize constantly without consciously recognizing the process: choosing a shorter route to work to avoid traffic, squeezing a quick walk around the campus into a snack break, or scheduling a cab in advance to reach the airport on time.

Optimization is the ultimate goal, whether you are dealing with real-life events or building a technology product, and it sits at the heart of most of the statistical and machine learning techniques used in data science.

Optimization for Machine Learning

Accuracy is our chief concern when dealing with machine learning and artificial intelligence problems: in real-world applications, errors can have serious consequences and cannot be tolerated.


Let us consider the case of a self-driving car. The model fitted in the car detects obstacles in its path and takes appropriate action, such as slowing down or applying the brakes. Keep in mind that there is no human in the car to override the actions it takes. If the model is inaccurate, it may fail to detect other cars or pedestrians and end up crashing, putting lives at risk.

This is where we need optimization algorithms to evaluate our model and judge whether it is performing according to our needs. The evaluation is done by calculating the cost function (which we will look into in detail later in this article). The cost function is a mapping that tells us the difference between the desired output and what our model actually computes, so we can correct the model accordingly and avoid undesired behavior.

Optimization may be defined as the process by which an optimum is achieved: designing the best possible output for your problem with the resources available. Optimization in machine learning is slightly different, however. In most classical settings we know the data, its shape and its size, which tells us where to improve. In machine learning we do not know what new data will look like, so optimization techniques are applied to the training data and a separate validation set is used to check performance.

Optimization has many advanced applications, including airline routing, market basket analysis and face recognition, and machine learning algorithms such as linear regression, KNN and neural networks depend on it heavily. Here, we are going to look into one such popular optimization technique called gradient descent.

What is Gradient Descent?

Gradient descent is an optimization algorithm used to find the minimum of a function. In machine learning, it is used to update the parameters of a model; depending on the algorithm these may be the coefficients of a linear regression or the weights of a neural network.

Let us relate gradient descent to a real-life analogy for better understanding. Imagine descending a valley while blindfolded. You would take a step and feel for the slope, and once sure of the downward direction, follow it, repeating the process until you have descended completely (reached the minimum).

Gradient Descent in Machine Learning:- Valley, Slope

Similarly, consider another analogy. Suppose you place a ball on an inclined plane (at position A). By the laws of motion, it will roll until it reaches a flat region, where it comes to rest (position B, as shown in the figure below).

Gradient Descent in Machine Learning:- Ball placed on an inclined plane

This is exactly what happens in gradient descent. The inclined, irregular surface is the plot of the cost function, and the role of gradient descent is to provide the direction and the velocity (the learning rate) of the movement needed to attain the minimum of the function, i.e., where the cost is lowest.

The graphical representation of Gradient Descent in Machine Learning

How does Gradient Descent work?

The primary goal of a machine learning algorithm is to build a model: a hypothesis that can be used to estimate Y based on X. Consider a model built on housing data comprising the sale price, the size of the house and so on. Suppose we want to predict the price of a house from its size. This is clearly a regression problem: given some inputs, we would like to predict a continuous output.

The hypothesis is usually presented as

hθ(x) = θ0 + θ1x

where the theta values are the parameters.

Let us look into some examples and visualize the hypothesis:

θ0 = 1.5, θ1 = 0

This yields h(x) = 1.5 + 0x. The 0x term means there is no slope, so y is always the constant 1.5. This looks like:

bar graph of hypothesis with no slope

Now let us consider,

θ0 = 1, θ1 = 0.5

Bar Graph of Hypothesis with slope

This gives h(x) = 1 + 0.5x.
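The two hypotheses above can be sketched in a few lines of Python (a minimal illustration; the `hypothesis` function name is ours):

```python
# Univariate linear regression hypothesis: h(x) = theta0 + theta1 * x
def hypothesis(theta0, theta1, x):
    return theta0 + theta1 * x

# First example: theta0 = 1.5, theta1 = 0 gives the flat line y = 1.5
print(hypothesis(1.5, 0, 4))   # 1.5, regardless of x

# Second example: theta0 = 1, theta1 = 0.5 gives a sloped line
print(hypothesis(1, 0.5, 2))   # 1 + 0.5 * 2 = 2.0
```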

Cost Function

The objective in the case of gradient descent is to find a line of best fit for some given inputs, or X values, and any number of Y values, or outputs. A cost function is defined as “a function that maps an event or values of one or more variables onto a real number intuitively representing some “cost” associated with the event.”

With a known set of inputs and their corresponding outputs, a machine learning model attempts to make predictions for a new set of inputs.

Machine Learning Process

The error is the difference between the predicted value and the actual value.


This relates to the idea of a Cost function or Loss function.

A cost function (or loss function) tells us "how good" our model is at making predictions for a given set of parameters. The cost function has a curve and a gradient, and the slope of this curve tells us how to update our parameters to make the model more accurate.

Minimizing the Cost Function

The primary goal of any machine learning algorithm is to minimize the cost function. A lower cost means a smaller error between the predicted and actual values, which indicates that the algorithm has learned well.

How do we actually minimize any function?

Consider a cost function of the form Y = X². In a Cartesian coordinate system, this is the equation of a parabola, which can be represented graphically as:

Parabola

Now in order to minimize the function mentioned above, firstly we need to find the value of X which will produce the lowest value of Y (in this case it is the red dot). With lower dimensions (like 2D in this case) it becomes easier to locate the minima but it is not the same while dealing with higher dimensions. For such cases, we need to use the Gradient Descent algorithm to locate the minima.

Now a function is required which will minimize the cost over a dataset with respect to the parameters. The most common function used is the mean squared error. It measures the average squared difference between the estimated values (the predictions) and the actual values (from the dataset).

MSE = (1/m) Σ from i=1 to m of (h(x(i)) − y(i))²

Mean Squared Error

It turns out we can adjust the equation slightly, dividing by 2, to make the calculation of the derivative down the track a little simpler.

Now a question may arise: why do we take the squared differences and not simply the absolute differences? Because the squared differences make it easier to derive a regression line. To find that line we need to compute the first derivative of the cost function, and it is much harder to compute the derivative of absolute values than of squared values. Also, squaring amplifies larger error distances, making the bad predictions more pronounced than the good ones.
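The derivative argument can be made concrete with a tiny sketch (the function names are ours, purely illustrative): the squared error has a smooth gradient proportional to the error, while the absolute error's gradient has constant magnitude and a kink at zero.

```python
# Gradient of the squared error (p - y)^2 with respect to the prediction p:
# 2 * (p - y) -- smooth, and proportional to the size of the error.
def squared_error_grad(p, y):
    return 2 * (p - y)

# Gradient of the absolute error |p - y|: just the sign of the error,
# the same magnitude for tiny and huge mistakes, and undefined at p == y.
def absolute_error_grad(p, y):
    return 1.0 if p > y else -1.0

for p in [2.9, 3.1, 6.0]:
    print(p, squared_error_grad(p, 3.0), absolute_error_grad(p, 3.0))
```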

The equation looks like -

J(θ) = (1/2m) Σ from i=1 to m of (h(x(i)) − y(i))²

Mean Squared Error (with the 1/2 adjustment)

Let us apply this cost function to the following data:

Data Set before applying the cost function.

Here we will calculate the cost for some theta values by hand and then plot the cost function. Since this data passes through (0, 0), we fix θ0 = 0 and look only at a single parameter, the slope. From now on, let us refer to the cost function as J(ϴ).

When the value of ϴ is 1, J(1) = 0. You will notice that ϴ = 1 gives a straight line which fits the data perfectly. Now let us try ϴ = 0.5.

J(0.5)

The MSE function gives us a value of 0.58. Let’s plot both our values so far:

J(1) = 0

J(0.5) = 0.58

With J(1) and J(0.5)

Let us go ahead and calculate some more values of J(ϴ).

Cost function applied data set graph-3

Now if we join the dots carefully, we will get -

Visualizing the cost function J(ϴ)

As we can see, the cost function is at a minimum when theta = 1, which means the initial data is a straight line with a slope or gradient of 1 as shown by the orange line in the above figure.

Using a trial and error method, we minimized J(ϴ). We did all of these by trying out a lot of values and with the help of visualizations. Gradient Descent does the same thing in a much better way, by changing the theta values or parameters until it descends to the minimum value.

You may refer below for the Python code to find out cost function:

import matplotlib.pyplot as plt
import numpy as np

# original data set
X = [1, 2, 3]
y = [1, 2, 3]

# slope of best_fit_1 is 0.5
# slope of best_fit_2 is 1.0
# slope of best_fit_3 is 1.5

hyps = [0.5, 1.0, 1.5]

# multiply the original X values by theta
# to produce hypothesis values for each X
def multiply_matrix(mat, theta):
    mutated = []
    for i in range(len(mat)):
        mutated.append(mat[i] * theta)
    return mutated

# calculate cost by looping over each sample:
# subtract hyp(x) from y, square the result, sum them all together
def calc_cost(m, y, hyp):
    total = 0
    for i in range(m):
        squared_error = (y[i] - hyp[i]) ** 2
        total += squared_error
    return total * (1 / (2 * m))

# calculate cost for each hypothesis
for i in range(len(hyps)):
    hyp_values = multiply_matrix(X, hyps[i])
    print("Cost for", hyps[i], "is", calc_cost(len(X), y, hyp_values))

Output:

Cost for 0.5 is 0.5833333333333333
Cost for 1.0 is 0.0
Cost for 1.5 is 0.5833333333333333

Learning Rate

Let us now start by initializing theta0 and theta1 to any two values, say 0 for both, and go from there. The algorithm is as follows:

Gradient Descent


where α (alpha) is the learning rate, which determines how rapidly we move towards the minimum. If the value of α is too large, we can overshoot the minimum.

Big Learning Rate vs Small Learning Rate
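The overshooting behaviour is easy to reproduce on the simple cost Y = X² from earlier (the function name `descend` is ours, purely illustrative):

```python
# Minimize f(x) = x^2, whose gradient is 2x, with a fixed learning rate.
def descend(alpha, x=1.0, steps=20):
    for _ in range(steps):
        x = x - alpha * (2 * x)   # x := x - alpha * f'(x)
    return x

print(descend(0.1))   # small alpha: x shrinks towards the minimum at 0
print(descend(1.1))   # large alpha: every step overshoots and x blows up
```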

The derivative, which refers to the slope of the function, is calculated. Here we calculate the partial derivative of the cost function. It tells us the direction (sign) in which the coefficient values should move so that they attain a lower cost on the following iteration.

Partial derivative of the cost function which we need to calculate

Once we know the direction from the derivative, we can update the coefficient values. Now you need to specify a learning rate parameter which will control how much the coefficients can change on each update.

coefficient = coefficient – (alpha * delta)

This process is repeated until the cost of the coefficients is 0.0, or close enough to zero.

This turns out to be:

Image from Andrew Ng’s machine learning course

Which gives us linear regression!

Linear Regression

Types of Gradient Descent Algorithms

Gradient descent variants’ trajectory towards the minimum

1. Batch Gradient Descent: In this type of gradient descent, all the training examples are processed for each iteration of gradient descent. This gets computationally expensive if the number of training examples is large. In such cases batch gradient descent is not preferred; rather, stochastic gradient descent or mini-batch gradient descent is used.

Algorithm for batch gradient descent:

Let hθ(x) be the hypothesis for linear regression, and let Σ represent the sum over all training examples from i=1 to m. Then, the cost function is given by:

J(θ) = (1/2m) Σ (hθ(x(i)) − y(i))²

Repeat {

    θj := θj − (α/m) Σ (hθ(x(i)) − y(i)) xj(i)    For every j = 0 … n

}

Where xj(i) represents the jth feature of the ith training example. If m is very large, evaluating the summation in the derivative term for every single update becomes computationally very expensive, and convergence towards the global minimum is slow.
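The batch update can be sketched for the single-feature hypothesis h(x) = θ0 + θ1x as follows (all names here are ours, purely illustrative):

```python
# Batch gradient descent for h(x) = theta0 + theta1 * x:
# every training example contributes to every single update.
def batch_gradient_descent(X, y, alpha=0.1, iterations=1000):
    theta0, theta1 = 0.0, 0.0
    m = len(X)
    for _ in range(iterations):
        errors = [theta0 + theta1 * X[i] - y[i] for i in range(m)]
        grad0 = sum(errors) / m
        grad1 = sum(errors[i] * X[i] for i in range(m)) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# the data set used earlier in the article: y = x,
# so theta0 should approach 0 and theta1 should approach 1
print(batch_gradient_descent([1, 2, 3], [1, 2, 3]))
```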

2. Stochastic Gradient Descent: The word stochastic relates to a system or process linked with random probability. Therefore, in Stochastic Gradient Descent (SGD) a sample is selected at random for each iteration instead of using the entire data set. When the number of training examples is too large, it becomes computationally expensive to use batch gradient descent; Stochastic Gradient Descent instead uses only a single sample, i.e., a batch size of one, to perform each iteration. The data is randomly shuffled and a sample is selected for performing the iteration. The parameters are updated after every iteration, in which only one example has been processed. Thus, it is faster than batch gradient descent.

Stochastic gradient descent

Algorithm for stochastic gradient descent:

  1. Firstly shuffle the data set randomly in order to train the parameters evenly for each type of data.
  2. As mentioned above, it takes into consideration one example per iteration.

Hence,
Let (x(i), y(i)) be a training example. The cost of a single example is:

cost(θ, (x(i), y(i))) = (1/2) (hθ(x(i)) − y(i))²

J(θ) = (1/m) Σ from i=1 to m of cost(θ, (x(i), y(i)))

Repeat {
    For i = 1 to m {

        θj := θj − α (hθ(x(i)) − y(i)) xj(i)

        For every j = 0 … n
    }
}
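A per-example sketch of this update for the single-feature hypothesis h(x) = θ0 + θ1x (all names are ours, purely illustrative):

```python
import random

# Stochastic gradient descent for h(x) = theta0 + theta1 * x:
# the data is shuffled, then the parameters are updated
# after every single example.
def sgd(X, y, alpha=0.05, epochs=100, seed=0):
    rng = random.Random(seed)
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)              # step 1: shuffle the data set
        for i in indices:                 # step 2: one example per update
            error = theta0 + theta1 * X[i] - y[i]
            theta0 -= alpha * error
            theta1 -= alpha * error * X[i]
    return theta0, theta1

# same toy data as before: theta1 should approach 1
print(sgd([1, 2, 3], [1, 2, 3]))
```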

3. Mini-batch gradient descent: This type of gradient descent is often faster than both batch gradient descent and stochastic gradient descent. It processes the training examples in small batches, so each update stays cheap even when the number of training examples is large, and fewer updates are needed than with stochastic gradient descent.

Mini Batch gradient descent graph in Machine Learning

Algorithm for mini-batch gradient descent:

Let b be the number of examples in one batch, where b < m. Now, assume b = 10 and m = 100.
The batch size can be adjusted. It is generally kept as a power of 2. The reason is that some hardware, such as GPUs, achieves better run time with common batch sizes such as powers of 2.

Repeat {

    For i = 1, 11, 21, ….., 91 {

        θj := θj − (α/b) Σ (hθ(x(k)) − y(k)) xj(k),   where Σ is the summation over k from i to i+9

        For every j = 0 … n
    }
}
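A sketch of the batched update for the same single-feature hypothesis (names, data, and batch size here are ours, chosen arbitrarily):

```python
# Mini-batch gradient descent for h(x) = theta0 + theta1 * x:
# each update averages the gradient over a small batch of b examples.
def minibatch_gd(X, y, b=2, alpha=0.05, epochs=1000):
    theta0, theta1 = 0.0, 0.0
    m = len(X)
    for _ in range(epochs):
        for start in range(0, m, b):
            batch = range(start, min(start + b, m))
            errors = [theta0 + theta1 * X[i] - y[i] for i in batch]
            n = len(errors)
            theta0 -= alpha * sum(errors) / n
            theta1 -= alpha * sum(e * X[i] for e, i in zip(errors, batch)) / n
    return theta0, theta1

# toy data y = x again: theta1 should approach 1
print(minibatch_gd([1, 2, 3, 4], [1, 2, 3, 4]))
```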

Convergence trends in different variants of Gradient Descent

For Batch Gradient Descent, the algorithm follows a smooth, direct path towards the minimum. If the cost function is convex, it converges to the global minimum, and if the cost function is not convex, it converges to a local minimum. The learning rate is typically held constant here.

Convergence trends in different variants of Gradient Descent in Machine Learning

For stochastic gradient descent and mini-batch gradient descent, the algorithm keeps fluctuating around the global minimum instead of settling on it. In order to converge, the learning rate needs to be decreased slowly.

Challenges in executing Gradient Descent

There are many cases where gradient descent fails to perform well. The challenges mainly fall into three categories:

  1. Data challenges
  2. Gradient challenges
  3. Implementation challenges

Data Challenges

  • The arrangement of data sometimes leads to challenges. If the data is arranged in such a way that it poses a non-convex optimization problem, then it becomes difficult to perform optimization using gradient descent. Gradient descent works best for problems which pose a well-defined convex optimization problem.
  • During the optimization of a non-convex problem, you will come across several minimal points. The lowest among all the points is called the global minimum, and the other points are called local minima. You will have to make sure you reach the global minimum and avoid the local minima.
  • There is also the saddle point problem, a situation where the gradient is zero but the point is not an optimum. Saddle points cannot always be avoided, and handling them is still an active area of research.

Gradient Challenges

  • While using gradient descent, if the execution is not proper, it leads to problems like vanishing or exploding gradients. These happen when the gradient is too small or too large, respectively, which results in no convergence.

Implementation Challenges

  • Insufficient memory can result in the failure of the network. A lot of neural network practitioners do not pay attention to this, but it is very important to look at the resource utilization of the network.
  • Another important thing to look at is to keep track of things like floating point considerations and hardware/software prerequisites.

Variants of Gradient Descent algorithms

Let us look at some of the most commonly used gradient descent algorithms and how they are implemented.

Vanilla Gradient Descent

One of the simplest forms of the gradient descent technique is vanilla gradient descent. Here, vanilla means pure, without any adulteration. Its main feature is that small steps are taken in the direction of the minimum by computing the gradient of the cost function.

The pseudocode for the same is mentioned below.

update = learning_rate * gradient_of_parameters
parameters = parameters - update

As you can see here, the parameters are updated by taking the gradient of the parameters and multiplying it by the learning rate, which suggests how quickly we should move towards the minimum. The learning rate is a hyper-parameter, and you should be careful while choosing its value.

Vanilla Gradient Descent Graph in Machine Learning

Gradient Descent with Momentum

In this case, we adjust the algorithm in such a manner that we are aware of the prior step before taking the next one.

The pseudocode for the same is mentioned below.

update = learning_rate * gradient
velocity = previous_update * momentum
parameter = parameter + velocity - update

Here, our update term is the same as that of vanilla gradient descent. But we introduce a new term called velocity, which considers the previous update multiplied by a constant called momentum.

Gradient descent with momentum update
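As a sketch, the momentum pseudocode above can be applied to the toy cost f(x) = x², whose gradient is 2x (function and variable names are ours, purely illustrative):

```python
# Gradient descent with momentum on f(x) = x^2 (gradient 2x),
# following the pseudocode above.
def momentum_descent(x=5.0, learning_rate=0.1, momentum=0.9, steps=200):
    previous_update = 0.0
    for _ in range(steps):
        gradient = 2 * x
        update = learning_rate * gradient
        velocity = previous_update * momentum
        x = x + velocity - update
        previous_update = velocity - update   # remember the step just taken
    return x

print(momentum_descent())   # approaches the minimum at x = 0
```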

ADAGRAD

ADAGRAD (Adaptive Gradient Algorithm) adapts the learning rate based on how the gradients have been changing across all the previous iterations.

The pseudocode for the same is mentioned below.

grad_component = previous_grad_component + (gradient * gradient)
rate_change = square_root(grad_component) + epsilon
adapted_learning_rate = learning_rate / rate_change
update = adapted_learning_rate * gradient
parameter = parameter - update

In the above pseudocode, epsilon is a small constant which prevents division by zero and keeps the rate of change of the learning rate in check.
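As a sketch, the adaptive rule can be tried on the toy cost f(x) = x², dividing the base learning rate by the root of the accumulated squared gradients plus epsilon (all names here are ours, purely illustrative):

```python
import math

# ADAGRAD on f(x) = x^2: the effective learning rate shrinks
# as squared gradients accumulate in grad_component.
def adagrad_descent(x=5.0, learning_rate=1.0, epsilon=1e-8, steps=500):
    grad_component = 0.0
    for _ in range(steps):
        gradient = 2 * x
        grad_component += gradient * gradient
        rate_change = math.sqrt(grad_component) + epsilon
        adapted_learning_rate = learning_rate / rate_change
        x = x - adapted_learning_rate * gradient
    return x

print(adagrad_descent())   # approaches the minimum at x = 0
```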

ADAM

ADAM is another adaptive technique which builds on ADAGRAD and further reduces its downsides. In simple words, you can consider it to be ADAGRAD + momentum.

The pseudocode for the same is mentioned below.

adapted_gradient = (previous_gradient * beta1) + (gradient * (1 - beta1))
gradient_component = (previous_gradient_component * beta2) + ((gradient * gradient) * (1 - beta2))
update = learning_rate * adapted_gradient / (square_root(gradient_component) + epsilon)
parameter = parameter - update

Here beta1 and beta2 are constants that keep the running averages of the gradient and of the squared gradient in check, and epsilon again prevents division by zero.
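A sketch of this on the same toy cost f(x) = x², keeping exponential running averages of the gradient and of its square, and omitting Adam's bias-correction step for brevity (all names are ours, purely illustrative):

```python
import math

# ADAM on f(x) = x^2: beta1 smooths the gradient (the momentum part),
# beta2 smooths the squared gradient (the ADAGRAD part).
def adam_descent(x=5.0, learning_rate=0.01, beta1=0.9, beta2=0.999,
                 epsilon=1e-8, steps=5000):
    m = 0.0   # running average of the gradient
    v = 0.0   # running average of the squared gradient
    for _ in range(steps):
        gradient = 2 * x
        m = (m * beta1) + (gradient * (1 - beta1))
        v = (v * beta2) + ((gradient * gradient) * (1 - beta2))
        x = x - learning_rate * m / (math.sqrt(v) + epsilon)
    return x

print(adam_descent())   # ends up close to the minimum at x = 0
```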

Tips for Gradient Descent

In this section you will learn about some tips and tricks for getting the most out of the gradient descent algorithm for machine learning.

  • Plot Cost versus Time: It is suggested to collect and plot the cost values calculated by the algorithm for each iteration. It helps you keep track of the descent. For a well-performing gradient descent the cost always decreases in each iteration. If you see there is no decrease, reduce the learning rate.
  • Learning Rate: The learning rate value is a small real value such as 0.1, 0.001 or 0.0001. Keep trying different values to check which works best for your algorithm.
  • Rescale Inputs: Try to achieve a range such as [0, 1] or [-1, 1] by rescaling all the input variables. The algorithm reaches the minimum cost faster if the shape of the cost function is not distorted or skewed.
  • Few Passes: Stochastic gradient descent often does not need more than 1-to-10 passes through the training dataset to converge on good or good enough coefficients.
  • Plot Mean Cost: The updates for each training dataset instance can result in a noisy plot of cost over time when using stochastic gradient descent. Try to take the average over 10, 100, or 1000 updates. This will give you a better idea of the learning trend for the algorithm.

Implementation of Gradient Descent in Python

Now that we have gone through all the elements related to gradient descent, let us implement gradient descent in Python. A simple gradient descent algorithm is as follows:

  1. Obtain a function in order to minimize f(x)
  2. Initialize a value x from which you want to start the descent or optimization from
  3. Specify a learning rate which will determine how much of a step to descend by or how quickly you want to converge to the minimum value
  4. Find the derivative of that value x (the descent)
  5. Now proceed to descend by the derivative of that value and then multiply it by the learning rate
  6. Update the value of x with the new value descended to
  7. Check your stop condition in order to see whether to stop
  8. If condition satisfies, stop. If not, proceed to step 4 with the new x value and keep repeating the algorithm

Let us create an arbitrary loss function and try to find a local minimum value for that function by implementing a simple representation of gradient descent using Python.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

We will use gradient descent to find a local minimum of this function: x³ − 3x² + 5

#creating the function and plotting it

function = lambda x: (x ** 3)-(3*(x ** 2))+5

#Get 500 evenly spaced numbers between -1 and 3 (arbitrarily chosen to ensure a steep curve)
x = np.linspace(-1,3,500)

#Plot the curve
plt.plot(x, function(x))
plt.show()

plotting the data set

Here, we can see that the local minimum occurs at around x = 2.0.
Let us now use gradient descent to find the exact value.

def deriv(x):
    '''
    Description: This function takes in a value of x and returns its derivative based on the
    initial function we specified.

    Arguments:
    x - a numerical value of x

    Returns:
    x_deriv - a numerical value of the derivative of x
    '''
    x_deriv = 3 * (x ** 2) - (6 * x)
    return x_deriv


def step(x_new, x_prev, precision, l_r):
    '''
    Description: This function takes in an initial or previous value for x, updates it based on
    steps taken via the learning rate and outputs the minimum value of x that reaches the precision satisfaction.

    Arguments:
    x_new - a starting value of x that will get updated based on the learning rate
    x_prev - the previous value of x that is getting updated to the new one
    precision - a precision that determines the stop of the stepwise descent
    l_r - the learning rate (size of each descent step)

    Output:
    1. Prints out the latest new value of x which equates to the minimum we are looking for
    2. Prints out the number of x values which equates to the number of gradient descent steps
    3. Plots a first graph of the function with the gradient descent path
    4. Plots a second graph of the function with a zoomed in gradient descent path in the important area
    '''

    # create empty lists where the updated values of x and y will be appended during each iteration
    x_list, y_list = [x_new], [function(x_new)]

    # keep looping until your desired precision
    while abs(x_new - x_prev) > precision:

        # change the value of x
        x_prev = x_new

        # get the negated derivative of the old value of x
        d_x = - deriv(x_prev)

        # get your new value of x by adding the previous value and the product of the derivative and the learning rate
        x_new = x_prev + (l_r * d_x)

        # append the new value of x to a list of all x-s for later visualization of the path
        x_list.append(x_new)

        # append the new value of y to a list of all y-s for later visualization of the path
        y_list.append(function(x_new))

    print("Local minimum occurs at: " + str(x_new))
    print("Number of steps: " + str(len(x_list)))

    plt.subplot(1, 2, 2)
    plt.scatter(x_list, y_list, c="g")
    plt.plot(x_list, y_list, c="g")
    plt.plot(x, function(x), c="r")
    plt.title("Gradient descent")
    plt.show()

    plt.subplot(1, 2, 1)
    plt.scatter(x_list, y_list, c="g")
    plt.plot(x_list, y_list, c="g")
    plt.plot(x, function(x), c="r")
    plt.xlim([1.0, 2.1])
    plt.title("Zoomed in Gradient descent to Key Area")
    plt.show()

#Implement gradient descent (all the arguments are arbitrarily chosen)
step(0.5, 0, 0.001, 0.05)

Local minimum occurs at: 1.9980265135950486
Number of steps: 25 

Gradient Descent Machine Learning Graph

Zoomed in Gradient Descent to Key Area in Machine Learning

Summary

In this article, you have learned about gradient descent for machine learning. Here we tried to cover most of the topics. To learn more about machine learning algorithms in-depth,  click here. Let us summarize all that we have covered in this article.

  • Optimization is the heart and soul of machine learning.
  • Gradient descent is a simple optimization technique which can be used with other machine learning algorithms.
  • Batch gradient descent refers to calculating the derivative from all training data before calculating an update.
  • Stochastic gradient descent refers to calculating the derivative from each training data instance and calculating the update immediately.

If you are inspired by the opportunities provided by Data Science, enrol in our  Data Science and Machine Learning Courses for more lucrative career options in this landscape.


Priyankur Sarkar

Data Science Enthusiast

Priyankur Sarkar loves to play with data and get insightful results out of it, then turn those data insights and results in business growth. He is an electronics engineer with a versatile experience as an individual contributor and leading teams, and has actively worked towards building Machine Learning capabilities for organizations.


Suggested Blogs

Top Data Analytics Certifications

What is data analytics?In the world of IT, every small bit of data count; even information that looks like pure nonsense has its significance. So, how do we retrieve the significance from this data? This is where Data Science and analytics comes into the picture.  Data Analytics is a process where data is inspected, transformed and interpreted to discover some useful bits of information from all the noise and make decisions accordingly. It forms the entire basis of the social media industry and finds a lot of use in IT, finance, hospitality and even social sciences. The scope in data analytics is nearly endless since all facets of life deal with the storage, processing and interpretation of data.Why data analytics? Data Analytics in this Information Age has nearly endless opportunities since literally everything in this era hinges on the importance of proper processing and data analysis. The insights from any data are crucial for any business. The field of data Analytics has grown more than 50 times from the early 2000s to 2021. Companies specialising in banking, healthcare, fraud detection, e-commerce, telecommunication, infrastructure and risk management hire data analysts and professionals every year in huge numbers.Need for certification:Skills are the first and foremost criteria for a job, but these skills need to be validated and recognised by reputed organisations for them to impress a potential employer. In the field of Data Analytics, it is pretty crucial to show your certifications. Hence, an employer knows you have hands-on experience in the field and can handle the workload of a real-world setting beyond just theoretical knowledge. Once you get a base certification, you can work your way up to higher and higher positions and enjoy lucrative pay packages. 
Top Data Analytics Certifications Certified Analytics Professional (CAP) Microsoft Certified Azure Data Scientist Associate Cloudera Certified Associate (CCA) Data Analyst Associate Certified Analytics Professional (aCAP) SAS Certified Data Analyst (Using SAS91. Certified Analytics Professional (CAP)A certification from an organisation called INFORMS, CAP is a notoriously rigorous certification and stands out like a star on an applicant's resume. Those who complete this program gain an invaluable credential and are able to distinguish themselves from the competition. It gives a candidate a comprehensive understanding of the analytical process's various fine aspects--from framing hypotheses and analytic problems to the proper methodology, along with acquisition, model building and deployment process with long-term life cycle management. It needs to be renewed after three years.The application process is in itself quite complex, and it also involves signing the CAP Code of Ethics before one is given the certification. The CAP panel reviews each application, and those who pass this review are the only ones who can give the exam.  Prerequisite: A bachelor’s degree with 5 years of professional experience or a master's degree with 3 years of professional experience.  Exam Fee & Format: The base price is $695. For individuals who are members of INFORMS the price is $495. (Source) The pass percentage is 70%. The format is a four option MCQ paper. Salary: $76808 per year (Source) 2. Cloudera Certified Associate (CCA) Data Analyst Cloudera has a well-earned reputation in the IT sector, and its Associate Data analyst certification can help bolster the resume of Business intelligence specialists, system architects, data analysts, database administrators as well as developers. 
It has a specific focus on SQL developers who aim to show their proficiency on the platform.This certificate validates an applicant's ability to operate in a CDH environment by Cloudera using Impala and Hive tools. One doesn't need to turn to expensive tuitions and academies as Cloudera offers an Analyst Training course with almost the same objectives as the exam, leaving one with a good grasp of the fundamentals.   Prerequisites: basic knowledge of SQL and Linux Command line Exam Fee & Format: The cost of the exam is $295 (Source), The test is a performance-based test containing 8-12 questions to be completed in a proctored environment under 129 minutes.  Expected Salary: You can earn the job title of Cloudera Data Analyst that pays up to $113,286 per year. (Source)3. Associate Certified Analytics Professional (aCAP)aCAP is an entry-level certification for Analytics professionals with lesser experience but effective knowledge, which helps in real-life situations. It is for those candidates who have a master’s degree in a field related to data analytics.  It is one of the few vendor-neutral certifications on the list and must be converted to CAP within 6 years, so it offers a good opportunity for those with a long term path in a Data Analytics career. It also needs to be renewed every three years, like the CAP certification. Like its professional counterpart, aCAP helps a candidate step out in a vendor-neutral manner and drastically increases their professional credibility.  Prerequisite: Master’s degree in any discipline related to data Analytics. Exam Fee: The base price is $300. For individuals who are members of INFORMS the price is $200. (Source). There is an extensive syllabus which covers: i. Business Problem Framing, ii. Analytics Problem Framing, iii. Data, iv. Methodology Selection, v. Model Building, vi. Deployment, vii. Lifecycle Management of the Analytics process, problem-solving, data science and visualisation and much more.4. 
SAS Certified Data Analyst (Using SAS9)From one of the pioneers in IT and Statistics - the SAS Institute of Data Management - a SAS Certified Data Scientist can gain insights and analyse various aspects of data from businesses using tools like the SAS software and other open-source methodology. It also validates competency in using complex machine learning models and inferring results to interpret future business strategy and release models using the SAS environment. SAS Academy for Data Science is a viable institute for those who want to receive proper training for the exam and use this as a basis for their career.  Prerequisites: To earn this credential, one needs to pass 5 exams, two from the SAS Certified Big Data Professional credential and three exams from the SAS Certified Advanced Analytics Professional Credential. Exam Fee: The cost for each exam is $180. (Source) An exception is Predictive Modelling using the SAS Enterprise Miner, costing $250, This exam can be taken in the English language. One can join the SAS Academy for Data Science and also take a practice exam beforehand. Salary: You can get a job as a SAS Data Analyst that pays up to $90,000 per year! (Source) 5. IBM Data Science Professional CertificateWhenever someone studies the history of a computer, IBM (International Business Machines) is the first brand that comes up. IBM is still alive and kicking, now having forayed into and becoming a major player in the Big Data segment. The IBM Data Science Professional certificate is one of the beginner-level certificates if you want to sink your hands into the world of data analysis. It shows a candidate's skills in various topics pertaining to data sciences, including various open-source tools, Python databases, SWL, data visualisation, and data methodologies.  One needs to complete nine courses to earn the certificate. It takes around three months if one works twelve hours per week. 
It also involves the completion of various hands-on assignments and building a portfolio. A candidate earns the Professional certificate from Coursera and a badge from IBM that recognises a candidate's proficiency in the area. Prerequisites: It is the optimal course for freshers since it requires no requisite programming knowledge or proficiency in Analytics. Exam Fee: It costs $39 per month (Source) to access the course materials and the certificate. The course is handled by the Coursera organisation. Expected Salary: This certification can earn you the title of IBM Data Scientist and help you earn a salary of $134,846 per annum. (Source) 6. Microsoft Certified Azure Data Scientist AssociateIt's one of the most well-known certifications for newcomers to step into the field of Big Data and Data analytics. This credential is offered by the leader in the industry, Microsoft Azure. This credential validates a candidate's ability to work with Microsoft Azure developing environment and proficiency in analysing big data, preparing data for the modelling process, and then progressing to designing models. One advantage of this credential is that it has no expiry date and does not need renewal; it also authorises the candidate’s extensive knowledge in predictive Analytics. Prerequisites: knowledge and experience in data science and using Azure Machine Learning and Azure Databricks. Exam Fee: It costs $165 to (Source) register for the exam. One advantage is that there is no need to attend proxy institutions to prepare for this exam, as Microsoft offers free training materials as well as an instructor-led course that is paid. There is a comprehensive collection of resources available to a candidate. Expected Salary: The job title typically offered is Microsoft Data Scientist and it typically fetches a yearly pay of $130,993.(Source) Why be a Data Analytics professional? For those already working in the field of data, being a Data Analyst is one of the most viable options. 
The salary of a data analyst ranges from $65,000 to $85,000 depending on number of years of experience. This lucrative salary makes it worth the investment to get a certification and advance your skills to the next level so that you can work for multinational companies by interpreting and organising data and using this analysis to accelerate businesses. These certificates demonstrate that you have the required knowledge needed to operate data models of the volumes needed by big organizations. 1. Demand is more than supply With the advent of the Information Age, there has been a huge boom in companies that either entirely or partially deal with IT. For many companies IT forms the core of their business. Every business has to deal with data, and it is crucial to get accurate insights from this data and use it to further business interests and expand profits. The interpretation of data also aims to guide them in the future to make the best business decisions.  Complex business intelligence algorithms are in place these days. They need trained professionals to operate them; since this field is relatively new, there is a shortage of experts. Thus, there are vacancies for data analyst positions with lucrative pay if one is qualified enough.2. Good pay with benefitsA data analyst is an extremely lucrative profession, with an average base pay of $71,909 (Source), employee benefits, a good work-home balance, and other perks. It has been consistently rated as being among the hottest careers of the decade and allows professionals to have a long and satisfying career.   Companies Hiring Certified Data Analytics Professionals Oracle A California based brand, Oracle is a software company that is most famous for its data solutions. With over 130000 employees and a revenue of 39 billion, it is surely one of the bigger players in Data Analytics.  MicroStrategy   Unlike its name, this company is anything but micro, with more than 400 million worth of revenue. 
It provides a suite of analytical products along with business mobility solutions. It is a key player in the mobile space, working natively with Android and iOS.   SAS   One of the companies in the list which provides certifications and is also without a doubt one of the largest names in the field of Big Data, machine learning and Data Analytics, is SAS. The name SAS is derived from Statistical Analysis System. This company is trusted and has a solid reputation. It is also behind the SAS Institute for Data Science. Hence, SAS is the organisation you would want to go to if you're aiming for a long-term career in data science.    Conclusion To conclude, big data and data Analytics are a field of endless opportunities. By investing in the right credential, one can pave the way to a viable and lucrative career path. Beware though, there are lots of companies that provide certifications, but only recognised and reputed credentials will give you the opportunities you are seeking. Hiring companies look for these certifications as a mark of authenticity of your hands-on experience and the amount of work you can handle effectively. Therefore, the credential you choose for yourself plays a vital role in the career you can have in the field of Data analytics.  Happy learning!    
Why Should You Start a Career in Machine Learning?

If you are even remotely interested in technology, you would have heard of machine learning. In fact, machine learning is now a buzzword, and there are dozens of articles and research papers dedicated to it. Machine learning is a technique which makes a machine learn from past experience. Complex domain problems can be resolved quickly and efficiently using Machine Learning techniques. We are living in an age where huge amounts of data are produced every second. This explosion of data has led to the creation of machine learning models which can be used to analyse data and to benefit businesses. This article answers a few important questions related to Machine Learning and informs you about the career path in this prestigious and important domain.

What is Machine Learning?
So, here’s your introduction to Machine Learning. The term was coined by Arthur Samuel in 1959, and a widely used formal definition appeared in Tom Mitchell’s 1997 book on Machine Learning: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” The difference between traditional programming and programming using Machine Learning is usually depicted as two approaches: the first approach (a) is the traditional approach, and the second approach (b) is the Machine Learning based approach. Machine Learning encompasses the techniques in AI which allow a system to learn automatically from the data available. While learning, the system tries to improve with experience, without any explicit programming effort. Broadly, any machine learning application follows these steps:

Selecting the training dataset
As the definition indicates, machine learning algorithms require past experience, that is, data, for learning. So, selection of appropriate data is the key for any machine learning application.

Preparing the dataset by preprocessing the data
Once the decision about the data is made, it needs to be prepared for use.
Machine learning algorithms are very sensitive to small changes in data. To get the right insights, data must be preprocessed, which includes data cleaning and data transformation.

Exploring the basic statistics and properties of data
To understand what the data wishes to convey, the data engineer or Machine Learning engineer needs to understand the properties of the data in detail. These details are understood by studying the statistical properties of the data. Visualization is an important process for understanding the data in detail.

Selecting the appropriate algorithm to apply on the dataset
Once the data is ready and understood in detail, appropriate Machine Learning algorithms or models are selected. The choice of algorithm depends on the characteristics of the data as well as the type of task to be performed on it. The choice also depends on what kind of output is required from the data.

Checking the performance and fine-tuning the parameters of the algorithm
The model or algorithm chosen is fine-tuned to get improved performance. If multiple models are applied, they are weighed against each other on performance. The final algorithm is again fine-tuned to get the appropriate output and performance.

Why Pursue a Career in Machine Learning in 2021?
A recent survey has estimated that jobs in AI and ML have grown by more than 300%. Even before the pandemic struck, Machine Learning skills were in high demand, and the demand is expected to increase two-fold in the near future. A career in machine learning gives you the opportunity to make significant contributions to AI, the future of technology. Big and small businesses alike are adopting Machine Learning models to improve their bottom-line margins and return on investment. The use of Machine Learning has gone beyond just technology, and it is now used in diverse industries including healthcare, automobile, manufacturing, government and more.
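The workflow above can be sketched end to end on a toy problem. The code below is a minimal illustration only, assuming nothing beyond numpy: it stands in a simple least-squares model for a full ML library, but it walks the same steps of selecting data, preprocessing, exploring statistics, fitting a model, and checking performance.

```python
import numpy as np

# 1. Select the training dataset: a toy regression problem.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=200)

# 2. Prepare the data: standardize the feature (preprocessing).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 3. Explore basic statistics and properties of the data.
print("target mean:", y.mean(), "target std:", y.std())

# 4. Select and fit a model: ordinary least squares with an intercept column.
A = np.hstack([X_std, np.ones((len(X_std), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# 5. Check performance: mean squared error on the training data.
mse = np.mean((A @ coef - y) ** 2)
print("MSE:", mse)
```

In a real application, steps 4 and 5 would also involve a held-out validation set and hyperparameter tuning, but the shape of the loop is the same.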
This has greatly enhanced the value of Machine Learning experts, who can earn an average salary of $112,000. Huge numbers of jobs are expected to be created in the coming years. Here are a few reasons why one should pursue a career in Machine Learning:

The global machine learning market is expected to touch $20.83B in 2024, according to Forbes. We are living in a digital age, and this explosion of data has made the use of machine learning models a necessity. Machine Learning is the only way to extract meaning out of data, and businesses need Machine Learning engineers to analyze huge data sets and gain insights from them to improve their businesses.

If you like numbers, if you like research, if you like to read and test, and if you have a passion for analysis, then machine learning is the career for you. Learning the right tools and programming languages will help you use machine learning to provide appropriate solutions to complex problems, overcome challenges and grow the business.

Machine Learning is a great career option for those interested in computer science and mathematics. They can come up with new Machine Learning algorithms and techniques to cater to the needs of various business domains.

As explained above, a career in machine learning is both rewarding and lucrative. There are a huge number of opportunities available if you have the right expertise and knowledge. On average, Machine Learning engineers get higher salaries than other software developers.

Years of experience in the Machine Learning domain help you break into data scientist roles, which is not just among the hottest careers of our generation but also a highly respected and lucrative one. The right skills in the right business domain help you progress and make a mark for yourself in your organization.
For example, if you have expertise in the pharmaceutical industry and experience working in Machine Learning, then you may land job roles as a data scientist consultant in big pharmaceutical companies.

Statistics on Machine Learning growth and the industries that use ML
According to a research paper in AI Multiple (https://research.aimultiple.com/ml-stats/), the Machine Learning market will grow to 9 billion USD by the end of 2022. There are various areas where Machine Learning models and solutions are getting deployed, and businesses see an overall increase of 44% in investments in this area. North America is one of the leading regions in the adoption of Machine Learning, followed by Asia. The Global Machine Learning market is projected to grow by 42%. There is a huge demand for Machine Learning modelling because of the widespread use of Cloud based applications and services. The pandemic has changed the face of businesses, making them heavily dependent on Cloud and AI based services. Google, IBM, and Amazon are just some of the companies that have invested heavily in AI and Machine Learning based application development to provide robust solutions for problems faced by small to large scale businesses. Machine Learning and Cloud based solutions are scalable and secure for all types of business. ML analyses and interprets data patterns, computing and developing algorithms for various business purposes.

Advantages of a Machine Learning course
Now that we have established the advantages of pursuing a career in Machine Learning, let’s understand where to start our machine learning journey. The best option would be to start with a Machine Learning course. There are various platforms which offer popular Machine Learning courses.
One can always start with an online course, which is both effective and safe in these COVID times. These courses start with an introduction to Machine Learning and then slowly help you build your skills in the domain. Many courses even start with the basics of programming languages such as Python, which are important for building Machine Learning models. Courses from reputed institutions will hand-hold you through the basics. Once the basics are clear, you may switch to an offline course and get the required certification. Online certifications have the same value as offline classes. They are a great way to clear your doubts and get personalized help to grow your knowledge. These courses can be completed alongside your normal job or education, as most are self-paced and can be taken at a time of your convenience. There are plenty of online blogs and articles to aid you in the completion of your certification. Machine Learning courses include many real-time case studies, which help you understand both the basics and the application aspects. Learning and applying are both important, and both are covered in good Machine Learning courses. So, do your research and pick an online tutorial that is from a reputable institute.

What Does the Career Path in Machine Learning Look Like?
One can start a career in the Machine Learning domain as a developer or application programmer. But the acquisition of the right skills and experience can lead you to various career paths. Following are some of the career options in Machine Learning (not an exhaustive list):

Data Scientist
A data scientist is a person with rich experience in a particular business field. A person who has knowledge of the domain, as well as of machine learning modelling, is a data scientist.
A data scientist’s job is to study the data carefully and suggest accurate models to improve the business.

AI and Machine Learning Engineer
An AI engineer is responsible for choosing the proper Machine Learning algorithm, based on natural language processing and neural networks, and for applying it in AI applications like personalized advertising. A Machine Learning engineer is responsible for creating the appropriate models for improvement of the business.

Data Engineer
A Data Engineer, as the name suggests, is responsible for collecting data and making it ready for the application of Machine Learning models. Identifying the right data and making it ready for the extraction of further insights is the main work of a data engineer.

Business Analyst
A person who studies the business and analyzes the data to get insights from it is a Business Analyst. He or she is responsible for extracting insights from the data at hand.

Business Intelligence (BI) Developer
A BI developer uses Machine Learning and Data Analytics techniques to work on large amounts of data. Proper representation of data to suit business decisions, using the latest tools for the creation of intuitive dashboards, is the role of a BI developer.

Human Machine Interface Learning Engineer
Creating tools using machine learning techniques to ease human-machine interaction, or to automate decisions, is the role of a Human Machine Interface learning engineer. This person helps in generating choices for users to ease their work.

Natural Language Processing (NLP) Engineer or Developer
As the name suggests, this person develops various techniques to process natural language constructs. Building applications or systems using machine learning techniques for natural language based applications is their main task.
They create multilingual chatbots for use in websites and other applications.

Why are Machine Learning Roles so Popular?
As mentioned above, the market for AI and ML has grown tremendously over the past years. Machine Learning techniques are applied in every domain, including marketing, sales, product recommendations, brand retention, advertising, understanding customer sentiment, security, banking and more. Machine learning algorithms are also used in email clients to ease the user's work. This says a lot, and proves that careers in Machine Learning are in high demand, as businesses everywhere are incorporating machine learning techniques to improve their operations. One can harness this popularity by skilling up in Machine Learning. Machine Learning models are now being used by every company, irrespective of size, small or big, to get insights from their data and use these insights to improve the business. As every company wishes to grow faster, they are deploying more machine learning engineers to get their work done on time. Also, the migration of businesses to Cloud services for better security and scalability has increased the requirement for more Machine Learning algorithms and models to cater to their needs. Introducing Machine Learning techniques and solutions has brought huge returns for businesses. Machine Learning solution providers like Google, IBM, Microsoft etc. are investing in human resources for the development of Machine Learning models and algorithms. The tools developed by them are popularly used by businesses to get early returns.
It has been observed that there has been a significant increase in Machine Learning patents over the past few years, indicating the quantum of work happening in this domain.

Machine Learning Skills
Let’s visit a few important skills one must acquire to work in the domain of Machine Learning.

Programming languages
Knowledge of programming is very important for a career in Machine Learning. Languages like Python and R are popularly used to develop applications using Machine Learning models and algorithms. Python, being simple and flexible, is very popular for AI and Machine Learning applications. These languages provide rich library support for the implementation of Machine Learning algorithms. A person who is good at programming can work very efficiently in this domain.

Mathematics and Statistics
The base of Machine Learning is mathematics and statistics. Statistics applied to data helps in understanding it in micro detail. Many machine learning models are based on probability theory and require knowledge of linear algebra, transformations etc. A good understanding of statistics and probability eases early adoption of the Machine Learning domain.

Analytical tools
A plethora of analytical tools are available in which machine learning models are already implemented and ready for use. These tools are also very good for visualization purposes. Tools like IBM Cognos, Power BI, and Tableau are important for pursuing a career as a Machine Learning engineer.

Machine Learning Algorithms and libraries
To become a master in this domain, one must master the libraries which are provided with various programming languages. A basic understanding of how machine learning algorithms work and are implemented is crucial.

Data Modelling for Machine Learning based systems
Data lies at the core of any Machine Learning application. So, modelling the data to suit the application of Machine Learning algorithms is an important task.
Data modelling experts are at the heart of development teams that build machine learning based systems. SQL based solutions like Oracle and SQL Server, as well as NoSQL solutions, are important for modelling data required for Machine Learning applications. MongoDB, DynamoDB and Riak are some important NoSQL solutions available to process unstructured data for Machine Learning applications. Other than these skills, there are two more skills that may prove beneficial for those planning a career in the Machine Learning domain:

Natural Language Processing techniques
For e-commerce sites, customer feedback is very important and crucial in determining the roadmap of future products. Many customers give reviews for the products that they have used, or give suggestions for improvement. These feedbacks and opinions are analyzed to gain more insight into customers' buying habits as well as into the products themselves. This is part of natural language processing using Machine Learning. The likes of Google, Facebook and Twitter are developing machine learning algorithms for Natural Language Processing and are constantly working on improving their solutions. Knowledge of the basics of Natural Language Processing techniques and libraries is a must in the domain of Machine Learning.

Image Processing
Knowledge of image and video processing is very crucial when a solution is required in areas such as security, weather forecasting, or crop prediction. Machine Learning based solutions are very effective in these domains. Matlab, Octave and OpenCV are some important tools available to develop Machine Learning based solutions which require image or video processing.

Conclusion
Machine Learning is a technique to automate tasks based on past experience. This is among the most lucrative career choices right now and will continue to remain so in the future. Job opportunities are increasing day by day in this domain.
Acquiring the right skills by opting for a proper Machine Learning course is important to grow in this domain. You can have an impressive career trajectory as a machine learning expert, provided you have the right skills and expertise.
Types of Probability Distributions Every Data Science Expert Should know

Data Science has become one of the most popular interdisciplinary fields. It uses scientific approaches, methods, algorithms, and operations to obtain facts and insights from unstructured, semi-structured, and structured datasets. Organizations use these collected facts and insights for efficient production, business growth, and to predict user requirements. Probability distributions play a significant role in performing data analysis and in preparing a dataset for training a model. In this article, you will learn about the types of probability distributions, random variables, types of discrete distributions, and continuous distributions.

What is a Probability Distribution?
A probability distribution is a statistical function that determines all the probable values a random variable can take within a particular range, together with their likelihoods. This range of values has a lower bound and an upper bound, which we call the minimum and the maximum possible values. Various factors on which the plotting of a value depends are standard deviation, mean (or average), skewness, and kurtosis. All of these play a significant role in Data Science as well. We can use probability distributions in physics, engineering, finance, data analysis, machine learning, etc.

Significance of Probability Distributions in Data Science
In a way, most data science and machine learning operations depend on several assumptions about the probability of your data. Probability distributions allow a skilled data analyst to recognize and comprehend patterns in large data sets that otherwise look like entirely random values. Thus, probability distributions form a toolkit with which we can summarize a large data set. The density function and distribution techniques can also help in plotting data, thus supporting data analysts in visualizing data and extracting meaning.

General Properties of Probability Distributions
A probability distribution determines the likelihood of any outcome.
Mathematically, for a specific value x, the function p(x) gives the probability that the random variable takes that value. Some general properties of a probability distribution are:

The total of the probabilities over all possible values equals 1.
The probability of any specific value, or of any range of values, must lie between 0 and 1.
Probability distributions tell us the dispersal of the values of the random variable. Consequently, the type of variable also helps determine the type of probability distribution.

Common Data Types
Before jumping directly into explaining the different probability distributions, let us first understand the main categories of data they describe. Data analysts and data engineers have to deal with a broad spectrum of data, such as text, numerical, image, audio, voice, and many more. Each of these has a specific means of being represented and analyzed. Data in a probability distribution can be either discrete or continuous; numerical data in particular takes one of these two forms.

Discrete data: These take specific values, where the set of possible outcomes is fixed. For example, the result of rolling two dice, or the number of overs in a T-20 match. In the first case, the result lies between 2 and 12; in the second case, the number of overs is at most 20. Different types of discrete distributions that use discrete data are:
Binomial Distribution
Hypergeometric Distribution
Geometric Distribution
Poisson Distribution
Negative Binomial Distribution
Multinomial Distribution

Continuous data: These can take any value, irrespective of bound or limit. Examples: weight, height, any trigonometric value, age, etc.
Different types of continuous distributions that use continuous data are:
Beta distribution
Cauchy distribution
Exponential distribution
Gamma distribution
Logistic distribution
Weibull distribution

Types of Probability Distribution Explained
Here are some of the popular types of probability distributions used by data science professionals. (Try all the code using Jupyter Notebook.)

Normal Distribution: It is also known as the Gaussian distribution. It is one of the simplest types of continuous distribution. This probability distribution is symmetrical around its mean value, and it shows that data in close proximity to the mean occurs frequently, compared to data that is far away from it. For the standard normal distribution, mean = 0 and the variance is a finite value, with 0 at the center; the curve shifts and scales for different mean and variance values.

Here is a code example showing the use of the Normal Distribution (note that matplotlib's old normed argument has been replaced by density):

from scipy.stats import norm
import matplotlib.pyplot as mpl
import numpy as np

def normalDist() -> None:
    fig, ax = mpl.subplots(1, 1)
    mean, var, skew, kurt = norm.stats(moments='mvsk')
    x = np.linspace(norm.ppf(0.01), norm.ppf(0.99), 100)
    ax.plot(x, norm.pdf(x), 'r-', lw=5, alpha=0.6, label='norm pdf')
    ax.plot(x, norm.cdf(x), 'b-', lw=5, alpha=0.6, label='norm cdf')
    vals = norm.ppf([0.001, 0.5, 0.999])
    np.allclose([0.001, 0.5, 0.999], norm.cdf(vals))  # sanity check: cdf inverts ppf
    r = norm.rvs(size=1000)
    ax.hist(r, density=True, histtype='stepfilled', alpha=0.2)
    ax.legend(loc='best', frameon=False)
    mpl.show()

normalDist()

Output:

Bernoulli Distribution: It is the simplest type of probability distribution, and a particular case of the binomial distribution with n = 1. A binomial distribution uses n trials, where n > 1, whereas the Bernoulli distribution uses only a single trial.
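The n = 1 relationship can be checked numerically: summing n independent Bernoulli trials produces a Binomial(n, p) draw. Below is a minimal simulation sketch (not part of the original article's code; it only assumes numpy is available):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, reps = 20, 0.4, 100_000

# A Bernoulli(p) trial is a Binomial(1, p) draw; sum n of them per repetition.
bernoulli_sums = rng.binomial(1, p, size=(reps, n)).sum(axis=1)
# Draw directly from Binomial(n, p) for comparison.
binomial_draws = rng.binomial(n, p, size=reps)

# Both samples should share mean n*p = 8 and variance n*p*(1-p) = 4.8.
print(bernoulli_sums.mean(), binomial_draws.mean())
print(bernoulli_sums.var(), binomial_draws.var())
```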
The probability mass function of a Bernoulli distribution is:

P(X = x) = p^x (1 - p)^(1 - x), for x in {0, 1}

where p = probability of success and q = 1 - p = probability of failure.

Here is a code example showing the use of the Bernoulli Distribution:

from scipy.stats import bernoulli
import seaborn as sb

def bernoulliDist():
    data_bern = bernoulli.rvs(size=1200, p=0.7)
    ax = sb.distplot(
        data_bern,
        kde=True,
        color='g',
        hist_kws={'alpha': 1},
        kde_kws={'color': 'y', 'lw': 3, 'label': 'KDE'})
    ax.set(xlabel='Bernoulli Values', ylabel='Frequency Distribution')

bernoulliDist()

Output:

Continuous Uniform Distribution: In this type of continuous distribution, all outcomes are equally likely; each value gets the same probability density. This symmetric distribution spreads its random variable over an interval [a, b] with a constant density of 1/(b - a).

Here is a code example showing the use of the Uniform Distribution:

from numpy import random
import matplotlib.pyplot as mpl
import seaborn as sb

def uniformDist():
    sb.distplot(random.uniform(size=1200), hist=True)
    mpl.show()

uniformDist()

Output:

Log-Normal Distribution: A log-normal distribution is another type of continuous distribution, for a variable whose logarithm is normally distributed. Taking the logarithm of log-normally distributed values therefore transforms them into a normal distribution.
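That transformation can be verified numerically: the log of log-normal samples should recover the parameters of the underlying normal distribution. A small sketch, assuming numpy and illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 1.0

# Log-normal samples whose underlying normal has mean mu and std sigma.
samples = rng.lognormal(mean=mu, sigma=sigma, size=200_000)

# Taking logs should yield an (approximately) Normal(mu, sigma) sample.
logs = np.log(samples)
print(logs.mean(), logs.std())  # close to 3.0 and 1.0
```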
Here is a code example showing the use of the Log-Normal Distribution (a numpy import is needed, and matplotlib's old normed argument has been replaced by density):

import numpy as np
import matplotlib.pyplot as mpl

def lognormalDist():
    muu, sig = 3, 1
    s = np.random.lognormal(muu, sig, 1000)
    cnt, bins, ignored = mpl.hist(s, 80, density=True, align='mid', color='y')
    x = np.linspace(min(bins), max(bins), 10000)
    calc = (np.exp(-(np.log(x) - muu)**2 / (2 * sig**2))
            / (x * sig * np.sqrt(2 * np.pi)))
    mpl.plot(x, calc, linewidth=2.5, color='g')
    mpl.axis('tight')
    mpl.show()

lognormalDist()

Output:

Pareto Distribution: It is one of the most important types of continuous distribution. The Pareto distribution is a skewed statistical distribution that uses a power law to describe quality control, scientific, social, geophysical, actuarial, and many other types of observable phenomena. The distribution shows a slow, heavy-decaying tail in the plot, where much of the probability mass sits at its extreme end.

Here is a code example showing the use of the Pareto Distribution:

import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import pareto

def paretoDist():
    xm = 1.5
    alp = [2, 4, 6]
    x = np.linspace(0, 4, 800)
    output = np.array([pareto.pdf(x, scale=xm, b=a) for a in alp])
    plt.plot(x, output.T)
    plt.show()

paretoDist()

Output:

Exponential Distribution: It is a type of continuous distribution that models the time elapsed between events in a Poisson process. Let’s suppose you have a Poisson distribution model that holds the number of events happening in a given period.
We can model the time between each event using an exponential distribution.

Here is a code example showing the use of the Exponential Distribution:

from numpy import random
import matplotlib.pyplot as mpl
import seaborn as sb

def expDist():
    sb.distplot(random.exponential(size=1200), hist=True)
    mpl.show()

expDist()

Output:

Types of Discrete Probability Distribution
There are various types of discrete probability distributions a data science aspirant should know about. Some of them are:

Binomial Distribution: It is one of the popular discrete distributions, and it determines the probability of exactly x successes in n trials. We can use the binomial distribution in situations where we want to extract the probability of SUCCESS or FAILURE from an experiment or survey which went through multiple repetitions. A binomial distribution has a fixed number of trials. Also, the trials must be independent, and the probability of success must remain the same across trials.

Here is a code example showing the use of the Binomial Distribution:

from numpy import random
import matplotlib.pyplot as mpl
import seaborn as sb

def binomialDist():
    sb.distplot(random.normal(loc=50, scale=6, size=1200), hist=False, label='normal')
    sb.distplot(random.binomial(n=100, p=0.6, size=1200), hist=False, label='binomial')
    mpl.show()

binomialDist()

Output:

Geometric Distribution: The geometric probability distribution is one of the crucial types of discrete distributions. It gives the probability that an event with likelihood p first occurs on the nth Bernoulli trial, where n is a discrete random variable. In this distribution, the trials continue until the first success is encountered; the number of trials is not fixed in advance, and each trial is independent of the previous ones.
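A quick sanity check on this description: the expected number of trials until the first success is 1/p. A minimal simulation sketch, assuming only numpy:

```python
import numpy as np

rng = np.random.default_rng(5)
p, reps = 0.3, 200_000

# numpy's geometric gives the trial number of the first success (support 1, 2, ...).
trials = rng.geometric(p, size=reps)
print(trials.mean())  # close to 1/p = 3.33...
```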
Here is a code example showing the use of the Geometric Distribution (the helper function must use its probability parameter, and since attempts start at 1, the highlighted bar index is attempt - 1):

import matplotlib.pyplot as mpl

def probability_to_occur_at(attempt, probability):
    # P(first success exactly on the given attempt)
    return (1 - probability)**(attempt - 1) * probability

p = 0.3
attempt = 4
attempts_to_show = range(1, 21)
print('Probability that this event will occur on try', attempt, ':', probability_to_occur_at(attempt, p))
mpl.xlabel('Number of Trials')
mpl.ylabel('Probability of the Event')
barlist = mpl.bar(attempts_to_show,
                  height=[probability_to_occur_at(x, p) for x in attempts_to_show],
                  tick_label=attempts_to_show)
barlist[attempt - 1].set_color('g')
mpl.show()

Output:

Poisson Distribution: The Poisson distribution is one of the popular types of discrete distribution; it shows how many times an event is likely to occur in a specific span of time. It can be obtained as a limit of the binomial distribution, letting the number of trials grow to infinity while the expected number of successes stays fixed. Data analysts often use Poisson distributions to model independent events occurring at a steady rate in a given time interval.

Here is a code example showing the use of the Poisson Distribution:

from scipy.stats import poisson
import seaborn as sb
import numpy as np
import matplotlib.pyplot as mpl

def poissonDist():
    mpl.figure(figsize=(10, 10))
    data_binom = poisson.rvs(mu=3, size=5000)
    ax = sb.distplot(data_binom, kde=True, color='g',
                     bins=np.arange(data_binom.min(), data_binom.max() + 1),
                     kde_kws={'color': 'y', 'lw': 4, 'label': 'KDE'})
    ax.set(xlabel='Poisson Distribution', ylabel='Data Frequency')
    mpl.show()

poissonDist()

Output:

Multinomial Distribution: A multinomial distribution is another popular type of discrete probability distribution; it models the outcome counts of an event having two or more possible categories. The term "multi" means more than one. The binomial distribution is the particular case of the multinomial distribution with exactly two possible outcomes, such as true/false or heads/tails.
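That special case can be verified numerically: the first column of draws from a two-category multinomial behaves exactly like draws from the corresponding binomial. A small sketch, assuming numpy and illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 10, 0.35, 100_000

# A multinomial with two categories [p, 1-p]: the first column counts "successes".
multi_success = rng.multinomial(n, [p, 1 - p], size=reps)[:, 0]
# The same quantity drawn directly as Binomial(n, p).
binom = rng.binomial(n, p, size=reps)

# Both should have mean n*p = 3.5 and variance n*p*(1-p) = 2.275.
print(multi_success.mean(), binom.mean())
```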
Here is a code example showing the use of the Multinomial Distribution:

import numpy as np
import matplotlib.pyplot as mpl

np.random.seed(99)
n = 12
pvalue = [0.3, 0.46, 0.22]
s = []
p = []
for size in np.logspace(2, 3):
    outcomes = np.random.multinomial(n, pvalue, size=int(size))
    prob = sum((outcomes[:, 0] == 7) & (outcomes[:, 1] == 2) & (outcomes[:, 2] == 3)) / len(outcomes)
    p.append(prob)
    s.append(int(size))
fig1 = mpl.figure()
mpl.plot(s, p, 'o-')
mpl.plot(s, [0.0248]*len(s), '--r')
mpl.grid()
mpl.xlim(left=0)
mpl.xlabel('Number of Events')
mpl.ylabel('Function p(X = K)')
mpl.show()

Output:

Negative Binomial Distribution: It is also a type of discrete probability distribution, also known as the Pascal distribution; here the random variable counts the number of trials needed to obtain a specified number of successes in repeated Bernoulli experiments.

Here is a code example showing the use of the Negative Binomial Distribution (the quantile function ppf is only defined on probabilities in [0, 1], so the pmf and cdf are plotted over integer counts instead):

import matplotlib.pyplot as mpl
import numpy as np
from scipy.stats import nbinom

k = np.arange(0, 6)
gr, kr = 0.3, 0.7
g = nbinom.cdf(k, gr, kr)
s = nbinom.pmf(k, gr, kr)
mpl.plot(k, g, '*', k, s, 'r--')
mpl.show()

Output:

Apart from the distribution types mentioned above, various other probability distributions exist that data science professionals can use on real datasets. In the next section, we will look at some of the interconnections and relationships between the various types of probability distributions.

Relationships between Probability Distributions
It is surprising to see how many types of probability distributions are interconnected. In the chart shown below, a dashed line indicates a limiting connection between two families of distributions, whereas a solid line shows an exact relationship between them in terms of transformation, variable, type, etc.
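One such exact relationship, mentioned earlier for the exponential distribution, can be checked numerically: counting how many exponential inter-arrival times fit into a unit time interval yields Poisson-distributed counts. A small simulation sketch, assuming numpy and an illustrative rate value:

```python
import numpy as np

rng = np.random.default_rng(3)
lam, reps = 4.0, 50_000  # event rate per unit time

counts = np.empty(reps, dtype=int)
for i in range(reps):
    # Accumulate Exponential(1/lam) inter-arrival times until time exceeds 1.
    t, k = 0.0, 0
    while True:
        t += rng.exponential(1.0 / lam)
        if t > 1.0:
            break
        k += 1
    counts[i] = k

# The counts should behave like Poisson(lam): mean and variance both near lam.
print(counts.mean(), counts.var())
```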
Conclusion
Probability distributions are prevalent among data analysts and data science professionals because of their wide usage. Today, companies and enterprises hire data science professionals in many sectors, namely computer science, health, insurance, engineering, and even social science, where probability distributions appear as fundamental tools for application. It is essential for data analysts and data scientists to know the core of statistics. Probability distributions play a requisite role in analyzing data and preparing a dataset to train algorithms efficiently. If you want to learn more about data science, particularly probability distributions and their uses, check out KnowledgeHut's comprehensive Data Science course.