What is Gradient Descent For Machine Learning


In our day-to-day lives, we constantly optimize our decisions without consciously recognizing the process. We choose a shorter route to work to avoid traffic, fit a quick walk around the campus into a snack break, or book a cab in advance to reach the airport on time.

Optimization is the ultimate goal, whether you are dealing with real-life events or building a technology-based product. It is also at the heart of most of the statistical and machine learning techniques that are widely used in data science.

Optimization for Machine Learning

Accuracy is our main concern when we work on problems in machine learning and artificial intelligence. In real-world applications errors can be very costly, so they must be kept as low as possible.


Let us consider the case of self-driving cars. The model fitted in the car detects obstacles in its way and takes appropriate action, such as slowing down or applying the brakes. Keep in mind that there is no human in the car to operate it or to override the actions it takes. If the model is not accurate in such a scenario, it may fail to detect other cars or pedestrians and end up crashing, putting several lives at risk.

This is where we need optimization algorithms to evaluate our model and judge whether it is performing according to our needs. The evaluation is made easier by calculating the cost function, which we will look at in detail later in this article. It is basically a mapping function that tells us the difference between the desired output and what our model is computing. We can then correct the model accordingly and avoid undesired behavior.

Optimization may be defined as the process by which an optimum is achieved. It is all about finding the best output for your problem with the resources available. However, optimization in machine learning is slightly different. In most optimization problems we know the data, including its shape and size, which helps us see what needs to improve. In machine learning we do not know what new data will look like. For this reason, optimization is performed on the training data, and a validation data set is then used to check the model's performance.

Optimization has many advanced applications, including airway routing, market basket analysis and face recognition. Many machine learning algorithms, such as linear regression and neural networks, depend on optimization techniques. Here, we are going to look at one popular optimization technique called gradient descent.

What is Gradient Descent?

Gradient descent is an optimization algorithm that is mainly used to find the minimum of a function. In machine learning, gradient descent is used to update the parameters of a model. The parameters depend on the algorithm: they are the coefficients in linear regression and the weights in neural networks.

Let us relate gradient descent to a real-life analogy for better understanding. Imagine you want to descend into a valley while blindfolded. You would take a step and feel for the slope of the ground, checking whether it goes up or down. Once you are sure of the downward slope, you follow it and repeat the step again and again until you have descended completely (or reached the minimum).

Figure: Gradient descent analogy with a valley and its slope.

Similarly, let us consider another analogy. Suppose you place a ball on an inclined plane (at position A). Under gravity, it will start rolling and keep going until it reaches a gentle plane where it comes to rest (at position B, as shown in the figure below).

Figure: A ball placed on an inclined plane.

This is exactly what happens in gradient descent. The inclined or irregular surface is the cost function when it is plotted. The role of gradient descent is to provide the direction and the step size (controlled by the learning rate) of the movement needed to reach the minimum of the function, i.e. the point where the cost is lowest.

Figure: Graphical representation of gradient descent.

How does Gradient Descent work?

The primary goal of a machine learning algorithm is to build a model, which is basically a hypothesis that can be used to estimate Y based on X. Let us consider a model based on housing data that includes the sale price of each house, its size and so on. Suppose we want to predict the price of a house based on its size. This is clearly a regression problem: given some inputs, we would like to predict a continuous output.

The hypothesis is usually presented as

$$h_\theta(x) = \theta_0 + \theta_1 x$$

where the theta values are the parameters.

Let us look into some examples and visualize the hypothesis:

$$\theta_0 = 1.5, \quad \theta_1 = 0$$

This yields h(x) = 1.5 + 0x. 0x means no slope, and y will always be the constant 1.5. This looks like:

Figure: Graph of the hypothesis with no slope.

Now let us consider

$$\theta_0 = 1, \quad \theta_1 = 0.5$$

which yields h(x) = 1 + 0.5x. This looks like:

Figure: Graph of the hypothesis with a slope.
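The two hypotheses above can be evaluated with a few lines of Python. This is only a minimal sketch: the function name `hypothesis` and the sample inputs are illustrative assumptions, not part of the original example.

```python
def hypothesis(theta0, theta1, x):
    """Return h(x) = theta0 + theta1 * x for a single input x."""
    return theta0 + theta1 * x

# Sample inputs chosen only for illustration (an assumption).
xs = [0, 1, 2, 3]

# theta0 = 1.5, theta1 = 0: a flat line, every prediction is 1.5
flat = [hypothesis(1.5, 0, x) for x in xs]

# theta0 = 1, theta1 = 0.5: a line with intercept 1 and slope 0.5
sloped = [hypothesis(1, 0.5, x) for x in xs]

print(flat)    # [1.5, 1.5, 1.5, 1.5]
print(sloped)  # [1.0, 1.5, 2.0, 2.5]
```

Changing theta0 shifts the line up or down, while changing theta1 tilts it. These are the two values that gradient descent will later adjust.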

Cost Function

The objective in this case is to find a line of best fit for some given inputs, or X values, and their corresponding outputs, or Y values. A cost function is defined as “a function that maps an event or values of one or more variables onto a real number intuitively representing some ‘cost’ associated with the event.”

Having learned from a known set of inputs and their corresponding outputs, a machine learning model attempts to make predictions for a new set of inputs.

Figure: Machine learning process.

The error is the difference between the predicted value and the actual value.

This relates to the idea of a Cost function or Loss function.

A cost function (or loss function) tells us “how good” our model is at making predictions for a given set of parameters. When plotted against the parameters, the cost function forms a curve, and the slope (gradient) of this curve tells us how to update the parameters to make the model more accurate.

Minimizing the Cost Function

The primary goal of most machine learning algorithms is to minimize the cost function. A lower cost means a lower error between the predicted values and the actual values, which indicates that the algorithm has learned well.

How do we actually minimize any function?

As an illustration, consider a simple cost function of the form Y = X². In a Cartesian coordinate system, this is the equation of a parabola, which can be represented graphically as:

Parabola

Now in order to minimize the function mentioned above, firstly we need to find the value of X which will produce the lowest value of Y (in this case it is the red dot). With lower dimensions (like 2D in this case) it becomes easier to locate the minima but it is not the same while dealing with higher dimensions. For such cases, we need to use the Gradient Descent algorithm to locate the minima.

Now we need a cost function that can be minimized with respect to the parameters over a dataset. The most commonly used one is the mean squared error. It measures the average squared difference between the estimated values (the predictions) and the actual values (the dataset).

Mean Squared Error:

$$MSE = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

It turns out we can adjust the equation a little (dividing by 2m instead of m, so that the factor of 2 cancels when we differentiate) to make the calculation down the track simpler.

Now a question may arise: why do we take the squared differences and not simply the absolute differences? Because the squared differences make it easier to derive a regression line. Indeed, to find that line we need to compute the first derivative of the cost function, and it is much harder to compute the derivative of absolute values than of squared values. Also, squaring magnifies large errors, making the bad predictions more pronounced than the good ones.

The equation looks like -

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Let us apply this cost function to the following data:

Data Set before applying the cost function.

Here we will calculate some of the theta values and then plot the cost function by hand. Since this function passes through (0, 0), we will look only at a single value of theta. Also, let us refer to the cost function as J(ϴ) from now on.

When the value of ϴ is 1, for J(1), we get a 0. You will notice the value of J(1) gives a straight line which fits the data perfectly. Now let us try with ϴ = 0.5

Graph of the data set with the hypothesis for J(0.5)

The MSE function gives us a value of 0.58. Let’s plot both our values so far:

J(1) = 0

J(0.5) = 0.58

Cost function graph with J(1) and J(0.5)

Let us go ahead and calculate some more values of J(ϴ).

Cost function applied data set graph-3

Now if we join the dots carefully, we will get -

Visualizing the cost function J(ϴ)

As we can see, the cost function is at a minimum when theta = 1, which means the initial data is a straight line with a slope or gradient of 1 as shown by the orange line in the above figure.

Using a trial and error method, we minimized J(ϴ). We did all of these by trying out a lot of values and with the help of visualizations. Gradient Descent does the same thing in a much better way, by changing the theta values or parameters until it descends to the minimum value.

The Python code below computes the cost function for three values of theta:

import matplotlib.pyplot as plt
import numpy as np

# original data set
X = [1, 2, 3]
y = [1, 2, 3]

# slope of best_fit_1 is 0.5
# slope of best_fit_2 is 1.0
# slope of best_fit_3 is 1.5

hyps = [0.5, 1.0, 1.5]

# multiply the original X values by the theta
# to produce hypothesis values for each X
def multiply_matrix(mat, theta):
    mutated = []
    for i in range(len(mat)):
        mutated.append(mat[i] * theta)

    return mutated

# calculate cost by looping each sample
# subtract hyp(x) from y
# square the result
# sum them all together
def calc_cost(m, X, y):
    total = 0
    for i in range(m):
        squared_error = (y[i] - X[i]) ** 2
        total += squared_error

    return total * (1 / (2*m))

# calculate cost for each hypothesis
for i in range(len(hyps)):
    hyp_values = multiply_matrix(X, hyps[i])

    print("Cost for ", hyps[i], " is ", calc_cost(len(X), y, hyp_values))
Cost for 0.5 is 0.5833333333333333
Cost for 1.0 is 0.0
Cost for 1.5 is 0.5833333333333333
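The trial and error search described earlier can also be written as a sweep over a grid of theta values. The following is a minimal sketch in plain Python; the function name `cost`, the grid spacing of 0.25 and the variable names are assumptions chosen for illustration.

```python
def cost(theta, xs, ys):
    """Half mean squared error for the one-parameter hypothesis h(x) = theta * x."""
    m = len(xs)
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1, 2, 3]
ys = [1, 2, 3]

# evaluate J(theta) on a grid from 0.0 to 2.0 in steps of 0.25
thetas = [i * 0.25 for i in range(9)]
costs = [cost(t, xs, ys) for t in thetas]

# the grid point with the lowest cost is the best slope found by trial and error
best_theta = thetas[costs.index(min(costs))]
print(best_theta)  # 1.0
```

A grid search like this only works for one or two parameters. Gradient descent replaces the exhaustive sweep with steps guided by the slope of J(ϴ).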

Learning Rate

Let us now start by initializing theta0 and theta1 to any two values, say 0 for both, and go from there. The algorithm is as follows:

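repeat until convergence {

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \quad \text{(for } j = 0 \text{ and } j = 1\text{)}$$

}

where α (alpha) is the learning rate, i.e. how rapidly we want to move towards the minimum. If the value of α is too large, we can overshoot the minimum.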

Big Learning Rate vs Small Learning Rate
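The effect of the learning rate can be seen on a toy problem. The sketch below assumes the function f(x) = x², whose derivative is 2x and whose minimum is at x = 0; the function name `descend` and the two values of alpha are illustrative choices, not values from the article.

```python
def descend(alpha, x_start=1.0, steps=20):
    """Run gradient descent on f(x) = x**2, whose derivative is 2 * x."""
    x = x_start
    for _ in range(steps):
        x = x - alpha * (2 * x)
    return x

small = descend(alpha=0.1)  # shrinks steadily towards the minimum at x = 0
large = descend(alpha=1.1)  # every step overshoots and lands further away

print(abs(small) < 1.0, abs(large) > 1.0)  # True True
```

With alpha = 0.1 each step multiplies x by 0.8, so the iterate approaches 0. With alpha = 1.1 each step multiplies x by -1.2, so the iterate jumps across the minimum and grows in magnitude.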

The derivative which refers to the slope of the function is calculated. Here we calculate the partial derivative of the cost function. It helps us to know the direction (sign) in which the coefficient values should move so that they attain a lower cost on the following iteration. 

Partial derivative of the cost function which we need to calculate:

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

Once we know the direction from the derivative, we can update the coefficient values. Now you need to specify a learning rate parameter which will control how much the coefficients can change on each update.

coefficient = coefficient – (alpha * delta)

This process is repeated until the cost of the coefficients is 0.0 or close enough to zero.

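This turns out to be:

$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$

(Image from Andrew Ng’s machine learning course)

Which gives us linear regression!

repeat until convergence {

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$

}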
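Gradient descent for a straight-line hypothesis h(x) = theta0 + theta1 * x can be sketched as follows. This is a minimal illustration, assuming the half mean squared error cost used earlier; the function name `fit_line` and the defaults for `alpha` and `iterations` are hypothetical choices.

```python
def fit_line(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    theta0, theta1 = 0.0, 0.0  # start both parameters at zero
    m = len(xs)
    for _ in range(iterations):
        # prediction error for every training example
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # partial derivatives of the cost with respect to theta0 and theta1
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # update both parameters from the same set of errors
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

theta0, theta1 = fit_line([1, 2, 3], [1, 2, 3])
print(round(theta0, 3), round(theta1, 3))
```

On the data set used earlier, the fitted intercept should come out very close to 0 and the slope very close to 1, matching the minimum found by trial and error.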

Types of Gradient Descent Algorithms

Gradient descent variants’ trajectory towards the minimum

1. Batch Gradient Descent: In this type of gradient descent, all the training examples are processed for each iteration of gradient descent. It gets computationally expensive if the number of training examples is large. In that case batch gradient descent is not preferred; stochastic gradient descent or mini-batch gradient descent is used instead.

Algorithm for batch gradient descent:

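Let hθ(x) be the hypothesis for linear regression. Then, the cost function is given by:

$$J_{train}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

where Σ represents the sum over all training examples from i = 1 to m.

Repeat {

$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$

For every j = 0 … n

}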

Where xj(i) represents the jth feature of the ith training example. So if m is very large, each update becomes computationally expensive, because the derivative term sums over all m training examples.

2. Stochastic Gradient Descent: The word stochastic refers to a system or process that is linked with random probability. Therefore, in Stochastic Gradient Descent (SGD), samples are selected at random for each iteration instead of using the entire data set. When the number of training examples is very large, batch gradient descent becomes computationally expensive. Stochastic Gradient Descent, however, uses only a single sample, i.e., a batch size of one, to perform each iteration. The data set is randomly shuffled and one sample is selected for each iteration. The parameters are updated after every iteration, even though only one example has been processed. Thus, each update is much cheaper than in batch gradient descent.

Stochastic in Gradient Descent Graph in Machine Learning

Algorithm for stochastic gradient descent:

  1. Firstly shuffle the data set randomly in order to train the parameters evenly for each type of data.
  2. As mentioned above, it takes into consideration one example per iteration.

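Hence,
Let (x(i), y(i)) be the training example

$$Cost(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

$$J_{train}(\theta) = \frac{1}{m} \sum_{i=1}^{m} Cost(\theta, (x^{(i)}, y^{(i)}))$$

Repeat {
For i=1 to m{

$$\theta_j := \theta_j - \alpha \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$

        For every j =0 …n
              }
}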
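The stochastic update can be sketched in plain Python as follows. This is a minimal illustration for a straight-line hypothesis; the function name `sgd_fit`, the learning rate, the number of epochs, the fixed seed and the toy data set are all assumptions made for the example.

```python
import random

def sgd_fit(xs, ys, alpha=0.1, epochs=200, seed=0):
    """Fit h(x) = theta0 + theta1 * x, updating after every single example."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order
        for i in indices:
            # error on one example only, then an immediate parameter update
            error = theta0 + theta1 * xs[i] - ys[i]
            theta0 -= alpha * error
            theta1 -= alpha * error * xs[i]
    return theta0, theta1

xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]  # generated from y = 1 + 2x, with no noise
theta0, theta1 = sgd_fit(xs, ys)
print(round(theta0, 3), round(theta1, 3))
```

Because this toy data contains no noise, the parameters should settle near theta0 = 1 and theta1 = 2. On noisy data the estimates would keep fluctuating around the minimum unless the learning rate is reduced over time.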

3. Mini Batch gradient descent: This type of gradient descent is often faster in practice than both batch gradient descent and stochastic gradient descent. Even if the number of training examples is large, it processes them in batches in one go. Also, the number of iterations is lower, in spite of working with larger training samples.

Mini Batch gradient descent graph in Machine Learning

Algorithm for mini-batch gradient descent:

Let b be the number of examples in one batch, where b < m. Now, assume b = 10 and m = 100.
The batch size can be adjusted. It is generally kept as a power of 2, because some hardware such as GPUs achieves better run time with batch sizes that are powers of 2.

Repeat {

 For i = 1, 11, 21, …, 91

Let Σ be the summation over k from i to i+9.

$$\theta_j := \theta_j - \frac{\alpha}{b} \sum_{k=i}^{i+9} \left(h_\theta(x^{(k)}) - y^{(k)}\right) x_j^{(k)}$$

  For every j = 0 … n
}
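A mini-batch version can be sketched in the same style. The example below is only an illustration: it uses a small data set with m = 8 and a batch size of 4 instead of the m = 100 and b = 10 mentioned above, and the function name `minibatch_fit` and all default values are hypothetical.

```python
import random

def minibatch_fit(xs, ys, batch_size=4, alpha=0.1, epochs=500, seed=0):
    """Fit h(x) = theta0 + theta1 * x, averaging the gradient over each batch."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    theta0, theta1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        # walk through the shuffled data one batch at a time
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [theta0 + theta1 * xs[i] - ys[i] for i in batch]
            grad0 = sum(errors) / len(batch)
            grad1 = sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            theta0 -= alpha * grad0
            theta1 -= alpha * grad1
    return theta0, theta1

xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [1 + 2 * x for x in xs]  # generated from y = 1 + 2x, with no noise
theta0, theta1 = minibatch_fit(xs, ys)
print(round(theta0, 3), round(theta1, 3))
```

Setting `batch_size` to 1 turns this sketch into stochastic gradient descent, and setting it to `len(xs)` turns it into batch gradient descent.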

Convergence trends in different variants of Gradient Descent

For batch gradient descent, the algorithm follows a smooth path towards the minimum. If the cost function is convex, it converges to the global minimum; if the cost function is not convex, it may converge to a local minimum instead. The learning rate is typically held constant here.

Convergence trends in different variants of Gradient Descent in Machine Learning

For stochastic gradient descent and mini-batch gradient descent, the algorithm keeps fluctuating around the minimum instead of settling on it. In order to converge, the learning rate needs to be decreased gradually.

Challenges in executing Gradient Descent

There are many cases where gradient descent fails to perform well. There are mainly three reasons why this happens:

  1. Data challenges
  2. Gradient challenges
  3. Implementation challenges

Data Challenges

  • The arrangement of data sometimes leads to challenges. If it is arranged in such a way that it poses a non-convex optimization problem, then it becomes difficult to perform optimization using gradient descent. Gradient descent works best on well-defined convex optimization problems.
  • During the optimization of a non-convex problem, you will come across several minimal points. The lowest among all the points is called the global minimum, and the other points are called local minima. Ideally you want to reach the global minimum and avoid getting stuck in a local minimum.
  • There is also the saddle point problem. This is a situation where the gradient is zero but the point is not an optimum. It is hard to avoid and is still an active area of research.
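A saddle point is easy to demonstrate on a small example. The sketch below assumes the function f(x, y) = x² - y², which is a standard textbook saddle and is not taken from the article; the function names are illustrative.

```python
def f(x, y):
    """A simple saddle-shaped function: f(x, y) = x**2 - y**2."""
    return x ** 2 - y ** 2

def gradient(x, y):
    """Gradient of f: (df/dx, df/dy) = (2x, -2y)."""
    return (2 * x, -2 * y)

# the gradient vanishes at the origin, so gradient descent would stop here
print(gradient(0.0, 0.0))

# yet the origin is not a minimum: moving along y lowers the function value
print(f(0.0, 0.1) < f(0.0, 0.0))  # True
```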

Gradient Challenges

  • While using gradient descent, if the execution is not proper, it leads to problems such as vanishing or exploding gradients. These happen when the gradient becomes too small or too large, respectively, and in both cases the algorithm fails to converge.

Implementation Challenges

  • Insufficient memory can cause the network to fail. Many neural network practitioners overlook this, but it is very important to look at the resource utilization of the network.
  • It is also important to keep track of things like floating point considerations and hardware/software prerequisites.

Variants of Gradient Descent algorithms

Let us look at some of the most commonly used gradient descent algorithms and how they are implemented.

Vanilla Gradient Descent

One of the simplest forms of gradient descent technique is the Vanilla Gradient Descent. Here, vanilla means pure / without any adulteration. In this algorithm, the main feature is that small steps are taken in the direction of minima by taking the gradient of cost function.

The pseudocode for the same is mentioned below.

update = learning_rate * gradient_of_parameters
parameters = parameters - update

As you can see, the parameters are updated by taking the gradient of the cost function with respect to the parameters and multiplying it by the learning rate, which determines how quickly we move towards the minimum. The learning rate is a hyper-parameter, and you should choose its value carefully.

Vanilla Gradient Descent Graph in Machine Learning

Gradient Descent with Momentum

In this case, we adjust the algorithm so that each step takes the previous step into account.

The pseudocode for the same is mentioned below.

update = learning_rate * gradient
velocity = (previous_velocity * momentum) - update
parameter = parameter + velocity

Here, our update is the same as that of vanilla gradient descent. But we are introducing a new term called velocity, which combines the previous velocity, scaled by a constant called momentum, with the current update.

Gradient Descent with Momentum Update
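A runnable sketch of the momentum idea is shown below. It uses the common formulation in which the velocity accumulates a decaying sum of past gradient steps; the function name `momentum_descent`, the test function f(x) = (x - 3)² and all default values are assumptions made for illustration.

```python
def momentum_descent(grad, x, learning_rate=0.1, momentum=0.9, steps=300):
    """Minimize a one-dimensional function given its derivative `grad`."""
    velocity = 0.0
    for _ in range(steps):
        # keep a fraction of the previous step, then add the new gradient step
        velocity = momentum * velocity - learning_rate * grad(x)
        x = x + velocity
    return x

# f(x) = (x - 3)**2 has derivative 2 * (x - 3) and its minimum at x = 3
x_min = momentum_descent(lambda x: 2 * (x - 3), x=0.0)
print(round(x_min, 3))
```

With a momentum of 0.9 the iterate overshoots the minimum a few times before settling, which is the typical behaviour of this method on a simple bowl-shaped function.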

ADAGRAD

ADAGRAD (Adaptive Gradient Algorithm) uses an adaptive technique to update the learning rate. In this algorithm, we change the learning rate on the basis of how the gradient has been changing over all the previous iterations.

The pseudocode for the same is mentioned below.

grad_component = previous_grad_component + (gradient * gradient)
rate_change = square_root(grad_component) + epsilon
adapted_learning_rate = learning_rate / rate_change
update = adapted_learning_rate * gradient
parameter = parameter - update

In the above code, epsilon is a small constant which keeps the division well defined when the accumulated gradient component is close to zero.
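The ADAGRAD idea can be turned into a short runnable sketch. The version below divides the learning rate by the square root of the accumulated squared gradients, which is the usual formulation; the function name `adagrad_descent`, the test function f(x) = (x - 3)² and the default values are assumptions made for illustration.

```python
import math

def adagrad_descent(grad, x, learning_rate=0.5, epsilon=1e-8, steps=500):
    """Minimize a one-dimensional function given its derivative `grad`."""
    grad_sq_sum = 0.0  # running sum of squared gradients
    for _ in range(steps):
        g = grad(x)
        grad_sq_sum += g * g
        # the effective learning rate shrinks as squared gradients accumulate
        adapted_learning_rate = learning_rate / (math.sqrt(grad_sq_sum) + epsilon)
        x -= adapted_learning_rate * g
    return x

# f(x) = (x - 3)**2 has derivative 2 * (x - 3) and its minimum at x = 3
x_min = adagrad_descent(lambda x: 2 * (x - 3), x=0.0)
print(round(x_min, 3))
```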

ADAM

ADAM is another adaptive technique which is built out of ADAGRAD and further reduces its downside. In simple words you can consider it to be ADAGRAD + momentum.

The pseudocode for the same is mentioned below.

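adapted_gradient = previous_gradient + ((gradient - previous_gradient) * (1 - beta1))

gradient_component = ((gradient * gradient) - previous_gradient_component_average)
gradient_component_average = previous_gradient_component_average + (gradient_component * (1 - beta2))
adapted_learning_rate = learning_rate / (square_root(gradient_component_average) + epsilon)
update = adapted_learning_rate * adapted_gradient
parameter = parameter - update

Here beta1 and beta2 are constants that control how quickly the running averages of the gradient and of the squared gradient change, and epsilon is a small constant that keeps the division well defined. This simplified pseudocode leaves out the bias-correction step of the full algorithm.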
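A runnable sketch of ADAM is shown below. It follows the commonly used formulation with running averages of the gradient and the squared gradient plus bias correction; the function name `adam_descent`, the test function f(x) = (x - 3)² and the default values are assumptions made for illustration.

```python
import math

def adam_descent(grad, x, learning_rate=0.1, beta1=0.9, beta2=0.999,
                 epsilon=1e-8, steps=500):
    """Minimize a one-dimensional function given its derivative `grad`."""
    m, v = 0.0, 0.0  # running averages of the gradient and squared gradient
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # bias correction compensates for starting both averages at zero
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return x

# f(x) = (x - 3)**2 has derivative 2 * (x - 3) and its minimum at x = 3
x_min = adam_descent(lambda x: 2 * (x - 3), x=0.0)
print(round(x_min, 3))
```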

Tips for Gradient Descent

In this section you will learn about some tips and tricks for getting the most out of the gradient descent algorithm for machine learning.

  • Plot Cost versus Time: It is suggested to collect and plot the cost values calculated by the algorithm for each iteration. It helps you keep track of the descent. For a well-performing gradient descent the cost always decreases in each iteration. If you see there is no decrease, reduce the learning rate.
  • Learning Rate: The learning rate value is a small real value such as 0.1, 0.001 or 0.0001. Keep trying different values to check which works best for your algorithm.
  • Rescale Inputs: Try to achieve a range such as [0, 1] or [-1, 1] by rescaling all the input variables. The algorithm reaches the minimum cost faster if the shape of the cost function is not distorted or skewed.
  • Few Passes: Stochastic gradient descent often does not need more than 1-to-10 passes through the training dataset to converge on good or good enough coefficients.
  • Plot Mean Cost: The updates for each training dataset instance can result in a noisy plot of cost over time when using stochastic gradient descent. Try to take the average over 10, 100, or 1000 updates. This will give you a better idea of the learning trend for the algorithm.
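Two of the tips above, rescaling inputs and averaging the cost over several updates, can be sketched as small helper functions. The names `rescale` and `mean_cost` and the sample numbers are hypothetical and serve only as an illustration.

```python
def rescale(values):
    """Min-max rescale a list of numbers to the range [0, 1].

    Assumes the values are not all identical, otherwise the range is zero.
    """
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def mean_cost(costs, window):
    """Average consecutive groups of `window` costs to smooth a noisy curve."""
    return [sum(costs[i:i + window]) / window
            for i in range(0, len(costs) - window + 1, window)]

sizes = [50, 100, 150, 200]  # hypothetical house sizes
print(rescale(sizes))

noisy = [4.0, 6.0, 3.0, 5.0, 2.0, 4.0]  # hypothetical per-update SGD costs
print(mean_cost(noisy, window=2))        # [5.0, 4.0, 3.0]
```

The raw costs go up and down from one update to the next, while the averaged values show a steady downward trend.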

Implementation of Gradient Descent in Python

Now that we have gone through all the elements related to gradient descent, let us implement gradient descent in Python. A simple gradient Descent Algorithm is as follows:

  1. Obtain a function f(x) to minimize
  2. Initialize a value x from which you want to start the descent or optimization
  3. Specify a learning rate which will determine how big a step to descend by, or how quickly you want to converge to the minimum value
  4. Find the derivative of the function at that value of x (the descent direction)
  5. Multiply the derivative by the learning rate and descend by that amount
  6. Update the value of x with the new value descended to
  7. Check your stop condition in order to see whether to stop
  8. If the condition is satisfied, stop. If not, proceed to step 4 with the new x value and keep repeating the algorithm

Let us create an arbitrary loss function and try to find a local minimum value for that function by implementing a simple representation of gradient descent using Python.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

We will apply gradient descent to this function: x³ - 3x² + 5

#creating the function and plotting it

function = lambda x: (x ** 3)-(3*(x ** 2))+5

#Get 500 evenly spaced numbers between -1 and 3 (arbitrarily chosen to ensure steep curve)
x = np.linspace(-1,3,500)

#Plot the curve
plt.plot(x, function(x))
plt.show()

plotting the data set

Here, we can see that our minimum value should be around 2.0
Let us now use the gradient descent to find the exact value

def deriv(x):
    '''
    Description: This function takes in a value of x and returns its derivative based on the
    initial function we specified.

    Arguments:

    x - a numerical value of x

    Returns:

    x_deriv - a numerical value of the derivative of x

    '''

    x_deriv = 3* (x**2) - (6 * (x))
    return x_deriv


def step(x_new, x_prev, precision, l_r):
    '''
    Description: This function takes in an initial or previous value for x, updates it based on
    steps taken via the learning rate and outputs the minimum value of x that reaches the precision satisfaction.

    Arguments:

    x_new - a starting value of x that will get updated based on the learning rate

    x_prev - the previous value of x that is getting updated to the new one

    precision - a precision that determines the stop of the stepwise descent

    l_r - the learning rate (size of each descent step)

    Output:

    1. Prints out the latest new value of x which equates to the minimum we are looking for
    2. Prints out the number of x values which equates to the number of gradient descent steps
    3. Plots a first graph of the function with the gradient descent path
    4. Plots a second graph of the function with a zoomed in gradient descent path in the important area

    '''

    # create lists to which the updated values of x and y will be appended during each iteration
    x_list, y_list = [x_new], [function(x_new)]

    # keep looping until your desired precision
    while abs(x_new - x_prev) > precision:

        # change the value of x
        x_prev = x_new

        # get the negative derivative at the old value of x
        d_x = - deriv(x_prev)

        # get your new value of x by adding the product of the negative derivative and the learning rate to the previous value
        x_new = x_prev + (l_r * d_x)

        # append the new value of x to a list of all x-s for later visualization of path
        x_list.append(x_new)

        # append the new value of y to a list of all y-s for later visualization of path
        y_list.append(function(x_new))

    print ("Local minimum occurs at: "+ str(x_new))
    print ("Number of steps: " + str(len(x_list)))

    plt.subplot(1,2,2)
    plt.scatter(x_list,y_list,c="g")
    plt.plot(x_list,y_list,c="g")
    plt.plot(x,function(x), c="r")
    plt.title("Gradient descent")
    plt.show()

    plt.subplot(1,2,1)
    plt.scatter(x_list,y_list,c="g")
    plt.plot(x_list,y_list,c="g")
    plt.plot(x,function(x), c="r")
    plt.xlim([1.0,2.1])
    plt.title("Zoomed in Gradient descent to Key Area")
    plt.show()

#Implement gradient descent (all the arguments are arbitrarily chosen)
step(0.5, 0, 0.001, 0.05)

Local minimum occurs at: 1.9980265135950486
Number of steps: 25 

Gradient Descent Machine Learning Graph

Zoomed in Gradient Descent to Key Area in Machine Learning

Summary

In this article, you have learned about gradient descent for machine learning. Let us summarize all that we have covered in this article.

  • Optimization is the heart and soul of machine learning.
  • Gradient descent is a simple optimization technique which can be used with other machine learning algorithms.
  • Batch gradient descent refers to calculating the derivative from all training data before calculating an update.
  • Stochastic gradient descent refers to calculating the derivative from each training data instance and calculating the update immediately.

If you are inspired by the opportunities provided by Data Science, enrol in our  Data Science and Machine Learning Courses for more lucrative career options in this landscape.

Priyankur Sarkar

Data Science Enthusiast

Priyankur Sarkar loves to play with data and get insightful results out of it, then turn those data insights and results in business growth. He is an electronics engineer with a versatile experience as an individual contributor and leading teams, and has actively worked towards building Machine Learning capabilities for organizations.

Join the Discussion

Your email address will not be published. Required fields are marked *

Suggested Blogs

Types of Probability Distributions Every Data Science Expert Should know

Data Science has become one of the most popular interdisciplinary fields. It uses scientific approaches, methods, algorithms, and operations to obtain facts and insights from unstructured, semi-structured, and structured datasets. Organizations use these collected facts and insights for efficient production, business growth, and to predict user requirements. Probability distribution plays a significant role in performing data analysis equipping a dataset for training a model. In this article, you will learn about the types of Probability Distribution, random variables, types of discrete distributions, and continuous distribution.  What is Probability Distribution? A Probability Distribution is a statistical method that determines all the probable values and possibilities that a random variable can deliver from a particular range. This range of values will have a lower bound and an upper bound, which we call the minimum and the maximum possible values.  Various factors on which plotting of a value depends are standard deviation, mean (or average), skewness, and kurtosis. All of these play a significant role in Data science as well. We can use probability distribution in physics, engineering, finance, data analysis, machine learning, etc. Significance of Probability distributions in Data Science In a way, most of the data science and machine learning operations are dependent on several assumptions about the probability of your data. Probability distribution allows a skilled data analyst to recognize and comprehend patterns from large data sets; that is, otherwise, entirely random variables and values. Thus, it makes probability distribution a toolkit based on which we can summarize a large data set. The density function and distribution techniques can also help in plotting data, thus supporting data analysts to visualize data and extract meaning. General Properties of Probability Distributions Probability distribution determines the likelihood of any outcome. 
The mathematical expression takes a specific value of x and shows the possibility of a random variable with p(x). Some general properties of the probability distribution are – The total of all probabilities for any possible value becomes equal to 1. In a probability distribution, the possibility of finding any specific value or a range of values must lie between 0 and 1. Probability distributions tell us the dispersal of the values from the random variable. Consequently, the type of variable also helps determine the type of probability distribution.Common Data Types Before jumping directly into explaining the different probability distributions, let us first understand the different types of probability distributions or the main categories of the probability distribution. Data analysts and data engineers have to deal with a broad spectrum of data, such as text, numerical, image, audio, voice, and many more. Each of these have a specific means to be represented and analyzed. Data in a probability distribution can either be discrete or continuous. Numerical data especially takes one of the two forms. Discrete data: They take specific values where the outcome of the data remains fixed. Like, for example, the consequence of rolling two dice or the number of overs in a T-20 match. In the first case, the result lies between 2 and 12. In the second case, the event will be less than 20. Different types of discrete distributions that use discrete data are: Binomial Distribution Hypergeometric Distribution Geometric Distribution Poisson Distribution Negative Binomial Distribution Multinomial Distribution  Continuous data: It can obtain any value irrespective of bound or limit. Example: weight, height, any trigonometric value, age, etc. 
Different types of continuous distributions that use continuous data are: Beta distribution Cauchy distribution Exponential distribution Gamma distribution Logistic distribution Weibull distribution Types of Probability Distribution explained Here are some of the popular types of Probability distributions used by data science professionals. (Try all the code using Jupyter Notebook) Normal Distribution: It is also known as Gaussian distribution. It is one of the simplest types of continuous distribution. This probability distribution is symmetrical around its mean value. It also shows that data at close proximity of the mean is frequently occurring, compared to data that is away from it. Here, mean = 0, variance = finite valueHere, you can see 0 at the center is the Normal Distribution for different mean and variance values. Here is a code example showing the use of Normal Distribution: from scipy.stats import norm  import matplotlib.pyplot as mpl  import numpy as np  def normalDist() -> None:      fig, ax = mpl.subplots(1, 1)      mean, var, skew, kurt = norm.stats(moments = 'mvsk')      x = np.linspace(norm.ppf(0.01),  norm.ppf(0.99), 100)      ax.plot(x, norm.pdf(x),          'r-', lw = 5, alpha = 0.6, label = 'norm pdf')      ax.plot(x, norm.cdf(x),          'b-', lw = 5, alpha = 0.6, label = 'norm cdf')      vals = norm.ppf([0.001, 0.5, 0.999])      np.allclose([0.001, 0.5, 0.999], norm.cdf(vals))      r = norm.rvs(size = 1000)      ax.hist(r, normed = True, histtype = 'stepfilled', alpha = 0.2)      ax.legend(loc = 'best', frameon = False)      mpl.show()  normalDist() Output: Bernoulli Distribution: It is the simplest type of probability distribution. It is a particular case of Binomial distribution, where n=1. It means a binomial distribution takes 'n' number of trials, where n > 1 whereas, the Bernoulli distribution takes only a single trial.   
Probability Mass Function of a Bernoulli’s Distribution is:  where p = probability of success and q = probability of failureHere is a code example showing the use of Bernoulli Distribution: from scipy.stats import bernoulli  import seaborn as sb    def bernoulliDist():      data_bern = bernoulli.rvs(size=1200, p = 0.7)      ax = sb.distplot(          data_bern,           kde = True,           color = 'g',           hist_kws = {'alpha' : 1},          kde_kws = {'color': 'y', 'lw': 3, 'label': 'KDE'})      ax.set(xlabel = 'Bernouli Values', ylabel = 'Frequency Distribution')  bernoulliDist() Output:Continuous Uniform Distribution: In this type of continuous distribution, all outcomes are equally possible; each variable gets the same probability of hit as a consequence. This symmetric probabilistic distribution has random variables at an equal interval, with the probability of 1/(b-a). Here is a code example showing the use of Uniform Distribution: from numpy import random  import matplotlib.pyplot as mpl  import seaborn as sb  def uniformDist():      sb.distplot(random.uniform(size = 1200), hist = True)      mpl.show()  uniformDist() Output: Log-Normal Distribution: A Log-Normal distribution is another type of continuous distribution of logarithmic values that form a normal distribution. We can transform a log-normal distribution into a normal distribution. 
Here is a code example showing the use of Log-Normal Distribution import matplotlib.pyplot as mpl  def lognormalDist():      muu, sig = 3, 1      s = np.random.lognormal(muu, sig, 1000)      cnt, bins, ignored = mpl.hist(s, 80, normed = True, align ='mid', color = 'y')      x = np.linspace(min(bins), max(bins), 10000)      calc = (np.exp( -(np.log(x) - muu) **2 / (2 * sig**2))             / (x * sig * np.sqrt(2 * np.pi)))      mpl.plot(x, calc, linewidth = 2.5, color = 'g')      mpl.axis('tight')      mpl.show()  lognormalDist() Output: Pareto Distribution: It is one of the most critical types of continuous distribution. The Pareto Distribution is a skewed statistical distribution that uses power-law to describe quality control, scientific, social, geophysical, actuarial, and many other types of observable phenomena. The distribution shows slow or heavy-decaying tails in the plot, where much of the data reside at its extreme end. Here is a code example showing the use of Pareto Distribution – import numpy as np  from matplotlib import pyplot as plt  from scipy.stats import pareto  def paretoDist():      xm = 1.5        alp = [2, 4, 6]       x = np.linspace(0, 4, 800)      output = np.array([pareto.pdf(x, scale = xm, b = a) for a in alp])      plt.plot(x, output.T)      plt.show()  paretoDist() Output:Exponential Distribution: It is a type of continuous distribution that determines the time elapsed between events (in a Poisson process). Let’s suppose, that you have the Poisson distribution model that holds the number of events happening in a given period. 
We can model the time between each birth using an exponential distribution.Here is a code example showing the use of Pareto Distribution – from numpy import random  import matplotlib.pyplot as mpl  import seaborn as sb  def expDist():      sb.distplot(random.exponential(size = 1200), hist = True)      mpl.show()   expDist()Output:Types of the Discrete probability distribution – There are various types of Discrete Probability Distribution a Data science aspirant should know about. Some of them are – Binomial Distribution: It is one of the popular discrete distributions that determine the probability of x success in the 'n' trial. We can use Binomial distribution in situations where we want to extract the probability of SUCCESS or FAILURE from an experiment or survey which went through multiple repetitions. A Binomial distribution holds a fixed number of trials. Also, a binomial event should be independent, and the probability of obtaining failure or success should remain the same. Here is a code example showing the use of Binomial Distribution – from numpy import random  import matplotlib.pyplot as mpl  import seaborn as sb    def binomialDist():      sb.distplot(random.normal(loc = 50, scale = 6, size = 1200), hist = False, label = 'normal')      sb.distplot(random.binomial(n = 100, p = 0.6, size = 1200), hist = False, label = 'binomial')      plt.show()    binomialDist() Output:Geometric Distribution: The geometric probability distribution is one of the crucial types of continuous distributions that determine the probability of any event having likelihood ‘p’ and will happen (occur) after 'n' number of Bernoulli trials. Here 'n' is a discrete random variable. In this distribution, the experiment goes on until we encounter either a success or a failure. The experiment does not depend on the number of trials. 
Here is a code example showing the use of Geometric Distribution – import matplotlib.pyplot as mpl  def probability_to_occur_at(attempt, probability):      return (1-p)**(attempt - 1) * probability  p = 0.3  attempt = 4  attempts_to_show = range(21)[1:]  print('Possibility that this event will occur on the 7th try: ', probability_to_occur_at(attempt, p))  mpl.xlabel('Number of Trials')  mpl.ylabel('Probability of the Event')  barlist = mpl.bar(attempts_to_show, height=[probability_to_occur_at(x, p) for x in attempts_to_show], tick_label=attempts_to_show)  barlist[attempt].set_color('g')  mpl.show() Output:Poisson Distribution: Poisson distribution is one of the popular types of discrete distribution that shows how many times an event has the possibility of occurrence in a specific set of time. We can obtain this by limiting the Bernoulli distribution from 0 to infinity. Data analysts often use the Poisson distributions to comprehend independent events occurring at a steady rate in a given time interval. Here is a code example showing the use of Poisson Distribution from scipy.stats import poisson  import seaborn as sb  import numpy as np  import matplotlib.pyplot as mpl  def poissonDist():       mpl.figure(figsize = (10, 10))      data_binom = poisson.rvs(mu = 3, size = 5000)      ax = sb.distplot(data_binom, kde=True, color = 'g',                       bins=np.arange(data_binom.min(), data_binom.max() + 1),                       kde_kws={'color': 'y', 'lw': 4, 'label': 'KDE'})      ax.set(xlabel = 'Poisson Distribution', ylabel='Data Frequency')      mpl.show()      poissonDist() Output:Multinomial Distribution: A multinomial distribution is another popular type of discrete probability distribution that calculates the outcome of an event having two or more variables. The term multi means more than one. The Binomial distribution is a particular type of multinomial distribution with two possible outcomes - true/false or heads/tails. 
Here is a code example showing the use of Multinomial Distribution –

    import numpy as np
    import matplotlib.pyplot as mpl

    np.random.seed(99)
    n = 12
    pvalue = [0.3, 0.46, 0.24]   # outcome probabilities; they must sum to 1
    s = []
    p = []
    for size in np.logspace(2, 3):
        outcomes = np.random.multinomial(n, pvalue, size=int(size))
        # Fraction of simulated experiments whose outcome counts are exactly (7, 2, 3)
        prob = sum((outcomes[:,0] == 7) & (outcomes[:,1] == 2) & (outcomes[:,2] == 3))/len(outcomes)
        p.append(prob)
        s.append(int(size))
    fig1 = mpl.figure()
    mpl.plot(s, p, 'o-')
    mpl.plot(s, [0.0051]*len(s), '--r')   # theoretical probability of (7, 2, 3), about 0.0051
    mpl.grid()
    mpl.xlim(left = 0)
    mpl.xlabel('Number of Events')
    mpl.ylabel('Function p(X = K)')
    mpl.show()

Output: [figure omitted]

Negative Binomial Distribution: It is also a type of discrete probability distribution. It is also known as the Pascal distribution, where the random variable tells us the number of repeated Bernoulli trials (or, in some conventions, the number of failures) needed to obtain a specified number of successes.

Here is a code example showing the use of Negative Binomial Distribution –

    import matplotlib.pyplot as mpl
    import numpy as np
    from scipy.stats import nbinom

    x = np.arange(0, 7)     # the distribution is defined on non-negative integers
    gr, kr = 0.3, 0.7       # shape parameters n and p passed to nbinom
    # cdf is used here because ppf expects probabilities in [0, 1], not counts
    g = nbinom.cdf(x, gr, kr)   # cumulative probabilities
    s = nbinom.pmf(x, gr, kr)   # point probabilities
    mpl.plot(x, g, "*", x, s, "r--")
    mpl.show()

Output: [figure omitted]

Apart from these distribution types, various other probability distributions exist that data science professionals can use to model their datasets reliably. In the next topic, we will understand some interconnections & relationships between various types of probability distributions.

Relationship between various Probability distributions –
It is surprising to see that different types of probability distributions are interconnected. In the chart shown below, the dashed lines stand for limiting (approximate) connections between two families of distribution, whereas the solid lines show the exact relationship between them in terms of transformation, variable, type, etc.
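One of these limiting relationships can be checked numerically. The sketch below is only an illustration, not part of the original chart: it compares the Binomial distribution with n trials and success probability mu/n against the Poisson distribution with mean mu, and shows that the largest gap between the two probability mass functions shrinks as n grows. The function names (binomial_pmf, poisson_pmf, max_pmf_gap) and the choice mu = 3 are assumptions made for this example; only the Python standard library is used.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    # Probability of exactly k successes in n independent trials
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, mu):
    # Probability of exactly k events when the mean count is mu
    return exp(-mu) * mu**k / factorial(k)

def max_pmf_gap(n, mu, k_max=20):
    # Largest absolute difference between the two PMFs for k = 0..k_max,
    # with the binomial success probability chosen so that n * p = mu
    p = mu / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mu))
               for k in range(k_max + 1))

# The gap shrinks as the number of trials grows while n * p stays at 3
gaps = {n: max_pmf_gap(n, mu=3.0) for n in (10, 100, 1000)}
for n, gap in gaps.items():
    print(n, gap)
```

Holding n * p fixed while increasing n is what makes this a limiting connection rather than an exact one: for any finite n the two distributions differ slightly.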
Conclusion
Probability distributions are prevalent among data analysts and data science professionals because of their wide usage. Today, companies and enterprises hire data science professionals in many sectors, namely, computer science, health, insurance, engineering, and even social science, where probability distributions appear as fundamental tools. It is essential for data analysts and data scientists to know the core of statistics. Probability distributions play a key role in analyzing data and preparing a dataset to train algorithms efficiently. If you want to learn more about data science - particularly probability distributions and their uses, check out KnowledgeHut's comprehensive Data science course.

Role of Unstructured Data in Data Science

Data has become the new game changer for businesses. Typically, data scientists categorize data into three broad divisions - structured, semi-structured, and unstructured data. In this article, you will get to know about unstructured data, sources of unstructured data, unstructured data vs. structured data, the use of structured and unstructured data in machine learning, and the difference between structured and unstructured data. Let us first understand what unstructured data is, with examples.

What is unstructured data?
Unstructured data is data that has no organized form or predefined type. Videos, texts, images, document files, audio materials, email contents and more are considered to be unstructured data. It is the most abundant form of business data, and it does not fit neatly into a structured or relational database. Some examples of unstructured data are the photos we post on social media platforms, the tagging we do, the multimedia files we upload, and the documents we share. Seagate predicts that the global data-sphere will expand to 163 zettabytes by 2025, where most of the data will be in the unstructured format.

Characteristics of Unstructured Data
Unstructured data cannot be organized in a predefined fashion and does not follow a homogeneous data model. This makes it difficult to manage. Apart from that, these are the other characteristics of unstructured data.
- You cannot store unstructured data in the form of rows and columns as we do in a database table.
- Unstructured data is heterogeneous in structure and does not have any specific data model.
- The creation of such data does not follow any fixed semantics or conventions.
- Due to the lack of any particular sequence or format, it is difficult to manage.
- Such data does not have an identifiable structure.

Sources of Unstructured Data
There are various sources of unstructured data.
Some of them are:
- Content websites
- Social networking sites
- Online images
- Memos
- Reports and research papers
- Documents, spreadsheets, and presentations
- Audio mining, chatbots
- Surveys
- Feedback systems

Advantages of Unstructured Data
Unstructured data has become much easier to store because of databases such as MongoDB and Cassandra, or even plain JSON. Modern NoSQL databases and software allow data engineers to collect and extract data from various sources. There are numerous benefits that enterprises and businesses can gain from unstructured data. These are:
- Support for unstructured data lets us store data that lacks a proper format or structure.
- There is no fixed schema or data structure for storing such data, which gives flexibility in storing data of different genres.
- Unstructured data is much more portable by nature.
- Unstructured data is scalable and flexible to store.
- Database systems like MongoDB, Cassandra, etc., can easily handle the heterogeneous properties of unstructured data.
- Different applications and platforms produce unstructured data that becomes useful in business intelligence, unstructured data analytics, and various other fields.
- Unstructured data analysis allows finding comprehensive data stories from data like email contents, website information, social media posts, mobile data, cache files and more.
- Unstructured data, along with data analytics, helps companies improve customer experience.
- Detecting the tastes and choices of consumers becomes easier with unstructured data analysis.

Disadvantages of Unstructured data
- Storing and managing unstructured data is difficult because there is no proper structure or schema.
- Data indexing is also a substantial challenge, and the resulting indexes are often unclear because of the disorganized nature of the data.
- Search results from an unstructured dataset are also less accurate because it does not have predefined attributes.
- Data security is also a challenge due to the heterogeneous form of data.
Problems faced and solutions for storing unstructured data
Until recently, it was challenging to store, evaluate, and manage unstructured data. But with the advent of modern data analysis tools, algorithms, CAS (content addressable storage) systems, and big data technologies, storage and evaluation have become easier. Let us first take a look at the various challenges involved in storing unstructured data.
- Storing unstructured data requires a large amount of space.
- Indexing unstructured data is a laborious task.
- Database operations such as deleting and updating become difficult because of the disorganized nature of the data.
- Storing and managing video, audio, image files, emails, and social media data is also challenging.
- Unstructured data increases the storage cost.

For solving such issues, there are some particular approaches. These are:
- A CAS system helps in storing unstructured data efficiently.
- We can preserve unstructured data in XML format.
- Developers can store unstructured data in an RDBMS that supports BLOBs.
- We can convert unstructured data into flexible formats so that evaluation and storage become easy.

Let us now understand the differences between unstructured data vs. structured data.

Unstructured Data Vs. Structured Data
In this section, we will understand the difference between structured and unstructured data with examples.
STRUCTURED | UNSTRUCTURED
Structured data resides in an organized format in a typical database. | Unstructured data cannot reside in an organized format, and hence we cannot store it in a typical database.
We can store structured data in SQL database tables having rows and columns. | Storing and managing unstructured data requires specialized databases, along with a variety of business intelligence and analytics applications.
It is tough to scale a database schema. | It is highly scalable.
Structured data gets generated in colleges, universities, banks, companies where people have to deal with names, date of birth, salary, marks and so on. | We generate or find unstructured data in social media platforms, emails, analyzed data for business intelligence, call centers, chatbots and so on.
Queries in structured data allow complex joining. | Unstructured data allows only textual queries.
The schema of a structured dataset is less flexible and dependent. | An unstructured dataset is flexible but does not have any particular schema.
It has various concurrency techniques. | It has no concurrency techniques.
We can use SQL, MySQL, SQLite, Oracle DB, Teradata to store structured data. | We can use NoSQL (Not Only SQL) to store unstructured data.

Types of Unstructured Data
Do you have any idea just how much unstructured data we produce and from what sources? Unstructured data includes all those forms of data that we cannot actively manage in an RDBMS, which is a transactional system. We can store structured data in the form of records. But this is not the case with unstructured data. Before the advent of object-based storage, most of the unstructured data was stored in file-based systems. Here are some of the types of unstructured data.

Rich media content: Entertainment files, surveillance data, multimedia email attachments, geospatial data, audio files (call center and other recorded audio), weather reports (graphical), etc., come under this genre.
Document data: Invoices, text-file records, email contents, productivity applications, etc., are included under this genre.

Internet of Things (IoT) data: Ticker data, sensor data, and data from other IoT devices come under this genre.

Apart from all these, data from business intelligence and analysis, machine learning datasets, and artificial intelligence training datasets also form a separate genre of unstructured data.

Examples of Unstructured Data
There are various sources from where we can obtain unstructured data. The prominent use of this data is in unstructured data analytics. Let us now look at some examples of unstructured data and their sources –

Healthcare industries generate a massive volume of human as well as machine-generated unstructured data. Human-generated unstructured data could be in the form of patient-doctor or patient-nurse conversations, which are usually recorded in audio or text formats. Unstructured data generated by machines includes emergency video camera footage, data from surgical robots, and data accumulated from medical imaging devices like endoscopes, laparoscopes and more.

Social media is an intrinsic part of our daily life. Billions of people come together to join channels, share different thoughts, and exchange information with their loved ones. They create and share such data over social media platforms in the form of images, video clips, audio messages, tagging people (this helps companies to map relations between two or more people), entertainment data, educational data, geolocations, texts, etc. Other kinds of data generated from social media platforms are behavior patterns, perceptions, influencers, trends, news, and events.

Business and corporate documents generate a multitude of unstructured data such as emails, presentations, reports containing texts and images, video contents, feedback and much more.
These documents help to create knowledge repositories within an organization that support better-informed operations.

Live chat, video conferencing, web meetings, chatbot-customer messages, and surveillance data are other prominent examples of unstructured data that companies can use to get more detailed insights about a person.

Some prominent examples of unstructured data used in enterprises and organizations are:
- Reports and documents, like Word files or PDF files
- Multimedia files, such as audio, images, designed texts, themes, and videos
- System logs
- Medical images
- Flat files
- Scanned documents (which are images that hold numbers and text and can be processed with, for example, OCR)
- Biometric data

Unstructured Data Analytics Tools
You might be wondering what tools can be used to gather and analyze information that does not have a predefined structure or model. Various tools and programming languages use structured and unstructured data for machine learning and data analysis. These are:
- Tableau
- MonkeyLearn
- Apache Spark
- SAS
- MS Excel
- RapidMiner
- KNIME
- QlikView
- Python programming
- R programming

Many cloud services (like Amazon AWS, Microsoft Azure, IBM Cloud, Google Cloud) also offer unstructured data analysis solutions bundled with their services.

How to analyze unstructured data?
In the past, the process of storing and analyzing unstructured data was not well defined. Enterprises used to carry out this kind of analysis manually. But with the advent of modern tools and programming languages, most unstructured data analysis methods have become highly advanced. AI-powered tools use algorithms designed specifically to break down unstructured data for analysis. Unstructured data analytics tools, along with Natural language processing (NLP) and machine learning algorithms, help advanced software analyze and extract analytical data from unstructured datasets.
Before using these tools for analyzing unstructured data, you must go through a few steps and keep these points in mind.

Set a clear goal for analyzing the data: It is essential to be clear about what insights you want to extract from your unstructured data. Knowing this will help you decide what type of data you are planning to accumulate.

Collect relevant data: Unstructured data is available everywhere, whether it's a social media platform, online feedback or reviews, or a survey form. Depending on the goal set in the previous step, you have to be precise about what data you want to collect in real-time. Also, keep in mind whether the details you collect are relevant or not.

Clean your data: Data cleaning or data cleansing is a significant process that detects corrupt or irrelevant data in the dataset, followed by modifying or deleting the coarse and sloppy data. This phase is also known as the data-preprocessing phase, where you have to reduce the noise, carry out data slicing for meaningful representation, and remove unnecessary data.

Use technology and tools: Once you have performed the data cleaning, it is time to use unstructured data analysis tools to prepare and draw insights from your data. Technologies used for unstructured data storage (NoSQL) can help in managing your flow of data. Other tools and programming libraries like Tableau, Matplotlib, Pandas, and Google Data Studio allow us to extract and visualize unstructured data. Data can be visualized and presented in the form of compelling graphs, plots, and charts.

How to Extract information from Unstructured Data?
With the growth in digitization during the information era, repetitive transactions cause data flooding. The exponential increase in the speed of digital data creation has brought a whole new domain of understanding user interaction with the online world.
According to Gartner, 80% of the data created by an organization or its application is unstructured. Extracting exact information from such unorganized data through analysis is not yet fully possible, and even obtaining a decent overview of it is quite tough. Until now, there are no perfect tools to analyze unstructured data. But algorithms and tools designed using machine learning, Natural language processing, Deep learning, and Graph Analysis (a mathematical method for estimating graph structures) help us to get the upper hand in extracting information from unstructured data. Other neural network models, like modern language models, follow unsupervised learning techniques to gain a good 'knowledge' about the unstructured dataset before going into a specific supervised learning step. AI-based algorithms and technologies are capable of extracting keywords, locations, and phone numbers, and of analyzing image meaning (through digital image processing). We can then understand what to evaluate and identify information that is essential to the business.

Conclusion
Unstructured data is found abundantly in sources like documents, records, emails, social media posts, feedback, call records, log-in session data, video, audio, and images. Manually analyzing unstructured data is very time-consuming and can be very boring at the same time. With the growth of data science and machine learning algorithms and models, it has become easier to gather and analyze insights from unstructured information. According to some research, data analytics tools like MonkeyLearn Studio, Tableau, RapidMiner help analyze unstructured data 1200x faster than the manual approach. Analyzing such data will help you learn more about your customers as well as competitors. Text analysis software, along with machine learning models, will help you dig deep into such datasets and gain an in-depth understanding of the overall scenario with fine-grained analyses.
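As a small illustration of extracting phone numbers and keywords from free text, here is a minimal rule-based sketch. It is not one of the AI-based tools mentioned in this article: it uses plain regular expressions and word counting from the Python standard library as a simple stand-in. The sample text, the phone number format (three groups of digits separated by hyphens), the stopword list, and the function names are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical piece of unstructured text (made up for this example)
SAMPLE = (
    "Customer called from 555-123-4567 about a late delivery. "
    "Delivery was late again; call back at 555-987-6543 about the delivery refund."
)

# Assumed phone format: 3 digits, 3 digits, 4 digits, separated by hyphens
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

# Tiny illustrative stopword list; real tools use much larger ones
STOPWORDS = {"a", "about", "again", "at", "back", "from", "the", "was"}

def extract_phone_numbers(text):
    # Return every substring that matches the assumed phone format
    return PHONE_PATTERN.findall(text)

def top_keywords(text, k=2):
    # Lowercase the text, keep alphabetic words, drop stopwords, count the rest
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(k)

print(extract_phone_numbers(SAMPLE))
print(top_keywords(SAMPLE))
```

Rules like these only work when the format is known in advance, which is why the machine learning and NLP approaches described above are used for messier data.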

What Is Statistical Analysis and Its Business Applications?

Statistics is a science concerned with collection, analysis, interpretation, and presentation of data. In Statistics, we generally want to study a population. You may consider a population as a collection of things, persons, or objects under experiment or study. It is usually not possible to gain access to all of the information from the entire population due to logistical reasons. So, when we want to study a population, we generally select a sample. In sampling, we select a portion (or subset) of the larger population and then study the portion (or the sample) to learn about the population. Data is the result of sampling from a population.

Major Classification
There are two basic branches of Statistics – Descriptive and Inferential statistics. Let us understand the two branches in brief.

Descriptive statistics
Descriptive statistics involves organizing and summarizing the data for better and easier understanding. Unlike Inferential statistics, Descriptive statistics seeks only to describe the data; it does not attempt to draw inferences from the sample about the whole population. We simply describe the data in a sample. Unlike Inferential statistics, it is not built on probability theory. Descriptive statistics is further broken into two categories – Measures of Central Tendency and Measures of Variability.

Inferential statistics
Inferential statistics is the method of estimating a population parameter based on sample information. It uses measurements from sample groups in an experiment to compare the groups and to make generalizations about the larger population. Please note that inferential statistics is effective and valuable only when examining each member of the group is difficult.

Let us understand Descriptive and Inferential statistics with the help of an example.

Task – Suppose, you need to calculate the score of the players who scored a century in a cricket tournament.
Solution: Using Descriptive statistics you can get the desired results.

Task – Now, you need the overall score of the players who scored a century in the cricket tournament.
Solution: Applying the knowledge of Inferential statistics will help you in getting your desired results.

Top Five Considerations for Statistical Data Analysis
Data can be messy. Even a small blunder may cost you a fortune. Therefore, special care when working with statistical data is of utmost importance. Here are a few key points you must consider to minimize errors and improve accuracy.
- Define the purpose of the analysis and determine where the results will be published.
- Understand the resources available to undertake the investigation.
- Assess your own ability to carry out and interpret the analysis appropriately.
- Determine whether there is a need to repeat the process.
- Know the expectations of those evaluating the work, such as reviewers, committees, and supervisors.

Statistics and Parameters
Determining the sample size requires understanding statistics and parameters. Because the two are very closely related, they are often confused and sometimes hard to distinguish.

Statistics
A statistic is a numerical value calculated from a sample. It is used to estimate the corresponding value for the whole population.

A parameter is a fixed and usually unknown numerical value that describes the entire population. The most commonly used parameters are:
- Mean
- Median
- Mode

Mean: The mean is the average value of a data sample or a population. It is also referred to as the expected value.
Formula: Sum of all the observations / the number of observations.
Experimental data set: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20
Calculating mean: (2 + 4 + 6 + 8 + 10 + 12 + 14 + 16 + 18 + 20)/10 = 110/10 = 11

Median: In statistics, the median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution.
It’s the mid-value obtained by arranging the data in increasing or decreasing order.
Formula: Let n be the number of values in the data set (arranged in increasing order).

When n is odd: Median = ((n + 1)/2)th term
Case-I: (n is odd)
Experimental data set = 1, 2, 3, 4, 5
Median (n = 5) = [(5 + 1)/2]th term
    = (6/2)th term
    = 3rd term
Therefore, the median is 3

When n is even: Median = [(n/2)th term + (n/2 + 1)th term]/2
Case-II: (n is even)
Experimental data set = 1, 2, 3, 4, 5, 6
Median (n = 6) = [(n/2)th term + (n/2 + 1)th term]/2
    = [(6/2)th term + (6/2 + 1)th term]/2
    = (3rd term + 4th term)/2
    = (3 + 4)/2
    = 7/2
    = 3.5
Therefore, the median is 3.5

Mode: The mode is the value that appears most often in a set of data or a population.
Experimental data set = 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 6
Mode = 3 (Since 3 is the most repeated element in the sequence.)

Terms Used to Describe Data
When working with data, you will need to search, inspect, and characterize them. To describe data in a simple and consistent way, we use a few statistical terms to denote them individually or in groups. The most frequently used terms include data point, quantitative variable, indicator, statistic, time-series data, variable, data aggregation, time series, dataset, and database. Let us define some of them in brief:
- Data points: The individual numerical observations that are collected and organized for interpretation.
- Quantitative variables: Variables that present information in numerical form.
- Indicator: A measure that describes the state of a community's socio-economic environment.
- Time-series data: Data recorded sequentially over time.
- Data aggregation: The grouping of data points into a data set.
- Database: An organized collection of information for examination and retrieval.
- Time-series: A set of measurements of a variable recorded over a specified period of time.

Step-by-Step Statistical Analysis Process
The statistical analysis process involves five steps followed one after another.
Step 1: Design the study and find the population of the study.
Step 2: Collect data as samples.
Step 3: Describe the data in the sample.
Step 4: Make inferences with the help of samples and calculations.
Step 5: Take action.

Data distribution
A data distribution is a listing or function that displays all the possible values of the data. It shows how frequently each value occurs. A distribution is usually presented in ascending order, or as charts and graphs that make the values and their frequencies visible. The distribution function that describes the density of the values is known as the probability density function.

Percentiles in data distribution
A percentile is the value in a distribution below which a specified percentage of observations fall. Let us understand percentiles with the help of an example. Suppose you have scored in the 90th percentile on a math test. A basic interpretation is that only about 10% of the scores were higher than yours. The median is the 50th percentile because 50% of the values lie below it and 50% lie above it.

Dispersion
Dispersion describes how widely the values of a specific variable are spread. It is measured by statistics such as range, variance, and standard deviation. For instance, a high dispersion means the data are widely scattered, while a low dispersion means the data are tightly clustered.

Histogram
The histogram is a pictorial display that arranges a group of data points into user-specified ranges. A histogram summarizes a data series into an easily interpreted graphic by taking many data points and combining them into reasonable ranges. It groups the range of outcomes into columns along the x-axis. The y-axis displays the count or percentage of data for each column and is used to picture data distributions.

Bell Curve distribution
The bell curve is a pictorial representation of the normal probability distribution, whose standard deviations from the mean create the curved, bell-like shape.
The peak of the curve represents the most likely event in a set of data. The other possible outcomes are symmetrically distributed around the mean, making a downward-sloping curve on both sides of the peak. The width of the curve is determined by the standard deviation.

Hypothesis testing
Hypothesis testing is a process where analysts test an assumption about a population parameter. It aims to evaluate the plausibility of a hypothesis using sample data. The five steps involved in hypothesis testing are:
- Identify the null hypothesis. (A null hypothesis states that there is no effect, relationship, or difference among the factors under study.)
- Identify the alternative hypothesis.
- Establish the significance level of the test.
- Calculate the test statistic and the corresponding P-value. The P-value is the probability of obtaining a sample statistic at least as extreme as the one observed, assuming the null hypothesis is true.
- Draw a conclusion and interpret it as a statement about the alternative hypothesis.

Types of variables
A variable is any number, amount, or characteristic that is countable or measurable. Simply put, it is a characteristic that varies. The six types of variables include the following:

Dependent variable
A dependent variable has values that vary according to the value of another variable known as the independent variable.

Independent variable
An independent variable, on the other hand, is controlled by the researcher. Its values are recorded and compared.

Intervening variable
An intervening variable explains the underlying relationship between variables.

Moderator variable
A moderator variable affects the strength of the relationship between dependent and independent variables.

Control variable
A control variable is anything that is held constant in a research study. Its values stay the same throughout the experiment.

Extraneous variable
Extraneous variables are all the variables that are not under study but can affect experimental outcomes.

Chi-square test
The chi-square test measures how a model compares to actual observed data.
The data must be random, raw, mutually exclusive, obtained from independent variables, and drawn from a sufficiently large sample. The test relates the size of any discrepancies between the expected outcomes and the actual outcomes to the sample size and the number of variables in the relationship.

Types of Frequencies
Frequency refers to the number of times a value occurs in an experiment in a given time. Three types of frequency distribution include the following:
- Grouped, ungrouped
- Cumulative, relative
- Relative cumulative frequency distribution

Features of Frequencies
- The calculation of central tendency and position (median, mean, and mode).
- The measure of dispersion (range, variance, and standard deviation).
- Degree of symmetry (skewness).
- Peakedness (kurtosis).

Correlation Matrix
The correlation matrix is a table that shows the correlation coefficients between variables. It is a powerful tool that summarizes a dataset and helps reveal patterns in the provided data. A correlation matrix includes rows and columns that display the variables. Additionally, the correlation matrix is often used in conjunction with other types of statistical analysis.

Inferential Statistics
Inferential statistics uses random samples of data to describe a population and to make inferences about it. It is used when examining each individual of a whole group is not feasible.

Applications of Inferential Statistics
Educational research: In educational research, it is usually not possible to sample the entire population about which conclusions are to be drawn. For instance, the aim of a study may be to find out whether a new method of learning mathematics improves mathematical achievement for all students in a class.

Marketing organizations: Marketing organizations use inferential statistics to carry out surveys and ask questions. It is because surveying every individual about a product is not feasible.
Finance departments: Financial departments apply inferential statistics to forecast budgets and resource expenses, especially when there are several uncertain factors. Since economists cannot measure everything directly, they rely on probability.

Economic planning: In economic planning, there are powerful methods like index numbers, time series analysis, and estimation. Inferential statistics is used to measure national income and its components. It gathers information about revenue, investment, saving, and spending to establish links among them.

Key Takeaways
- Statistical analysis is the gathering and interpretation of data to uncover patterns and trends.
- Two divisions of statistical analysis are statistical and non-statistical analyses.
- Descriptive and Inferential statistics are the two main categories of statistical analysis. Descriptive statistics describe data, whereas Inferential statistics compare differences between sample groups.
- Statistics aims to teach individuals how to use limited samples to draw informed and accurate conclusions about a large group.
- Mean, median, and mode are the statistical analysis parameters used to measure central tendency.

Conclusion
Statistical analysis is the procedure of gathering and examining data to recognize patterns and trends. It uses random samples of data obtained from a population to describe that population and make inferences about it. Inferential statistics is applied in economic planning together with powerful methods like index numbers, time series analysis, and estimation. Statistical analysis finds its applications in all the major sectors – marketing, finance, economics, operations, and data mining. Statistical analysis helps marketing organizations carry out surveys and ask questions about their products.
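The mean, median, and mode calculations worked through earlier in this article can be checked with Python's standard statistics module. This is a minimal sketch; the variable names are illustrative, and the data sets are the experimental data sets used in the worked examples.

```python
import statistics

# Experimental data sets from the worked examples in this article
mean_data = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
odd_data = [1, 2, 3, 4, 5]
even_data = [1, 2, 3, 4, 5, 6]
mode_data = [1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 6]

# Mean: sum of all observations divided by the number of observations
mean_value = statistics.mean(mean_data)

# Median: middle term for an odd count, average of the two middle terms for an even count
median_odd = statistics.median(odd_data)
median_even = statistics.median(even_data)

# Mode: the most frequently occurring value
mode_value = statistics.mode(mode_data)

print("Mean:", mean_value)
print("Median (odd n):", median_odd)
print("Median (even n):", median_even)
print("Mode:", mode_value)
```

The results match the hand calculations: a mean of 11, medians of 3 and 3.5, and a mode of 3.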