Mr. Gaurav is a cybersecurity engineer, developer, researcher, and Book-Author who did his B.S.-Cybersecurity from EC-Council University & Masters from LPU. He is an India Book of Record holder, Guest speaker with 7+ years of experience in IT.

Data Science has become one of the most popular interdisciplinary fields. It uses scientific approaches, methods, algorithms, and operations to obtain facts and insights from unstructured, semi-structured, and structured datasets. Organizations use these facts and insights for efficient production, business growth, and predicting user requirements. Probability distributions play a significant role in data analysis and in preparing a dataset for training a model. In this article, you will learn about random variables, the types of probability distributions, and the common discrete and continuous distributions.

A probability distribution is a statistical function that describes all the possible values a random variable can take within a given range, together with how likely each value is. This range has a lower bound and an upper bound: the minimum and the maximum possible values.

Where a value falls on the distribution's plot depends on several factors: the mean (or average), standard deviation, skewness, and kurtosis. All of these play a significant role in data science as well. Probability distributions are used in physics, engineering, finance, data analysis, machine learning, and other fields.
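As a rough illustration of these four quantities, the sketch below estimates them from a simulated sample. The seed, the sample size, and the choice of a normal distribution with mean 10 and standard deviation 2 are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
sample = rng.normal(loc=10.0, scale=2.0, size=100_000)

mean = sample.mean()
std = sample.std()

# Standardize the sample, then average the 3rd and 4th powers
z = (sample - mean) / std
skewness = np.mean(z ** 3)               # close to 0 for a symmetric distribution
excess_kurtosis = np.mean(z ** 4) - 3.0  # close to 0 for a normal distribution

print(f"mean={mean:.3f} std={std:.3f} "
      f"skewness={skewness:.3f} excess kurtosis={excess_kurtosis:.3f}")
```

For a skewed sample (for example, exponential data) the skewness estimate would move clearly away from 0.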

Most data science and machine learning operations depend on assumptions about the probability distribution of your data. Probability distributions allow a skilled data analyst to recognize and comprehend patterns in large data sets whose values would otherwise look entirely random. This makes the probability distribution a toolkit for summarizing a large data set. Density functions and distribution techniques also help in plotting data, which supports data analysts in visualizing data and extracting meaning.

A probability distribution determines the likelihood of each outcome. Mathematically, it takes a specific value x and gives the probability p(x) that the random variable takes that value. Some general properties of a probability distribution are:

- The probabilities of all possible values sum to 1.
- The probability of any specific value, or of any range of values, must lie between 0 and 1.
- Probability distributions tell us how the values of the random variable are spread out. Consequently, the type of variable also helps determine the type of probability distribution.
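The first two properties can be checked on a small example. The sketch below assumes two fair six-sided dice and builds the distribution of their sum with exact fractions; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how often each total appears among the 36 equally likely rolls
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

total_probability = sum(pmf.values())
all_in_unit_interval = all(0 <= p <= 1 for p in pmf.values())

for total, p in pmf.items():
    print(total, p)
print("sum of probabilities:", total_probability)
print("all between 0 and 1:", all_in_unit_interval)
```

Using `Fraction` instead of floats keeps the sum exact, so the check does not depend on rounding.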

Before looking at individual probability distributions, let us first understand their main categories. Data analysts and data engineers deal with a broad spectrum of data, such as text, numerical, image, audio, voice, and many more. Each of these has a specific means of being represented and analyzed. Data in a probability distribution can be either discrete or continuous; numerical data in particular takes one of these two forms.

**Discrete data:** It takes specific, countable values. Examples are the result of rolling two dice or the number of overs in a T-20 match. In the first case, the result lies between 2 and 12. In the second case, the value is at most 20. Different types of discrete distributions that use discrete data are:

- Binomial Distribution
- Hypergeometric Distribution
- Geometric Distribution
- Poisson Distribution
- Negative Binomial Distribution
- Multinomial Distribution

**Continuous data:** It can take any value within a range, which may be bounded or unbounded. Examples: weight, height, any trigonometric value, age, etc. Different types of continuous distributions that use continuous data are:

- Beta distribution
- Cauchy distribution
- Exponential distribution
- Gamma distribution
- Logistic distribution
- Weibull distribution

Here are some of the popular types of Probability distributions used by data science professionals. (Try all the code using Jupyter Notebook)

**Normal Distribution:**

The plot for this section shows Normal Distribution curves, with 0 at the center of the axis, for different mean and variance values.

Here is a code example showing the use of Normal Distribution:

    from scipy.stats import norm
    import matplotlib.pyplot as mpl
    import numpy as np

    def normalDist() -> None:
        fig, ax = mpl.subplots(1, 1)
        # Mean, variance, skewness, and kurtosis of the standard normal
        mean, var, skew, kurt = norm.stats(moments='mvsk')
        # x values spanning the 1st to the 99th percentile
        x = np.linspace(norm.ppf(0.01), norm.ppf(0.99), 100)
        ax.plot(x, norm.pdf(x), 'r-', lw=5, alpha=0.6, label='norm pdf')
        ax.plot(x, norm.cdf(x), 'b-', lw=5, alpha=0.6, label='norm cdf')
        # Sanity check: cdf is the inverse of ppf
        vals = norm.ppf([0.001, 0.5, 0.999])
        assert np.allclose([0.001, 0.5, 0.999], norm.cdf(vals))
        # Histogram of 1000 random draws, normalized to a density
        r = norm.rvs(size=1000)
        # density=True replaces the 'normed' argument, which Matplotlib removed
        ax.hist(r, density=True, histtype='stepfilled', alpha=0.2)
        ax.legend(loc='best', frameon=False)
        mpl.show()

    normalDist()

**Output:**

**Bernoulli Distribution:** It is the simplest type of probability distribution. It is a particular case of the Binomial distribution where n = 1. A binomial distribution covers 'n' trials, whereas the Bernoulli distribution covers only a single trial.

The Probability Mass Function of a Bernoulli Distribution is:

$$P(X = x) = p^{x} q^{1 - x}, \quad x \in \{0, 1\}$$

where p = probability of success and q = 1 - p = probability of failure
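The Bernoulli probability mass function, p^x q^(1 - x) with q = 1 - p, can be written as a few lines of plain Python. This is a minimal sketch; the function name `bernoulli_pmf` is made up for the example.

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """Return P(X = x) for a single trial with success probability p."""
    if x not in (0, 1):
        return 0.0  # a Bernoulli variable only takes the values 0 and 1
    q = 1.0 - p     # probability of failure
    return (p ** x) * (q ** (1 - x))

p_success = 0.7  # same success probability as the plotting example in this section
prob_success = bernoulli_pmf(1, p_success)
prob_failure = bernoulli_pmf(0, p_success)
print("P(X = 1) =", prob_success)
print("P(X = 0) =", prob_failure)
```

The two probabilities add up to 1, as they must for any probability distribution.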

Here is a code example showing the use of Bernoulli Distribution:

    from scipy.stats import bernoulli
    import matplotlib.pyplot as mpl
    import seaborn as sb

    def bernoulliDist():
        # 1200 single trials with success probability 0.7
        data_bern = bernoulli.rvs(size=1200, p=0.7)
        # histplot replaces the deprecated distplot; kde=True overlays a density estimate
        ax = sb.histplot(data_bern, kde=True, stat='density', color='g', alpha=1)
        ax.set(xlabel='Bernoulli Values', ylabel='Frequency Distribution')
        mpl.show()

    bernoulliDist()

**Output:**

**Continuous Uniform Distribution:**

Here is a code example showing the use of Uniform Distribution:

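    from numpy import random
    import matplotlib.pyplot as mpl
    import seaborn as sb

    def uniformDist():
        # 1200 draws from the uniform distribution on [0, 1)
        # histplot replaces the deprecated distplot
        sb.histplot(random.uniform(size=1200), kde=True, stat='density')
        mpl.show()

    uniformDist()

**Output:**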

**Log-Normal Distribution:** A Log-Normal distribution is a continuous distribution of a random variable whose logarithm is normally distributed. Taking the logarithm of log-normal data therefore transforms it into normally distributed data.
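The link between the log-normal and the normal distribution can be checked numerically. The sketch below draws log-normal samples, takes their logarithm, and compares the result with the parameters used to generate them; the seed and sample size are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
mu, sigma = 3.0, 1.0  # parameters of the underlying normal distribution

samples = rng.lognormal(mean=mu, sigma=sigma, size=100_000)
log_samples = np.log(samples)  # should look normal with mean mu and std sigma

log_mean = log_samples.mean()
log_std = log_samples.std()
print(f"mean of log(samples) = {log_mean:.3f} (expected about {mu})")
print(f"std of log(samples)  = {log_std:.3f} (expected about {sigma})")
```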

Here is a code example showing the use of Log-Normal Distribution:

    import numpy as np
    import matplotlib.pyplot as mpl

    def lognormalDist():
        muu, sig = 3, 1
        s = np.random.lognormal(muu, sig, 1000)
        # density=True replaces the 'normed' argument, which Matplotlib removed
        cnt, bins, ignored = mpl.hist(s, 80, density=True, align='mid', color='y')
        x = np.linspace(min(bins), max(bins), 10000)
        # Theoretical log-normal density for comparison with the histogram
        calc = (np.exp(-(np.log(x) - muu) ** 2 / (2 * sig ** 2))
                / (x * sig * np.sqrt(2 * np.pi)))
        mpl.plot(x, calc, linewidth=2.5, color='g')
        mpl.axis('tight')
        mpl.show()

    lognormalDist()

**Output:**

**Pareto Distribution:** It is one of the most important types of continuous distribution. The Pareto Distribution is a skewed statistical distribution that follows a power law and is used to describe quality control, scientific, social, geophysical, actuarial, and many other types of observable phenomena. Its plot shows a slowly decaying, heavy tail, so extreme values are far more likely than under a normal distribution.
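The power-law tail can be made concrete: for a Pareto distribution with minimum value xm and shape alpha, the probability of exceeding x is (xm / x)^alpha. The sketch below compares that formula with simulated data. The seed, sample size, and threshold are assumptions; note that NumPy's `pareto` draws from the Lomax (Pareto II) form, so the samples are shifted by 1 and scaled by xm to get the classical Pareto.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
xm, alpha = 1.5, 2.0  # scale (minimum value) and shape

# Convert NumPy's Lomax draws into classical Pareto draws
samples = xm * (1.0 + rng.pareto(alpha, size=200_000))

threshold = 3.0
empirical_tail = np.mean(samples > threshold)  # observed share above threshold
theoretical_tail = (xm / threshold) ** alpha   # power-law survival function

print(f"empirical   P(X > {threshold}) = {empirical_tail:.4f}")
print(f"theoretical P(X > {threshold}) = {theoretical_tail:.4f}")
```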

Here is a code example showing the use of Pareto Distribution –

    import numpy as np
    from matplotlib import pyplot as plt
    from scipy.stats import pareto

    def paretoDist():
        xm = 1.5          # scale: the minimum possible value
        alp = [2, 4, 6]   # shape parameters to compare
        x = np.linspace(0, 4, 800)
        # One density curve per shape parameter
        output = np.array([pareto.pdf(x, scale=xm, b=a) for a in alp])
        plt.plot(x, output.T)
        plt.show()

    paretoDist()

**Output:**

**Exponential Distribution:** It is a type of continuous distribution that models the time elapsed between events in a Poisson process. Suppose a Poisson distribution models the number of events, such as births, happening in a given period. We can then model the time between consecutive births using an exponential distribution.
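The relationship between the two distributions can be simulated: if the gaps between events are exponential, the number of events per unit of time should behave like a Poisson count, with mean and variance both equal to the rate. This is a sketch under assumed values (rate of 3 events per unit of time, fixed seed).

```python
import numpy as np

rng = np.random.default_rng(seed=1)
rate = 3.0  # assumed average number of events per unit of time

# Waiting times between consecutive events are exponential with mean 1 / rate
gaps = rng.exponential(scale=1.0 / rate, size=300_000)
arrival_times = np.cumsum(gaps)

# Count the events that fall in each complete unit-length interval
horizon = int(arrival_times[-1])
in_window = arrival_times[arrival_times < horizon]
counts = np.bincount(in_window.astype(int), minlength=horizon)

mean_count = counts.mean()
var_count = counts.var()
print(f"mean events per interval:     {mean_count:.3f}")
print(f"variance of events per interval: {var_count:.3f}")
```

For a Poisson count, mean and variance coincide, so both printed values should land near the chosen rate.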

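Here is a code example showing the use of Exponential Distribution:

    from numpy import random
    import matplotlib.pyplot as mpl
    import seaborn as sb

    def expDist():
        # 1200 draws from the exponential distribution with scale 1
        # histplot replaces the deprecated distplot
        sb.histplot(random.exponential(size=1200), kde=True, stat='density')
        mpl.show()

    expDist()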

**Output:**

There are various types of discrete probability distributions that a data science aspirant should know about. Some of them are:

**Binomial Distribution:** It is one of the popular discrete distributions. It gives the probability of 'x' successes in 'n' trials. We can use the Binomial distribution where we want the probability of SUCCESS or FAILURE from an experiment or survey that is repeated multiple times. A Binomial distribution has a fixed number of trials. Also, the trials should be independent, and the probability of success should remain the same in every trial.
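The probability of x successes in n independent trials is C(n, x) p^x (1 - p)^(n - x). The sketch below evaluates that standard formula directly; the helper name `binomial_pmf` and the values n = 100, p = 0.6 are chosen for illustration.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n_trials, p_success = 100, 0.6
pmf = [binomial_pmf(x, n_trials, p_success) for x in range(n_trials + 1)]

total = sum(pmf)                                # should be 1
mean = sum(x * px for x, px in enumerate(pmf))  # should be n * p
most_likely = max(range(n_trials + 1), key=lambda x: pmf[x])

print("sum of probabilities:", round(total, 6))
print("expected successes:", round(mean, 6))
print("most likely number of successes:", most_likely)
```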

Here is a code example showing the use of Binomial Distribution –

    from numpy import random
    import matplotlib.pyplot as mpl
    import seaborn as sb

    def binomialDist():
        # Density curves only (no histogram), as distplot(hist=False) produced
        sb.kdeplot(random.normal(loc=50, scale=6, size=1200), label='normal')
        sb.kdeplot(random.binomial(n=100, p=0.6, size=1200), label='binomial')
        mpl.legend()
        # The original called plt.show(), but pyplot is imported as mpl
        mpl.show()

    binomialDist()

**Output:**

**Geometric Distribution:** The geometric probability distribution is one of the important types of discrete distributions. It gives the probability that an event with likelihood 'p' first occurs on the 'n'-th of a series of independent Bernoulli trials. Here 'n' is a discrete random variable. In this distribution, the experiment goes on until we encounter the first success, so the number of trials is not fixed in advance.
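A quick simulation can confirm the formula (1 - p)^(n - 1) p for a first success on trial n. The sketch below uses NumPy's `geometric` sampler, which returns the number of trials up to and including the first success; the seed and sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
p = 0.3      # success probability of each Bernoulli trial
attempt = 4  # trial on which we want the first success

# Each draw is the trial number of the first success (1, 2, 3, ...)
trials_until_success = rng.geometric(p, size=200_000)

empirical = np.mean(trials_until_success == attempt)
theoretical = (1 - p) ** (attempt - 1) * p  # 3 failures, then 1 success
mean_trials = trials_until_success.mean()   # should be close to 1 / p

print(f"empirical   P(first success on trial {attempt}) = {empirical:.4f}")
print(f"theoretical P(first success on trial {attempt}) = {theoretical:.4f}")
print(f"average number of trials = {mean_trials:.3f}")
```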

Here is a code example showing the use of Geometric Distribution –

    import matplotlib.pyplot as mpl

    def probability_to_occur_at(attempt, probability):
        # (attempt - 1) failures followed by one success
        return (1 - probability) ** (attempt - 1) * probability

    p = 0.3
    attempt = 4
    attempts_to_show = list(range(1, 21))
    print(f'Possibility that this event will occur on try {attempt}: ',
          probability_to_occur_at(attempt, p))
    mpl.xlabel('Number of Trials')
    mpl.ylabel('Probability of the Event')
    barlist = mpl.bar(attempts_to_show,
                      height=[probability_to_occur_at(x, p) for x in attempts_to_show],
                      tick_label=attempts_to_show)
    # Bars are 0-indexed while attempts start at 1
    barlist[attempt - 1].set_color('g')
    mpl.show()

**Output:**

**Poisson Distribution:** The Poisson distribution is one of the popular types of discrete distribution. It shows how many times an event is likely to occur in a specific period of time. We can obtain it as the limit of the Binomial distribution as the number of Bernoulli trials goes to infinity while the expected number of successes stays fixed. Data analysts often use the Poisson distribution to model independent events occurring at a steady rate in a given time interval.
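The limiting relationship with the Binomial distribution can be seen numerically. The sketch below keeps the expected count fixed at 3, increases the number of trials, and measures how far the Binomial probabilities are from the Poisson ones. The helper names and the chosen values are illustrative.

```python
from math import comb, exp, factorial

def binomial_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k: int, lam: float) -> float:
    return exp(-lam) * lam ** k / factorial(k)

lam = 3.0  # expected number of events, held fixed
max_diffs = {}
for n in (10, 100, 10_000):
    p = lam / n  # success probability shrinks as the number of trials grows
    max_diffs[n] = max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11)
    )

for n, diff in max_diffs.items():
    print(f"n = {n:>6}: largest gap to Poisson = {diff:.6f}")
```

The gap shrinks as n grows, which is why the Poisson distribution is a convenient stand-in for many rare, independent trials.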

Here is a code example showing the use of Poisson Distribution:

    from scipy.stats import poisson
    import seaborn as sb
    import numpy as np
    import matplotlib.pyplot as mpl

    def poissonDist():
        mpl.figure(figsize=(10, 10))
        # 5000 event counts with an average of 3 events per interval
        data_poisson = poisson.rvs(mu=3, size=5000)
        # histplot replaces the deprecated distplot; line_kws styles the KDE curve
        ax = sb.histplot(data_poisson, kde=True, stat='density', color='g',
                         bins=np.arange(data_poisson.min(), data_poisson.max() + 1),
                         line_kws={'lw': 4, 'label': 'KDE'})
        ax.set(xlabel='Poisson Distribution', ylabel='Data Frequency')
        mpl.show()

    poissonDist()

**Output:**

**Multinomial Distribution:** A multinomial distribution is another popular type of discrete probability distribution. It models the outcomes of repeated trials where each trial can result in one of two or more categories. The term multi means more than one. The Binomial distribution is a particular type of multinomial distribution with two possible outcomes, such as true/false or heads/tails.
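The claim that the Binomial distribution is a two-category multinomial can be checked with the standard multinomial formula, n! / (x1! x2! ... xk!) multiplied by the product of p_i^x_i. This is a minimal sketch; the helper name `multinomial_pmf` and the example numbers are assumptions.

```python
from math import comb, factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing exactly these category counts."""
    coefficient = factorial(sum(counts))
    for c in counts:
        coefficient //= factorial(c)  # exact integer division at every step
    return coefficient * prod(p ** c for p, c in zip(probs, counts))

# Two categories: 3 successes and 7 failures in 10 trials
two_category = multinomial_pmf([3, 7], [0.4, 0.6])
binomial = comb(10, 3) * 0.4 ** 3 * 0.6 ** 7
print("multinomial with two categories:", two_category)
print("binomial:                       ", binomial)

# Three categories: probabilities over all outcomes of 4 trials sum to 1
probs = [0.2, 0.3, 0.5]
total = sum(
    multinomial_pmf([a, b, 4 - a - b], probs)
    for a in range(5)
    for b in range(5 - a)
)
print("sum over all three-category outcomes:", round(total, 6))
```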

Here is a code example showing the use of Multinomial Distribution –

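    import numpy as np
    import matplotlib.pyplot as mpl
    from scipy.stats import multinomial

    np.random.seed(99)
    n = 12
    # Probabilities must sum to 1. The original list (0.3, 0.46, 0.22) summed to 0.98,
    # and NumPy treats the last entry as the remainder, which is 0.24.
    pvalue = [0.3, 0.46, 0.24]
    target = [7, 2, 3]
    s = []
    p = []
    for size in np.logspace(2, 3):
        outcomes = np.random.multinomial(n, pvalue, size=int(size))
        # Fraction of experiments that produced exactly the target counts
        prob = np.mean((outcomes == target).all(axis=1))
        p.append(prob)
        s.append(int(size))

    fig1 = mpl.figure()
    mpl.plot(s, p, 'o-')
    # Exact probability for comparison (replaces the hard-coded 0.0248,
    # which does not match these parameters)
    mpl.plot(s, [multinomial.pmf(target, n=n, p=pvalue)] * len(s), '--r')
    mpl.grid()
    mpl.xlim(left=0)
    mpl.xlabel('Number of Events')
    mpl.ylabel('Function p(X = K)')
    mpl.show()

**Output:**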

**Negative Binomial Distribution:**

Here is a code example showing the use of Negative Binomial Distribution –

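    import matplotlib.pyplot as mpl
    import numpy as np
    from scipy.stats import nbinom

    gr, kr = 0.3, 0.7  # shape parameters n and p of the distribution
    # The pmf is defined on non-negative integers, and the ppf takes
    # probabilities between 0 and 1, so each gets its own axis.
    k = np.arange(0, 7)
    q = np.linspace(0.01, 0.99, 70)
    fig, (ax_pmf, ax_ppf) = mpl.subplots(1, 2)
    ax_pmf.plot(k, nbinom.pmf(k, gr, kr), 'r--o')
    ax_pmf.set(xlabel='Number of failures', ylabel='pmf')
    ax_ppf.plot(q, nbinom.ppf(q, gr, kr), '*')
    ax_ppf.set(xlabel='Probability', ylabel='ppf')
    mpl.show()

**Output:**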

Apart from the distribution types mentioned here, various other probability distributions exist that data science professionals can use to model their data reliably. In the next topic, we will understand some interconnections and relationships between various types of probability distributions.

It is surprising to see that different types of probability distributions are interconnected. In the chart shown below, the dashed line is for limited connections between two families of distribution, whereas the solid lines show the exact relationship between them in terms of transformation, variable, type, etc.

**Conclusion**

Probability distributions are widely used by data analysts and data science professionals. Today, companies and enterprises hire data science professionals in many sectors, namely computer science, health, insurance, engineering, and even social science, where probability distributions are fundamental tools. It is essential for data analysts and data scientists to know the core of statistics. Probability distributions play an essential role in analyzing data and preparing a dataset to train algorithms efficiently. If you want to learn more about data science, particularly probability distributions and their uses, check out KnowledgeHut's comprehensive Data science course.
