
Types of Probability Distributions Every Data Science Expert Should know

Published
05th Sep, 2023

    Data Science has become one of the most popular interdisciplinary fields. It uses scientific approaches, methods, algorithms, and operations to obtain facts and insights from unstructured, semi-structured, and structured datasets. Organizations use these facts and insights to improve production, grow the business, and predict user requirements. Probability distributions play a significant role in analyzing data and in preparing a dataset for training a model. In this article, you will learn about probability distributions, random variables, the types of discrete distributions, and the types of continuous distributions.

    What is Probability Distribution? 

    A Probability Distribution is a statistical function that describes all the possible values a random variable can take and how likely each of them is. These values often lie within a range bounded by a minimum and a maximum possible value, although some distributions, such as the normal distribution, have no such bounds.

    The shape of a distribution is summarized by factors such as the mean (or average), standard deviation, skewness, and kurtosis. All of these play a significant role in data science as well. Probability distributions are used in physics, engineering, finance, data analysis, machine learning, and other fields.


    Significance of Probability distributions in Data Science 

    Most data science and machine learning operations depend on assumptions about the probability distribution of your data. Probability distributions allow a skilled data analyst to recognize and comprehend patterns in large data sets that would otherwise look like entirely random values. This makes probability distributions a toolkit for summarizing a large data set. Density functions and distribution techniques also help in plotting data, so analysts can visualize it and extract meaning.

    General Properties of Probability Distributions 

    A probability distribution determines the likelihood of each outcome. Its mathematical expression takes a specific value x of a random variable and returns the probability p(x) of that value. Some general properties of a probability distribution are:

    1. The probabilities of all possible values add up to 1.
    2. The probability of any specific value, or of any range of values, lies between 0 and 1.
    3. A probability distribution describes how the values of the random variable are spread out. Consequently, the type of variable also helps determine the type of probability distribution.
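
    The first two properties can be checked on a small example. The sketch below is an illustration only: it assumes two fair six-sided dice and builds the distribution of their sum by enumerating every outcome. The variable names are chosen for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of rolling two fair dice
# and count how often each sum occurs.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Probability mass function: p(x) = (number of ways to roll x) / 36
pmf = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

# Property 1: the probabilities of all possible values add up to 1.
total_probability = sum(pmf.values())

# Property 2: every individual probability lies between 0 and 1.
all_in_unit_interval = all(0 <= p <= 1 for p in pmf.values())

print(pmf[7])                # 1/6, the most likely sum
print(total_probability)     # 1
print(all_in_unit_interval)  # True
```

    Exact fractions are used instead of floating-point numbers so that the total comes out as exactly 1.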

    Common Data Types 

    Before explaining the individual probability distributions, let us first understand the main categories they fall into. Data analysts and data engineers deal with a broad spectrum of data, such as text, numerical, image, audio, voice, and many more. Each of these has its own means of being represented and analyzed. Data in a probability distribution can be either discrete or continuous. Numerical data in particular takes one of these two forms.

    • Discrete data: It takes only specific, countable values. Examples are the result of rolling two dice or the number of overs bowled in a T20 innings. In the first case, the result is a whole number between 2 and 12. In the second case, the value is a whole number no greater than 20. Different types of discrete distributions that use discrete data are:
      • Binomial Distribution 
      • Hypergeometric Distribution 
      • Geometric Distribution 
      • Poisson Distribution 
      • Negative Binomial Distribution 
      • Multinomial Distribution  
    • Continuous data: It can take any value within a range, including fractional values. Examples: weight, height, any trigonometric value, age, etc. Different types of continuous distributions that use continuous data are:
      • Beta distribution 
      • Cauchy distribution 
      • Exponential distribution 
      • Gamma distribution 
      • Logistic distribution 
      • Weibull distribution 

    Types of Probability Distribution explained 

    Here are some of the popular types of Probability distributions used by data science professionals. (Try all the code using Jupyter Notebook) 

    • Normal Distribution: It is also known as the Gaussian distribution. It is one of the simplest types of continuous distribution. This probability distribution is symmetrical around its mean value. Data close to the mean occurs more frequently than data far away from it. The distribution is defined by its mean and a finite variance; the standard normal distribution has mean = 0 and variance = 1.

    Here, you can see the Normal Distribution for different mean and variance values; the standard normal curve is centered at 0.

    Here is a code example showing the use of Normal Distribution: 

    from scipy.stats import norm
    import matplotlib.pyplot as mpl
    import numpy as np

    def normalDist() -> None:
        fig, ax = mpl.subplots(1, 1)
        # Mean, variance, skewness and excess kurtosis of the standard normal (0, 1, 0, 0)
        mean, var, skew, kurt = norm.stats(moments = 'mvsk')
        # Evaluate the curves between the 1st and 99th percentiles
        x = np.linspace(norm.ppf(0.01), norm.ppf(0.99), 100)
        ax.plot(x, norm.pdf(x),
            'r-', lw = 5, alpha = 0.6, label = 'norm pdf')
        ax.plot(x, norm.cdf(x),
            'b-', lw = 5, alpha = 0.6, label = 'norm cdf')
        # ppf is the inverse of cdf, so cdf(ppf(q)) should return q
        vals = norm.ppf([0.001, 0.5, 0.999])
        assert np.allclose([0.001, 0.5, 0.999], norm.cdf(vals))
        # Histogram of 1000 random draws, scaled so that its area is 1
        r = norm.rvs(size = 1000)
        ax.hist(r, density = True, histtype = 'stepfilled', alpha = 0.2)
        ax.legend(loc = 'best', frameon = False)
        mpl.show()

    normalDist()

    Output: 
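
    The statement that values near the mean occur more often than values far from it can also be expressed in numbers. The following sketch uses Python's standard-library `statistics.NormalDist` rather than SciPy; the mean and standard deviation are arbitrary values picked for illustration.

```python
from statistics import NormalDist

# Hypothetical parameters chosen only for illustration.
mu, sigma = 10.0, 2.0
dist = NormalDist(mu=mu, sigma=sigma)

# Probability of falling within k standard deviations of the mean,
# computed from the cumulative distribution function.
within = {}
for k in (1, 2, 3):
    within[k] = dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)
    print(f"within {k} standard deviation(s): {within[k]:.4f}")

# Symmetry around the mean: the density is equal at mu - d and mu + d.
symmetric = abs(dist.pdf(mu - 1.5) - dist.pdf(mu + 1.5)) < 1e-12
print(symmetric)
```

    The three probabilities come out close to 0.68, 0.95, and 0.997, whatever mean and standard deviation are chosen.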

    • Bernoulli Distribution: It is the simplest type of discrete probability distribution. It is a particular case of the Binomial distribution where n = 1. That is, a binomial distribution counts the successes over 'n' trials, whereas the Bernoulli distribution covers only a single trial with two possible outcomes.

     The Probability Mass Function of a Bernoulli Distribution is:

    $P(X = x) = p^{x} q^{1 - x}, \quad x \in \{0, 1\}$

    where p = probability of success and q = 1 - p = probability of failure

    Here is a code example showing the use of Bernoulli Distribution: 

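    from scipy.stats import bernoulli
    import matplotlib.pyplot as mpl
    import seaborn as sb

    def bernoulliDist():
        # Draw 1200 samples: 1 (success) with probability 0.7, otherwise 0
        data_bern = bernoulli.rvs(size = 1200, p = 0.7)
        # discrete = True draws one bar per outcome (0 and 1);
        # stat = 'probability' scales the bar heights to relative frequencies
        ax = sb.histplot(
            data_bern,
            discrete = True,
            stat = 'probability',
            color = 'g',
            alpha = 1)
        ax.set(xlabel = 'Bernoulli Values', ylabel = 'Relative Frequency')
        mpl.show()

    bernoulliDist()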

    Output:
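
    The Bernoulli probability mass function, P(X = x) = p^x * q^(1 - x) for x in {0, 1} with q = 1 - p, is short enough to write out by hand. The sketch below is an illustration with a hypothetical helper named `bernoulli_pmf`; it also derives the mean and variance directly from the two probabilities.

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """P(X = x) = p**x * q**(1 - x) for x in {0, 1}, with q = 1 - p."""
    if x not in (0, 1):
        return 0.0
    q = 1.0 - p
    return p**x * q ** (1 - x)

p = 0.7  # same success probability as the sampling example above
pmf = {x: bernoulli_pmf(x, p) for x in (0, 1)}

# Mean and variance computed from the definition of expectation.
mean = sum(x * prob for x, prob in pmf.items())                    # equals p
variance = sum((x - mean) ** 2 * prob for x, prob in pmf.items())  # equals p * q

print(pmf, mean, variance)
```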

    • Continuous Uniform Distribution: In this type of continuous distribution, all outcomes in an interval [a, b] are equally likely. The distribution is symmetric, and its probability density is constant at 1/(b-a) across the interval, so sub-intervals of equal length have equal probability.

    Here is a code example showing the use of Uniform Distribution: 

    from numpy import random
    import matplotlib.pyplot as mpl
    import seaborn as sb

    def uniformDist():
        # 1200 draws from the uniform distribution on [0, 1)
        sb.histplot(random.uniform(size = 1200), stat = 'density', kde = True)
        mpl.show()

    uniformDist()

    Output: 
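
    Because the density is constant at 1/(b-a), the probability of landing in a sub-interval [c, d] is just (d-c)/(b-a). The sketch below checks this by simulation with the standard-library `random` module; the bounds a, b, c, and d are hypothetical values chosen for the example.

```python
import random

a, b = 2.0, 10.0   # hypothetical interval bounds
c, d = 3.0, 5.0    # sub-interval of interest, with a <= c < d <= b

density = 1.0 / (b - a)   # constant height of the density on [a, b]
exact = (d - c) * density # P(c <= X <= d) = (d - c) / (b - a)

rng = random.Random(0)    # fixed seed so the run is repeatable
n = 100_000
hits = sum(c <= rng.uniform(a, b) <= d for _ in range(n))
estimate = hits / n

print(f"exact = {exact:.3f}, simulated = {estimate:.3f}")
```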

    • Log-Normal Distribution: A Log-Normal distribution is a continuous distribution of a random variable whose logarithm follows a normal distribution. Taking the logarithm of log-normal data therefore transforms it into normally distributed data.

    Here is a code example showing the use of Log-Normal Distribution 

    import matplotlib.pyplot as mpl
    import numpy as np

    def lognormalDist():
        muu, sig = 3, 1   # mean and standard deviation of log(X)
        s = np.random.lognormal(muu, sig, 1000)
        cnt, bins, ignored = mpl.hist(s, 80, density = True, align = 'mid', color = 'y')
        # Overlay the theoretical log-normal density on the histogram
        x = np.linspace(min(bins), max(bins), 10000)
        calc = (np.exp(-(np.log(x) - muu)**2 / (2 * sig**2))
               / (x * sig * np.sqrt(2 * np.pi)))
        mpl.plot(x, calc, linewidth = 2.5, color = 'g')
        mpl.axis('tight')
        mpl.show()

    lognormalDist()

    Output: 
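
    The link between the log-normal and the normal distribution can be checked numerically: if X is log-normal with parameters mu and sigma, then log(X) should have a mean close to mu and a standard deviation close to sigma. This is a minimal sketch using only the standard library; the sample size and seed are arbitrary choices.

```python
import math
import random
import statistics

mu, sigma = 3.0, 1.0   # same parameters as the plotting example above
rng = random.Random(42)

samples = [rng.lognormvariate(mu, sigma) for _ in range(50_000)]
logs = [math.log(s) for s in samples]

log_mean = statistics.fmean(logs)   # close to mu
log_std = statistics.stdev(logs)    # close to sigma

# The raw samples are right-skewed, so their mean exceeds their median.
raw_mean = statistics.fmean(samples)
raw_median = statistics.median(samples)

print(f"log mean = {log_mean:.3f}, log std = {log_std:.3f}")
print(f"raw mean = {raw_mean:.1f}, raw median = {raw_median:.1f}")
```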

    • Pareto Distribution: It is one of the most important types of continuous distribution. The Pareto Distribution is a skewed statistical distribution that uses a power law to describe quality control, scientific, social, geophysical, actuarial, and many other types of observable phenomena. Its plot shows a heavy, slowly decaying tail: most of the data sits near the minimum value, while a small share of observations takes very large values.

    Here is a code example showing the use of Pareto Distribution – 

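    import numpy as np
    from matplotlib import pyplot as plt
    from scipy.stats import pareto

    def paretoDist():
        xm = 1.5          # scale: the smallest possible value
        alp = [2, 4, 6]   # shape values; a larger value gives a faster-decaying tail
        x = np.linspace(0, 4, 800)
        # The density is zero for x < xm and jumps to its peak at x = xm
        for a in alp:
            plt.plot(x, pareto.pdf(x, scale = xm, b = a), label = f'alpha = {a}')
        plt.legend()
        plt.show()

    paretoDist()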

    Output:
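
    The power-law behaviour shows up most clearly in the tail probability, which for a Pareto distribution with scale xm and shape alpha is P(X > x) = (xm / x)^alpha for x >= xm. The sketch below compares that formula with simulated data. It relies on the standard-library `random.paretovariate`, which draws from a Pareto distribution with scale 1, so the samples are multiplied by xm; the shape value is an arbitrary choice.

```python
import random

xm, alpha = 1.5, 2.0   # scale and shape; xm matches the plot above
rng = random.Random(1)

# paretovariate(alpha) has scale 1, so multiplying by xm moves the minimum to xm.
samples = [xm * rng.paretovariate(alpha) for _ in range(100_000)]

tail_exact = {}
tail_observed = {}
for x in (3.0, 6.0, 12.0):
    tail_exact[x] = (xm / x) ** alpha   # P(X > x) for x >= xm
    tail_observed[x] = sum(s > x for s in samples) / len(samples)
    print(f"P(X > {x}): exact = {tail_exact[x]:.4f}, simulated = {tail_observed[x]:.4f}")
```

    Each doubling of x divides the tail probability by the same factor, 2^alpha, which is the signature of a power law.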

    • Exponential Distribution: It is a type of continuous distribution that models the time elapsed between events in a Poisson process. Suppose a Poisson distribution models the number of events, such as births in a hospital, happening in a given period. We can then model the time between consecutive births using an exponential distribution.

    Here is a code example showing the use of Exponential Distribution:

    from numpy import random
    import matplotlib.pyplot as mpl
    import seaborn as sb

    def expDist():
        # 1200 waiting times with the default scale (mean waiting time) of 1.0
        sb.histplot(random.exponential(size = 1200), stat = 'density', kde = True)
        mpl.show()

    expDist()

    Output:
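
    The connection to the Poisson process can be simulated directly. The sketch below generates event times by adding up exponential waiting times and then counts how many events fall into each unit-length window. If the waiting times are exponential with a given rate, the counts per window should have both mean and variance close to that rate, as a Poisson distribution does. The rate, horizon, and seed are assumptions made for this illustration.

```python
import random

rate = 3.0        # hypothetical average number of events per unit of time
horizon = 20_000  # number of unit-length time windows to simulate
rng = random.Random(7)

# Build event times by accumulating exponential waiting times.
counts = [0] * horizon
gap_total, n_gaps = 0.0, 0
t = rng.expovariate(rate)
while t < horizon:
    counts[int(t)] += 1          # the event falls in window [int(t), int(t) + 1)
    gap = rng.expovariate(rate)  # waiting time until the next event
    gap_total += gap
    n_gaps += 1
    t += gap

mean_gap = gap_total / n_gaps       # close to 1 / rate
mean_count = sum(counts) / horizon  # close to rate
var_count = sum((c - mean_count) ** 2 for c in counts) / horizon  # also close to rate

print(f"mean gap = {mean_gap:.3f}, mean count = {mean_count:.3f}, variance = {var_count:.3f}")
```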

    Types of the Discrete probability distribution – 

    There are various types of discrete probability distribution that a data science aspirant should know about. Some of them are:

    • Binomial Distribution: It is one of the popular discrete distributions; it gives the probability of 'x' successes in 'n' trials. We can use the Binomial distribution when we want the probability of SUCCESS or FAILURE in an experiment or survey that is repeated multiple times. A Binomial distribution has a fixed number of trials. The trials must also be independent, and the probability of success must remain the same in every trial.

    Here is a code example showing the use of Binomial Distribution – 

    from numpy import random
    import matplotlib.pyplot as mpl
    import seaborn as sb

    def binomialDist():
        n, p = 100, 0.6
        # Normal curve with the same mean n*p and standard deviation sqrt(n*p*(1-p))
        sb.kdeplot(random.normal(loc = n * p, scale = (n * p * (1 - p)) ** 0.5, size = 1200), label = 'normal')
        sb.kdeplot(random.binomial(n = n, p = p, size = 1200), label = 'binomial')
        mpl.legend()
        mpl.show()

    binomialDist()

    Output:
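
    The binomial probabilities themselves follow from the formula P(X = x) = C(n, x) * p^x * (1 - p)^(n - x). The sketch below evaluates it with the standard library for a small hypothetical experiment of 10 trials; `binomial_pmf` is a helper name introduced for this example.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.6   # hypothetical example: 10 trials, 60% chance of success each

pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
total = sum(pmf)                                     # adds up to 1
mean = sum(x * prob for x, prob in enumerate(pmf))   # equals n * p
p_at_least_8 = sum(pmf[8:])                          # P(X >= 8)

print(f"total = {total:.6f}, mean = {mean:.3f}, P(X >= 8) = {p_at_least_8:.4f}")
```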

    • Geometric Distribution: The geometric probability distribution is one of the crucial types of discrete distributions. It gives the probability that an event with likelihood 'p' first occurs on the n-th of a series of independent Bernoulli trials. Here 'n' is a discrete random variable. The experiment continues until we encounter the first success, so the number of trials is not fixed in advance.

    Here is a code example showing the use of Geometric Distribution – 

    import matplotlib.pyplot as mpl

    def probability_to_occur_at(attempt, probability):
        # P(first success on this attempt) = (1 - p)^(attempt - 1) * p
        return (1 - probability)**(attempt - 1) * probability

    p = 0.3
    attempt = 4
    attempts_to_show = list(range(1, 21))
    print(f'Probability that this event will first occur on attempt {attempt}: ', probability_to_occur_at(attempt, p))
    mpl.xlabel('Number of Trials')
    mpl.ylabel('Probability of the Event')
    barlist = mpl.bar(attempts_to_show, height=[probability_to_occur_at(x, p) for x in attempts_to_show], tick_label=attempts_to_show)
    # Bars are zero-indexed, so the bar for attempt k sits at index k - 1
    barlist[attempt - 1].set_color('g')
    mpl.show()

    Output:
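
    The formula can be cross-checked by actually running the experiment many times. The sketch below repeats Bernoulli trials until the first success and records how many trials that took. Under the geometric model the average should be close to 1/p, and the share of runs ending on the 4th trial should be close to (1 - p)^3 * p. The helper name, seed, and number of runs are choices made for this illustration.

```python
import random

p = 0.3   # same success probability as the example above
rng = random.Random(3)

def trials_until_first_success(p: float) -> int:
    """Run Bernoulli trials until the first success; return the trial count."""
    n = 1
    while rng.random() >= p:   # a draw below p counts as a success
        n += 1
    return n

runs = [trials_until_first_success(p) for _ in range(100_000)]

mean_trials = sum(runs) / len(runs)        # theory: 1 / p
share_on_4th = runs.count(4) / len(runs)   # theory: (1 - p)**3 * p
theory_on_4th = (1 - p) ** 3 * p

print(f"mean trials = {mean_trials:.3f} (theory {1 / p:.3f})")
print(f"first success on 4th trial = {share_on_4th:.4f} (theory {theory_on_4th:.4f})")
```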

    • Poisson Distribution: The Poisson distribution is one of the popular types of discrete distribution that shows how many times an event is likely to occur in a specific interval of time. We can obtain it as the limit of the Binomial distribution when the number of trials grows to infinity while the expected number of successes stays fixed. Data analysts often use the Poisson distribution to comprehend independent events occurring at a steady rate in a given time interval.

    Here is a code example showing the use of Poisson Distribution 

    from scipy.stats import poisson
    import seaborn as sb
    import matplotlib.pyplot as mpl

    def poissonDist():
        mpl.figure(figsize = (10, 10))
        # 5000 event counts with an average of 3 events per interval
        data_poisson = poisson.rvs(mu = 3, size = 5000)
        # discrete = True centers one bar on each integer count
        ax = sb.histplot(data_poisson, discrete = True, stat = 'probability', color = 'g')
        ax.set(xlabel = 'Number of Events', ylabel = 'Relative Frequency')
        mpl.show()

    poissonDist()

    Output:
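
    The Poisson distribution arises as a limit of the Binomial distribution: with n trials and success probability lam/n, the binomial probabilities approach the Poisson probabilities as n grows. The sketch below measures how quickly the gap closes. Both helper functions are written out from the textbook formulas, and the values of n are arbitrary.

```python
from math import comb, exp, factorial

lam = 3.0   # same mean as the sampling example above

def poisson_pmf(k: int, lam: float) -> float:
    return lam**k * exp(-lam) / factorial(k)

def binomial_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def max_gap(n: int) -> float:
    """Largest difference over k = 0..10 between Binomial(n, lam/n) and Poisson(lam)."""
    return max(abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam)) for k in range(11))

gaps = {n: max_gap(n) for n in (10, 100, 1000, 10000)}
for n, gap in gaps.items():
    print(f"n = {n:>5}: largest gap = {gap:.6f}")
```

    The gap shrinks steadily as n increases, which is why the Poisson distribution is a convenient stand-in for a binomial with many trials and a small success probability.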

    • Multinomial Distribution: A multinomial distribution is another popular type of discrete probability distribution. It gives the probability of each combination of counts when every trial can result in one of two or more outcomes. The term multi means more than one. The Binomial distribution is a particular case of the multinomial distribution with two possible outcomes, such as true/false or heads/tails.

    Here is a code example showing the use of Multinomial Distribution – 

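    import numpy as np
    import matplotlib.pyplot as mpl
    from scipy.stats import multinomial

    np.random.seed(99)
    n = 12
    pvalue = [0.3, 0.46, 0.24]   # category probabilities; they must sum to 1
    target = [7, 2, 3]           # combination of counts whose probability we estimate
    exact = multinomial.pmf(target, n = n, p = pvalue)
    s = []
    p = []
    for size in np.logspace(2, 3):
        # Simulate `size` experiments of n trials each
        outcomes = np.random.multinomial(n, pvalue, size=int(size))
        # Fraction of experiments that produced exactly the target counts
        prob = np.mean((outcomes == target).all(axis = 1))
        p.append(prob)
        s.append(int(size))
    fig1 = mpl.figure()
    mpl.plot(s, p, 'o-')
    mpl.plot(s, [exact]*len(s), '--r')   # theoretical probability
    mpl.grid()
    mpl.xlim(left = 0)
    mpl.xlabel('Number of Simulated Experiments')
    mpl.ylabel('Function p(X = K)')
    mpl.show()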

    Output:
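
    The exact multinomial probability can be computed from the formula n! / (c1! * c2! * ... * ck!) * p1^c1 * ... * pk^ck. The sketch below implements it with the standard library and also confirms that the two-category case matches the binomial formula. The helper name `multinomial_pmf` and the probabilities are illustrative assumptions.

```python
from math import factorial, prod

def multinomial_pmf(counts: list[int], probs: list[float]) -> float:
    """Probability of observing exactly these category counts in sum(counts) trials."""
    coefficient = factorial(sum(counts))
    for c in counts:
        coefficient //= factorial(c)   # n! / (c1! * c2! * ... * ck!)
    return coefficient * prod(p**c for p, c in zip(probs, counts))

# Three categories; these probabilities are hypothetical and sum to 1.
three_way = multinomial_pmf([7, 2, 3], [0.3, 0.46, 0.24])

# With two categories the formula reduces to the binomial probability.
two_way = multinomial_pmf([3, 7], [0.6, 0.4])
binomial = factorial(10) // (factorial(3) * factorial(7)) * 0.6**3 * 0.4**7

print(f"three categories: {three_way:.6f}")
print(f"two categories:   {two_way:.6f} (binomial formula: {binomial:.6f})")
```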

    • Negative Binomial Distribution: It is also a type of discrete probability distribution, used for random variables that count repeated Bernoulli trials. It is also known as the Pascal distribution, where the random variable tells us the number of trials needed to reach a specified number of successes.

    Here is a code example showing the use of Negative Binomial Distribution – 

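    import matplotlib.pyplot as mpl
    import numpy as np
    from scipy.stats import nbinom

    gr, kr = 3, 0.7            # successes required, probability of success per trial
    x = np.arange(0, 11)       # number of failures before the gr-th success
    g = nbinom.cdf(x, gr, kr)  # cumulative probability P(X <= x)
    s = nbinom.pmf(x, gr, kr)  # probability mass P(X = x)
    mpl.plot(x, g, "*", label = 'cdf')
    mpl.plot(x, s, "r--", label = 'pmf')
    mpl.legend()
    mpl.show()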

    Output: 
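
    One common way to parameterize the negative binomial distribution, and the one assumed in this sketch, counts the number of failures k that occur before the r-th success: P(X = k) = C(k + r - 1, k) * p^r * (1 - p)^k. The total number of trials is then k + r. The helper name and the parameter values below are illustrative.

```python
from math import comb

def neg_binomial_pmf(k: int, r: int, p: float) -> float:
    """Probability that exactly k failures occur before the r-th success."""
    return comb(k + r - 1, k) * p**r * (1 - p) ** k

r, p = 3, 0.5   # hypothetical: wait for the 3rd success with a 50% chance per trial

# Summing over the first 200 values of k covers essentially all of the probability.
pmf = [neg_binomial_pmf(k, r, p) for k in range(200)]
total = sum(pmf)                                          # close to 1
mean_failures = sum(k * prob for k, prob in enumerate(pmf))  # close to r * (1 - p) / p

# With r = 1 the variable counts failures before the first success,
# which is the geometric case.
geometric_case = neg_binomial_pmf(2, 1, 0.3)   # (1 - 0.3)**2 * 0.3

print(f"total = {total:.6f}, mean failures = {mean_failures:.3f}")
print(f"geometric case = {geometric_case:.4f}")
```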

    Apart from the distribution types mentioned here, various other probability distributions exist that data science professionals can use to model their datasets. In the next topic, we will look at some interconnections and relationships between various types of probability distributions.

    Relationship between various Probability distributions – 

    Many of the different types of probability distributions are interconnected. In the chart shown below, dashed lines mark limiting (approximate) relationships between two families of distributions, whereas solid lines mark exact relationships between them, such as transformations and special cases.

    Conclusion  

    Probability distributions are prevalent among data analysts and data science professionals because of their wide usage. Today, companies and enterprises hire data science professionals in many sectors, namely computer science, health, insurance, engineering, and even social science, where probability distributions are fundamental tools. It is essential for data analysts and data scientists to know the core of statistics. Probability distributions play a vital role in analyzing data and in preparing a dataset to train algorithms efficiently. If you want to learn more about data science, particularly probability distributions and their uses, check out KnowledgeHut's comprehensive Data science course.

    Profile

    Gaurav Kr. Roy

    Author

    Mr. Gaurav is a cybersecurity engineer, developer, researcher, and Book-Author who did his B.S.-Cybersecurity from EC-Council University & Masters from LPU. He is an India Book of Record holder, Guest speaker with 7+ years of experience in IT. 
