

Statistics for Data Science with Python [Beginner’s Guide]

Published: 12th Sep, 2023

    Statistics is a well-known field primarily concerned with data collection, organization, analysis, interpretation, and visualization. In the past, statisticians, economists, and business leaders used statistics to calculate and present relevant information in their respective fields. Today, statistics plays a crucial role in several domains, including data science, machine learning, business intelligence, and computer science. One of the first steps in learning data science is to become familiar with statistics and mathematics; the next is to learn how to code. In this blog, we will discuss statistics for data science using Python. Let’s start!

    Why Python for Statistics?

    Python's ease of use and straightforward syntax are two of the key reasons it is so popular in the scientific and research fields. Python is an important tool in the data analyst's toolkit because it is well suited to repetitive tasks and data processing, and anyone who has worked with large volumes of data knows how often repetition occurs. Probability and statistics for data science are very easy to implement on datasets using Python. Because the tool handles the menial labor, data analysts can focus on the more interesting and rewarding aspects of their jobs. Statistics for data science with Python, and applied statistics with Python more broadly, play a vital role in paving the path of a data scientist.

    Some of the primary reasons for using Python for statistical analysis are as follows:

    1. Open-source Python statistics libraries

    There are numerous open-source Python libraries and statistics packages for data manipulation, data visualization, statistics, mathematics, machine learning, and natural language processing. Pandas, Matplotlib, scikit-learn, and SciPy are examples of Python libraries commonly used for statistics.

    2. Fewer lines of code

    Python gives programmers the advantage of needing fewer lines of code to get things done than older languages require. With Python, you can accomplish outstanding data analysis in relatively few lines of code.

    3. Great support

    Fortunately, Python has a significant following and is widely used in both academic and industry circles, so there are many excellent analytics libraries available. Python users in need of help can always turn to Stack Overflow, mailing lists, and user-contributed code and documentation. And as Python grows in popularity, more users share their experiences, resulting in more free support material. It's no surprise that Python's popularity keeps growing! The Data Science Professional Certificate can help you learn about the fundamental data types, descriptive analysis methods, Series, and DataFrames.

    Understanding Descriptive Statistics

    Descriptive statistics, in general, refers to describing data using representative methods such as charts, tables, and Excel files. The data is described in a way that communicates relevant information, which can also be used to anticipate future trends. Univariate analysis describes and summarises a single variable. Bivariate analysis describes the statistical relationship between two variables. Multivariate analysis describes the statistical relationship among many variables. Later in this article, we will compute these descriptive statistics in Python.

    A) Types of Measures

    Descriptive statistics are classified into two types:

    1. Measure of central tendency

    A measure of central tendency is a single value that seeks to describe the entire set of data. The three main measures of central tendency are as follows:

    a. Mean

    It is calculated by dividing the sum of the observations by the total number of observations. In other words, it is the sum divided by the count.

    b. Median

    The median is the data set's middle value; it divides the data into two halves. For a sorted data set with n values, the middle position is (n + 1)/2. If the number of items in the data set is odd, the center element is the median; otherwise, the median is the average of the two center elements.

    c. Mode

    It is the most frequently occurring value in the given data set. If every value occurs with the same frequency, the data set may not have a mode. A data set can also have several modes if two or more values occur with the same highest frequency.

    2. Measure of variability

    A measure of variability describes the spread of the data, that is, how dispersed the data is. The most common measures of variability are:

    a. Standard deviation 

    It is calculated by taking the square root of the variance. It is determined by first finding the mean (the average), then subtracting the mean from each value and squaring the result, adding those squared differences, dividing by the number of data points, and finally taking the square root.

    b. Range 

    The range represents the difference between the largest and smallest data points in our data set. The range reflects the spread of the data: the larger the range, the more spread out the data, and vice versa.

    Range = Largest data value – smallest data value

    c. Variance

    It is defined as the average squared deviation from the mean. It is determined by squaring the difference between each data point and the mean, adding all of these squared differences, and then dividing by the number of data points in the data set.


    B) Population and Samples

    The population is a grouping of all the elements or items you are interested in statistically. Populations are frequently large, which makes it impractical to collect and analyze data on every member. That is why statisticians typically attempt to draw conclusions about a population by selecting and analyzing a representative subset of that group.

    This subset of a population is referred to as a sample. Ideally, the sample should preserve the population's key statistical traits to a reasonable degree, so that you can draw conclusions about the population based on the sample.
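    As a minimal illustration of sampling (the population values, seed, and sample size below are made up for this example), you can draw a simple random sample and compare its mean with the population mean:

    import random
    import statistics

    random.seed(0)                          # make the draw repeatable
    population = list(range(1, 1001))       # a population of 1,000 values
    sample = random.sample(population, 50)  # a simple random sample of 50 values

    print(statistics.mean(population))      # population mean: 500.5
    print(statistics.mean(sample))          # sample mean: should be close to 500.5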

    C) Outliers

    A data point that deviates significantly from the rest of the data in a sample or population is referred to as an outlier.

    Outliers can have a variety of causes, but here are a handful to get you started:

    • Natural data variation
    • Changes in the observed system's behavior
    • Data gathering errors

    Data-gathering errors are a particularly common cause of outliers.

    Note: Outliers do not have a precise mathematical definition. To decide whether a data point is an outlier and how to treat it, you must rely on experience, knowledge of the area of interest, and common sense.
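    That said, one common heuristic (not specific to this article) flags values that fall more than 1.5 times the interquartile range below the first quartile or above the third quartile. A minimal sketch with made-up data:

    import numpy as np

    data = np.array([2, 3, 3, 4, 5, 5, 6, 7, 40])   # 40 looks suspicious
    q1, q3 = np.percentile(data, [25, 75])           # first and third quartiles
    iqr = q3 - q1                                    # interquartile range
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # typical outlier fences
    print(data[(data < lower) | (data > upper)])     # [40]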

    Skills like descriptive statistics and inferential statistics are covered in our Data Science Bootcamp, making it worth considering when you are looking to take your data science career to the next level.

    Choosing Python Statistics Libraries 

    There are numerous Python statistics libraries available for use, but in this tutorial, you'll learn about some of the more popular and extensively used ones:

    1. Python’s Statistics

    It is a built-in Python module for descriptive statistics. If your datasets are not too large or if you cannot rely on importing other libraries, you can utilize it.

    2. NumPy

    It is a third-party numerical computing package that is optimized for working with single- and multi-dimensional arrays. Its primary type is an array known as ndarray. This package offers a large number of statistical analysis routines.

    3. SciPy

    It is a NumPy-based third-party library for scientific computing. It provides more capabilities than NumPy, such as scipy.stats for statistical analysis.

    4. Pandas

    It is a NumPy-based third-party library for numerical computing. It excels at labelled one-dimensional (1D) data handling with Series objects and two-dimensional (2D) data handling with DataFrame objects.

    5. Matplotlib

    It is a third-party data visualization package. It is useful in conjunction with NumPy, SciPy, and Pandas.

    Getting Started with Python Statistics Libraries

    The Python statistics library includes only a subset of the most relevant statistics routines. If you can only use Python, the Python statistics library might be the best option.

    If you want to learn Pandas, the official Getting Started page is an excellent place to begin. Matplotlib has a comprehensive official User’s Guide that you can use to dive into the details of using the library.

    Let’s start using these Python statistics libraries!

    Calculating Descriptive Statistics in Python

    Python statistical modules provide simple and effective techniques for interacting with data.

    Let’s get our hands dirty by implementing these libraries and techniques in Python.

    1. Measures of Central Tendency

     a. Mean

    import statistics
    # initializing list
    li = [1, 2, 3, 3, 2, 2, 2, 1]
    # using mean() to calculate average of list
    # elements
    print ("The average of list values is : ",end="")
    print (statistics.mean(li))

    Output:

    The average of list values is : 2

    b. Median

    from statistics import median
    from fractions import Fraction as fr
    # tuple of positive integer values
    data1 = (2, 3, 4, 5, 7, 9, 11)
    # tuple of floating point values
    data2 = (2.4, 5.1, 6.7, 8.9)
    # tuple of fractional numbers
    data3 = (fr(1, 2), fr(44, 12), fr(10, 3), fr(2, 3))
    data4 = (-5, -1, -12, -19, -3)
    data5 = (-1, -2, -3, -4, 4, 3, 2, 1)
    # Printing the median of above datasets
    print("Median of data-set 1 is % s" % (median(data1)))
    print("Median of data-set 2 is % s" % (median(data2)))
    print("Median of data-set 3 is % s" % (median(data3)))
    print("Median of data-set 4 is % s" % (median(data4)))
    print("Median of data-set 5 is % s" % (median(data5)))

    Output:

    Median of data-set 1 is 5
    Median of data-set 2 is 5.9
    Median of data-set 3 is 2
    Median of data-set 4 is -5
    Median of data-set 5 is 0.0

    c. Mode

    from statistics import mode
    from fractions import Fraction as fr
    # tuple of positive integer numbers
    data1 = (2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7)
    # tuple of a set of floating point values
    data2 = (2.4, 1.3, 1.3, 1.3, 2.4, 4.6)
    # tuple of a set of fractional numbers
    data3 = (fr(1, 2), fr(1, 2), fr(10, 3), fr(2, 3))
    # tuple of a set of negative integers
    data4 = (-1, -2, -2, -2, -7, -7, -9)
    # tuple of strings
    data5 = ("red", "blue", "black", "blue", "black", "black", "brown")
    # Printing out the mode of the above data-sets
    print("Mode of data set 1 is % s" % (mode(data1)))
    print("Mode of data set 2 is % s" % (mode(data2)))
    print("Mode of data set 3 is % s" % (mode(data3)))
    print("Mode of data set 4 is % s" % (mode(data4)))
    print("Mode of data set 5 is % s" % (mode(data5)))

    Output:

    Mode of data set 1 is 5
    Mode of data set 2 is 1.3
    Mode of data set 3 is 1/2
    Mode of data set 4 is -2
    Mode of data set 5 is black

    2. Measure of variability

    a. Range

    # Sample Data
    arr = [1, 2, 3, 4, 5]
    #Finding Max
    Maximum = max(arr)
    # Finding Min
    Minimum = min(arr)
    # Difference Of Max and Min
    Range = Maximum-Minimum
    print("Maximum = {}, Minimum = {} and Range = {}".format(
     Maximum, Minimum, Range))

    Output:

    Maximum = 5, Minimum = 1 and Range = 4

    b. Variance

    # Python code to demonstrate variance()
    # function on varying range of data-types
    # importing statistics module
    from statistics import variance
    # importing fractions as parameter values
    from fractions import Fraction as fr
    # tuple of a set of positive integers
    # numbers are spread apart but not very much
    sample1 = (1, 2, 5, 4, 8, 9, 12)
    # tuple of a set of negative integers
    sample2 = (-2, -4, -3, -1, -5, -6)
    # tuple of a set of positive and negative numbers
    # data-points are spread apart considerably
    sample3 = (-9, -1, -0, 2, 1, 3, 4, 19)
    # tuple of a set of fractional numbers
    sample4 = (fr(1, 2), fr(2, 3), fr(3, 4),
    fr(5, 6), fr(7, 8))
    # tuple of a set of floating point values
    sample5 = (1.23, 1.45, 2.1, 2.2, 1.9)
    # Print the variance of each samples
    print("Variance of Sample1 is % s " % (variance(sample1)))
    print("Variance of Sample2 is % s " % (variance(sample2)))
    print("Variance of Sample3 is % s " % (variance(sample3)))
    print("Variance of Sample4 is % s " % (variance(sample4)))
    print("Variance of Sample5 is % s " % (variance(sample5)))

    Output:

    Variance of Sample1 is 15.80952380952381
    Variance of Sample2 is 3.5
    Variance of Sample3 is 61.125
    Variance of Sample4 is 1/45
    Variance of Sample5 is 0.17613000000000006

    c. Standard Deviation

    from statistics import stdev
    # importing fractions as parameter values
    from fractions import Fraction as fr
    # creating a varying range of sample sets
    # numbers are spread apart but not very much
    sample1 = (1, 2, 5, 4, 8, 9, 12)
    # tuple of a set of negative integers
    sample2 = (-2, -4, -3, -1, -5, -6)
    # tuple of a set of positive and negative numbers
    # data-points are spread apart considerably
    sample3 = (-9, -1, -0, 2, 1, 3, 4, 19)
    # tuple of a set of floating point values
    sample4 = (1.23, 1.45, 2.1, 2.2, 1.9)
    # Print the standard deviation of
    # following sample sets of observations
    print("The Standard Deviation of Sample1 is % s"
     % (stdev(sample1)))
    print("The Standard Deviation of Sample2 is % s"
     % (stdev(sample2)))
    print("The Standard Deviation of Sample3 is % s"
     % (stdev(sample3)))
    print("The Standard Deviation of Sample4 is % s"
     % (stdev(sample4)))

    Output:

    The Standard Deviation of Sample1 is 3.9761191895520196
    The Standard Deviation of Sample2 is 1.8708286933869707
    The Standard Deviation of Sample3 is 7.8182478855559445
    The Standard Deviation of Sample4 is 0.4196784483387

    3. Summary of Descriptive Statistics

    SciPy and Pandas provide useful techniques for obtaining descriptive statistics rapidly with a single function or method call. You can use scipy.stats.describe() in the following way:
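    The examples in this section assume that the libraries have been imported and that y (a NumPy array) and z (a Pandas Series) hold the same small one-dimensional dataset. As an assumption for illustration, a setup like the following is consistent with the outputs shown below:

    >>> import numpy as np
    >>> import pandas as pd
    >>> import scipy.stats
    >>> x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
    >>> y = np.array(x)
    >>> z = pd.Series(x)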

    >>> result = scipy.stats.describe(y, ddof=1, bias=False)
    >>> result
    DescribeResult(nobs=9, minmax=(-5.0, 41.0), mean=11.622222222222222, variance=228.75194444444446, skewness=0.9249043136685094, kurtosis=0.14770623629658886)

    The dataset must be provided as the first input. A NumPy array, list, tuple, or equivalent data structure can be used as the parameter. You can omit ddof=1 because it is the default and solely affects the variance calculation. Pass bias=False to force statistical bias correction of skewness and kurtosis.

    describe() returns an object that holds the following descriptive statistics:

    1. nobs: the number of observations or elements in your dataset
    2. minmax: the tuple with the minimum and maximum values of your dataset
    3. mean: the mean of your dataset
    4. variance: the variance of your dataset
    5. skewness: the skewness of your dataset
    6. kurtosis: the kurtosis of your dataset

    You can access particular values with dot notation:

    >>> result.nobs
    9
    >>> result.minmax[0] # Min
    -5.0
    >>> result.minmax[1] # Max
    41.0
    >>> result.mean
    11.622222222222222
    >>> result.variance
    228.75194444444446
    >>> result.skewness
    0.9249043136685094
    >>> result.kurtosis
    0.14770623629658886

    A descriptive statistics summary for your dataset is simply one function call away with SciPy.

    Pandas has similar, if not better, functionality. Series objects have the method .describe():

    >>> result = z.describe()
    >>> result
    count 9.000000
    mean 11.622222
    std 15.124548
    min -5.000000
    25% 0.100000
    50% 8.000000
    75% 21.000000
    max 41.000000
    dtype: float64

    It returns a new Series that holds the following:

    1. count: the number of elements in your dataset
    2. mean: the mean of your dataset
    3. std: the standard deviation of your dataset
    4. min and max: the minimum and maximum values of your dataset
    5. 25%, 50%, and 75%: the quartiles of your dataset

    If you want the resulting Series object to contain other percentiles, specify the optional percentiles parameter (a short example appears after this section). You can access each item of result with its label:

    >>> result['mean']
    11.622222222222222
    >>> result['std']
    15.12454774346805
    >>> result['min']
    -5.0
    >>> result['max']
    41.0
    >>> result['25%']
    0.1
    >>> result['50%']
    8.0
    >>> result['75%']
    21.0

    That’s how you can get descriptive statistics of a Series object with a single method call using Pandas.
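    If you need other percentiles, pass them to the optional percentiles parameter mentioned above as fractions between 0 and 1; a minimal sketch using the same Series z (the 50% row is always included):

    >>> z.describe(percentiles=[0.1, 0.9])  # summary with 10% and 90% rows instead of 25% and 75%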

    4. Measures of Correlation Between Pairs of Data

    You'll frequently need to investigate the relationship between the corresponding elements of two variables in a dataset. Assume you have two variables, x and y, each with an equal number of elements, n. Let x1 from x correspond to y1 from y, x2 from x correspond to y2 from y, and so on. Then you can say there are n pairs of corresponding elements: (x1, y1), (x2, y2), and so on.

    You'll notice the following correlation measures between pairs of data:

    A positive correlation exists when higher x values correspond to higher y values and vice versa.

    A negative correlation exists when larger values of x correspond to smaller values of y and vice versa.

    If there is no obvious association, there is a weak or no correlation.

    [Figure: three scatter plots showing negative (red), weak (green), and positive (blue) correlation]

    The plot with red dots on the left demonstrates a negative correlation. The plot with the green dots in the centre demonstrates a weak association. Finally, the figure with blue dots on the right demonstrates a positive association.

    Covariance and the correlation coefficient are two statistics that measure the correlation between datasets. Let's create some data to illustrate these measures. You'll build two Python lists and use them to create NumPy arrays and Pandas Series:

    >>> x = list(range(-10, 11))
    >>> y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]
    >>> x_, y_ = np.array(x), np.array(y)
    >>> x__, y__ = pd.Series(x_), pd.Series(y_)

    a. Covariance

    The sample covariance is a quantitative assessment of the intensity and direction of a relationship between two variables:

    If the correlation is positive, the covariance is also positive. A higher covariance value indicates a stronger association.

    If the correlation is negative, the covariance is also negative. A stronger relationship corresponds to a lower covariance value (larger in absolute value).

    The covariance is close to zero when the correlation is weak.

    This is how you can calculate the covariance in pure Python:

    >>> n = len(x)
    >>> mean_x, mean_y = sum(x) / n, sum(y) / n
    >>> cov_xy = (sum((x[k] - mean_x) * (y[k] - mean_y) for k in range(n))
    ... / (n - 1))
    >>> cov_xy
    19.95

    First, you have to find the mean of x and y. Then, you apply the mathematical formula for the covariance.

    NumPy has the function cov() that returns the covariance matrix:

    >>> cov_matrix = np.cov(x_, y_)
    >>> cov_matrix
    array([[38.5 , 19.95 ],
    [19.95 , 13.91428571]])

    b. Coefficient of Correlation

    The symbol r represents the correlation coefficient, often known as the Pearson product-moment correlation coefficient. The coefficient is yet another measure of correlation between data. Consider it to be a standardised covariance. Here are some key facts regarding it:

    • The value of r is always between -1 and 1.
    • The value r > 0 indicates a positive correlation, while r < 0 indicates a negative correlation.
    • The value r = 1 corresponds to a perfect positive linear relationship between the variables, and r = -1 to a perfect negative linear relationship.
    • A value of r close to 0 means that the correlation between the variables is weak.
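    With the x and y values defined earlier, you can compute r with NumPy or SciPy. Here is a minimal sketch that repeats the same data (np.corrcoef() and scipy.stats.pearsonr() are standard calls in those libraries; the printed value is approximate):

    import numpy as np
    import scipy.stats

    # same data as in the covariance example above
    x_ = np.array(range(-10, 11))
    y_ = np.array([0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14])

    r_matrix = np.corrcoef(x_, y_)             # 2x2 matrix of correlation coefficients
    r, p_value = scipy.stats.pearsonr(x_, y_)  # Pearson's r and its p-value
    print(round(r, 2))                         # roughly 0.86 for this data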

    Working With 2D Data in Python


    Statisticians frequently work with two-dimensional data. Here are some 2D data format examples:

    • Tables in a database
    • CSV documents

    Beyond spreadsheet tools such as Excel, Calc, and Google Sheets, NumPy and SciPy offer a comprehensive way to work with 2D data in Python. Pandas provides a class called DataFrame specifically designed to handle 2D labelled data.

    1. Axes

    Begin by making a 2D NumPy array:

    >>> a = np.array([[1, 1, 1],
    ... [2, 3, 1],
    ... [4, 9, 2],
    ... [8, 27, 4],
    ... [16, 1, 1]])
    >>> a
    array([[ 1, 1, 1], 
    [ 2, 3, 1], 
    [ 4, 9, 2], 
    [ 8, 27, 4], 
    [16, 1, 1]])

    You now have a 2D dataset to work with in this part. You can use Python statistics functions and techniques on it in the same way that you would on 1D data:

    >>> np.mean(a)
    5.4
    >>> a.mean()
    5.4
    >>> np.median(a)
    2.0
    >>> a.var(ddof=1)
    53.40000000000001

    The functions and methods you've used so far include one optional argument called axis, which is essential when working with 2D data. Axis can have any of the following values:

    • axis=None says to compute the statistics across all data in the array. The examples above work this way. This is often the default behaviour in NumPy.

    • axis=0 says to compute the statistics across all rows, that is, for each column of the array. This is often the default behaviour of SciPy statistical functions.

    • axis=1 says to compute the statistics across all columns, that is, for each row of the array.

    Let’s see axis=0 in action with np.mean():

    >>> np.mean(a, axis=0)
    array([6.2, 8.2, 1.8])
    >>> a.mean(axis=0)
    array([6.2, 8.2, 1.8])

    The two statements above return new NumPy arrays with the mean for each column of a. In this example, the mean of the first column is 6.2. The second column has the mean 8.2, while the third has 1.8.

    If you provide axis=1 to mean(), then you’ll get the results for each row:

    >>> np.mean(a, axis=1)
    array([ 1., 2., 5., 13., 6.])
    >>> a.mean(axis=1)
    array([ 1., 2., 5., 13., 6.])

    As you can see, the first row of a has the mean 1.0, the second 2.0, and so on.

    2. DataFrames

    One of the fundamental Pandas data types is the DataFrame class. It's incredibly easy to use because it provides labels for rows and columns. Create a DataFrame with the array a:

    >>> row_names = ['first', 'second', 'third', 'fourth', 'fifth']
    >>> col_names = ['A', 'B', 'C']
    >>> df = pd.DataFrame(a, index=row_names, columns=col_names)
    >>> df
     A B C
    first 1 1 1
    second 2 3 1
    third 4 9 2
    fourth 8 27 4
    fifth 16 1 1

    Though the functionality differs, DataFrame methods are fairly similar to Series methods. When you invoke Python statistics methods without any arguments, the DataFrame will return the following results for each column:

    >>> df.mean()
    A 6.2
    B 8.2
    C 1.8
    dtype: float64
    >>> df.var()
    A 37.2
    B 121.2
    C 1.7
    dtype: float64

    DataFrame objects, like Series, have a .describe() method. It returns another DataFrame with a summary of statistics for all columns, using Python summary statistics:

    >>> df.describe()
     A B C
    count 5.00000 5.000000 5.00000
    mean 6.20000 8.200000 1.80000
    std 6.09918 11.009087 1.30384
    min 1.00000 1.000000 1.00000
    25% 2.00000 1.000000 1.00000
    50% 4.00000 3.000000 1.00000
    75% 8.00000 9.000000 2.00000
    max 16.00000 27.000000 4.00000

    The Python summary statistics contains the following results:

    1. count: the number of items in each column
    2. mean: the mean of each column
    3. std: the standard deviation
    4. min and max: the minimum and maximum values
    5. 25%, 50%, and 75%: the percentiles

    To learn more about these methods in data science, have a look at the Data Science Professional Certificate.

    Visualizing Data in Python

    In addition to calculating numerical summaries such as the mean, median, and variance, you can use visual methods to present, describe, and summarise data. In this section, you will learn how to present your data visually using the graphs listed below:

    1. Box plots
    2. Histograms
    3. Pie charts
    4. Bar charts
    5. X-Y plots
    6. Heatmaps

    Although matplotlib.pyplot is a very useful and commonly used library, it is not the only Python library available for this purpose. You can import it as follows:

    >>> import matplotlib.pyplot as plt
    >>> plt.style.use('ggplot')

    Pseudo-random numbers will be used to produce data. This section does not need prior understanding of random numbers. You simply need some arbitrary numbers, and pseudo-random number generators may help you get them. The np.random package creates pseudo-random number arrays:

    • np.random.randn() generates normally distributed numbers.
    • np.random.randint() generates uniformly distributed integers.

    1. Box Plots

     The box plot is an effective tool for visually showing descriptive statistics in a given dataset. You may see the range, interquartile range, median, mean, outliers, and all quartiles. First, gather some data to depict using a box plot:

    >>> np.random.seed(seed=0)
    >>> x = np.random.randn(1000)
    >>> y = np.random.randn(100)
    >>> z = np.random.randn(10)

    The first statement uses seed() to set the seed of the NumPy random number generator, guaranteeing the same results each time the code is executed. You don't have to set the seed, but if you don't, the outcomes will vary on each run.

    The remaining statements generate three NumPy arrays of normally distributed pseudo-random numbers: x is a 1000-item array, y a 100-item array, and z a 10-item array. Now that you have the data, .boxplot() yields the box plot:

    fig, ax = plt.subplots()
    ax.boxplot((x, y, z), vert=False, showmeans=True, meanline=True,
     labels=('x', 'y', 'z'), patch_artist=True,
     medianprops={'linewidth': 2, 'color': 'purple'},
     meanprops={'linewidth': 2, 'color': 'red'})
    plt.show()

    The parameters of .boxplot() define the following:

    • x represents your data.

    • vert sets the plot orientation to horizontal when False. The default orientation is vertical.

    • showmeans displays the mean of your data when True.

    • meanline represents the mean as a line when True. The default representation is a point.

    • labels: the labels of your data.

    • patch_artist determines how the graph is drawn.

    • medianprops denotes the properties of the line representing the median.

    • meanprops denotes the properties of the line or dot representing the mean.

    The code above generates the following image:

    [Figure: box plots of x, y, and z]

    There are three box plots visible. Each one corresponds to a single dataset (x, y, or z) and demonstrates the following:

    • The mean is shown by the red dashed line.
    • The purple line represents the median.
    • The left border of the blue rectangle represents the first quartile.
    • The right border of the blue rectangle represents the third quartile.
    • The length of the blue rectangle is the interquartile range.
    • Everything from left to right is included in the range.
    • The dots on the left and right are outliers.

    2. Histograms

    Histograms are very useful when a dataset contains many unique values. The histogram divides the values of a sorted dataset into intervals called bins. All bins frequently have equal width, but this is not always the case. The bin edges are the values of a bin's lower and upper boundaries.

    Each bin is assigned a single frequency value: the number of elements in the dataset whose values fall between the edges of that bin. By convention, all bins except the rightmost one are half-open: they include values equal to the lower bound but exclude values equal to the upper bound. The rightmost bin is closed because it includes both bounds. If you split a dataset with the bin edges 0, 5, 10, and 15, you get three bins (a quick check follows this list):

    • The first and leftmost bin contains the values greater than or equal to 0 and less than 5.
    • The second bin contains the values greater than or equal to 5 and less than 10.
    • The third and rightmost bin contains the values greater than or equal to 10 and less than or equal to 15.
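    As a quick check of this bin-edge behaviour (the sample values are made up), np.histogram() with explicit edges shows that 5 falls in the second bin and 15 in the closed rightmost bin:

    import numpy as np

    data = [0, 4.9, 5, 10, 15]
    hist, edges = np.histogram(data, bins=[0, 5, 10, 15])
    print(hist)   # [2 1 2]
    print(edges)  # [ 0  5 10 15]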

    The method np.histogram() provides an easy way to obtain data for histograms:

    >>> hist, bin_edges = np.histogram(x, bins=10)
    >>> hist
    array([ 9, 20, 70, 146, 217, 239, 160, 86, 38, 15])
    >>> bin_edges
    array([-3.04614305, -2.46559324, -1.88504342, -1.3044936 , -0.72394379,
    -0.14339397, 0.43715585, 1.01770566, 1.59825548, 2.1788053 ,
    2.75935511])

    It accepts your data array and the number of bins (or edges) and returns two NumPy arrays:

    • hist contains the frequency, or number of items, corresponding to each bin.
    • bin_edges contains the edges, or bounds, of the bins.

    .hist() can graphically display what np.histogram() computes:

    fig, ax = plt.subplots()
    ax.hist(x, bin_edges, cumulative=False)
    ax.set_xlabel('x')
    ax.set_ylabel('Frequency') 
    plt.show()

    [Figure: histogram of x]

    3. Pie Charts

    Pie charts show data with a small number of labels and their relative frequencies. They work effectively with labels that cannot be ordered (like nominal data). A pie chart is a circle divided into multiple slices. Each slice corresponds to a single label from the dataset and has an area proportional to the label's relative frequency.

    Let us define data that is associated with three labels:

    >>> x, y, z = 128, 256, 1024

    Now, create a pie chart with .pie():

    fig, ax = plt.subplots()
    ax.pie((x, y, z), labels=('x', 'y', 'z'), autopct='%1.1f%%')
    plt.show()

    [Figure: pie chart of x, y, and z]


    The first input to .pie() is your data, and the second is the sequence of labels. autopct defines the format of the relative frequencies shown in the figure; the result looks like the chart above.

    4. Bar Charts

    Bar charts can also show data that corresponds to labels or discrete numeric values. They can display pairs of data from two datasets: items in one set represent the labels, while corresponding items in the other represent the frequencies. Optionally, they can also display the errors associated with the frequencies.

    The bar chart displays parallel rectangles known as bars. Each bar represents a single label and has a height proportionate to its frequency or relative frequency. Let's make three datasets of 21 items each:

    >>> x = np.arange(21)
    >>> y = np.random.randint(21, size=21)
    >>> err = np.random.randn(21)

    Use np.arange() to obtain x, an array of consecutive integers from 0 to 20; this will represent the labels. y is an array of uniformly distributed random integers from 0 to 20; this array will represent the frequencies. err contains normally distributed floating-point numbers, which represent the errors; this parameter is optional.

    You may make a bar chart using .bar() for vertical bars or .barh() for horizontal bars:

    fig, ax = plt.subplots()
    ax.bar(x, y, yerr=err)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    plt.show()

    This code should produce a bar chart with error bars.

    5. X-Y Plots

    An x-y plot (or scatter plot with a regression line) represents pairs of data from two datasets. Let's generate some data and fit a regression line to it:

    >>> x = np.arange(21)
    >>> y = 5 + 2 * x + 2 * np.random.randn(21)
    >>> slope, intercept, r, *__ = scipy.stats.linregress(x, y)
    >>> line = f'Regression line: y={intercept:.2f}+{slope:.2f}x, r={r:.2f}'

    The dataset x is once again an array of integers ranging from 0 to 20. y is determined as a linear function of x that has been corrupted with random noise.

    linregress returns several results. You'll need the regression line's slope and intercept, as well as the correlation coefficient r. You can then apply .plot() to obtain the x-y plot:

    fig, ax = plt.subplots()
    ax.plot(x, y, linewidth=0, marker='s', label='Data points')
    ax.plot(x, intercept + slope * x, label=line)
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.legend(facecolor='white')
    plt.show()

    [Figure: x-y plot with the regression line]

    6. Heatmaps

    A heatmap can be used to display a matrix visually. The colors represent the matrix's numbers or elements. Heatmaps are especially useful for displaying covariance and correlation matrices. .imshow() can be used to generate a heatmap for a covariance matrix:

    matrix = np.cov(x, y).round(decimals=2)
    fig, ax = plt.subplots()
    ax.imshow(matrix)
    ax.grid(False)
    ax.xaxis.set(ticks=(0, 1), ticklabels=('x', 'y'))
    ax.yaxis.set(ticks=(0, 1), ticklabels=('x', 'y'))
    ax.set_ylim(1.5, -0.5)
    for i in range(2):
        for j in range(2):
            ax.text(j, i, matrix[i, j], ha='center', va='center', color='w')
    plt.show()

    Here, the heatmap contains the labels 'x' and 'y' as well as the numbers from the covariance matrix. You’ll get a figure like this:

    [Figure: heatmap of the covariance matrix]

    The yellow field corresponds to the matrix's greatest element, 130.34, while the purple field corresponds to the matrix's lowest element, 38.5. The blue squares in between represent the value 69.9.

    The heatmap for the correlation coefficient matrix may be obtained using the same logic:

    matrix = np.corrcoef(x, y).round(decimals=2)
    fig, ax = plt.subplots()
    ax.imshow(matrix)
    ax.grid(False)
    ax.xaxis.set(ticks=(0, 1), ticklabels=('x', 'y'))
    ax.yaxis.set(ticks=(0, 1), ticklabels=('x', 'y'))
    ax.set_ylim(1.5, -0.5)
    for i in range(2):
        for j in range(2):
            ax.text(j, i, matrix[i, j], ha='center', va='center', color='w')
    plt.show()

    The result is the figure below:

    [Figure: heatmap of the correlation coefficient matrix]


    The yellow color symbolizes the number 1.0, whereas the purple color indicates the value 0.99.

    Conclusion

    You now understand the quantities that describe and summarise datasets, as well as how to compute them in Python. It is feasible to obtain descriptive statistics using only Python code. However, this is rarely required. Typically, you'll use one of the libraries designed specifically for this purpose:

    1. For the most significant Python statistics functions, use statistics.
    2. To effectively handle arrays, use NumPy.
    3. For further Python statistics functions for NumPy arrays, use SciPy.
    4. To work with labeled datasets, use Pandas.
    5. Matplotlib can be used to visualize data via plots, charts, and histograms.

    You must know how to calculate descriptive statistics measures in the age of big data and artificial intelligence. You're now prepared to delve even further into the world of data science and machine learning. If you have any questions or comments, please leave them in the space below.

    Statistics for Data Science with Python FAQs

    1. Can you use Python for statistics?

    Yes, absolutely. Python prioritizes simplicity and readability while offering a wealth of relevant options for data analysts and scientists. As a result, even inexperienced programmers can use its relatively simple syntax to design effective solutions for complex problems with just a few lines of code.

    Python's built-in analytics tools make it ideal for processing large amounts of data. They can easily explore patterns, correlate information in large quantities, and deliver deeper insights, alongside other essential metrics for measuring performance.

    2. What statistics do you need for data science?

    At the very least, data analysis necessitates descriptive statistics and probability theory. These ideas will assist you in making better business decisions based on data. Probability distributions, statistical significance, hypothesis testing, and regression are all important concepts.

    Furthermore, knowing Bayesian thinking is required for machine learning. Bayesian reasoning is the act of updating beliefs as new data is gathered, and it is at the heart of many machine learning algorithms. Conditional probability, priors and posteriors, and maximum likelihood are all important topics.    

    3. Is Python as good as R for statistics?

    Both can handle almost any data analysis task and are regarded as reasonably simple languages to learn, particularly for beginners. When it comes to learning Python or R, there is no wrong decision. Both are in-demand skills that will enable you to complete almost any data analytics work you come across. Which one is best for you will ultimately depend on your background, interests, and professional objectives.

    Python is good for dealing with large amounts of data, data visualization, constructing deep learning models, developing statistical models, and non-statistical tasks such as web scraping, database storage, and process execution. R, on the other hand, is known for its large ecosystem of statistical packages.

    4. What percentage of data scientists use Python?

    In 2018, 66% of data scientists reported using Python every day, making Python the most popular data science language!


    Rohit Verma

    Author

    I am currently pursuing an engineering degree in data science and AI. I have worked on projects involving data science and full-stack web development (MERN), and I write articles on web and data science technologies with passion.
