Ashish is a technology consultant with 13+ years of experience who specializes in Data Science, the Python ecosystem and Django, DevOps, and automation. He focuses on the design and delivery of key, impactful programs.


Published

30th Jun, 2024


Imagine this: you are a doctor who wants to model the cholesterol levels of your 20 patients and classify each one into one of three categories (low, borderline-high, or high cholesterol) from data consisting of various features such as age, existing diseases, occupation, stress levels, and many more. Each of the categories has a known probability of being selected.

The real-world data around us is unclean, messy and often obtained from various sources such as sensors, databases, and websites in the form of text, video, audio and images. The unstructured raw data needs to be broken down, analyzed and converted into actionable insights.

The ‘Science’ part of Data Science consists of math and covers four major domains - Probability and Statistics, Linear Algebra, Calculus and Mathematical Optimization. These mathematical elements are applied in experimental design, data processing, modeling and drawing inferences to arrive at the best fit solution for a complex problem. This article will discuss the important concepts of Probability and Statistics for Data Science.

Probability and Statistics are everywhere around us - be it predicting stock prices, shortlisting students for a university, or even predicting the result of a cricket match and many more. Probability tells us about the likelihood of a particular event occurring. Using Statistics, we make a generalization or an inference about the population from samples.

For example, a medical application could study the relationship between smoking and lung cancer. Statistics is also used in time series applications such as sales forecasting and stock market price prediction. Concepts like probability distributions, statistical significance, regression, and hypothesis testing support an informed decision-making process. Researchers in the field typically combine skills from software development, business, and mathematics.

Probability is used to predict the likelihood of future events, while Statistics involves analyzing the frequency of past events. Companies use both to decide which products are likely to be the most usable and profitable.


Statistics covers the collection, presentation, analysis, and use of data to make decisions, solve problems, and design products and processes. It is used to draw inferences, for example about the quality, performance, or durability of a product, weather forecasts, or the utilization and loading of a system.

Probability enables us to use information and data to make intelligent statements and forecasts about future events. It helps quantify the risk attached to statistical inferences. Both are foundations of Data Science and can be used for robust design, simulation, design of experiments, decision analysis, forecasting, time-series analysis, and operations research. To get more insights and upskill yourself, consider enrolling in the Data Science Bootcamp job placement program.

Statistical Analysis helps uncover underlying trends and patterns in large amounts of data by forming a hypothesis, deciding the sample size, and formulating a sampling procedure. We can collect data from numerous sources such as websites, sensors, and databases and use Descriptive Statistics to summarize the data.

Further, Inferential Statistics can be used to test the hypothesis and derive estimates about the population from a sample. Finally, the findings can be visualized, interpreted, and generalized. Data comes in many forms, so it is important to choose the right tools and analysis techniques for a particular problem statement to arrive at accurate business insights. Python and R are the languages most widely used for statistical analysis in industry and academia, and many practitioners prefer Python.

**A. Quantitative Analysis**

Quantitative data is expressed in the form of numbers. It comprises numerical data collected from structured surveys, reviews of records, feedback, documents, and website data. Quantitative analysis is a deductive process used to test pre-specified concepts, constructs, and hypotheses that make up a theory. Because it rests on measured observations, it is more objective and depends less on the judgment of the individual researcher. A quantitative dataset consists of numerical variables such as age, height, weight, income, group size, and university size. It answers questions like ‘How much?’, ‘How often?’ and ‘How many?’. For example:

- How many people attended the class last week?
- How much profit did the company make last quarter?
- How often does a person visit a website?

**B. Qualitative Analysis**

Qualitative data consists of non-numeric data such as audio, video, transcripts, text documents, and interview responses. Such data can be grouped into classes; therefore, qualitative data is also known as categorical data. Relying only on numerical data is not sufficient, so to get better insight into consumers' viewpoints, companies collect qualitative data to understand the mindset and emotions behind a purchase. A qualitative dataset consists of variables such as gender, marital status, and type of occupation.

The questions answered with qualitative analysis are ‘How?’ and ‘Why?’. For example:

- Why does a person like visiting a particular cafe?
- How is a particular dish prepared in the kitchen?

In this era of Big Data, billions of queries and data points are generated every second. Data handling and data analytics have become a major focus for many companies. As a data scientist, it’s very important to understand basic data classification and the forms in which raw data appears around us. Data can be classified into 3 types:

**A) Structured Data:** These are stored in the form of rows and columns, like a table.

Ex: ‘Star Rating’ of movies released this year, ‘Unique ID number’ of employees.

**B) Semi-Structured Data**: These are loosely organized into categories using meta tags.

Ex: ‘Type of Email’ arranged in the inbox, ‘Twitter Tweets’ arranged by hashtags.

**C) Unstructured Data**: These have no predefined structure and include text-heavy content as well as media.

Ex: Videos, Images, Media Posts, Speech, Sound

**Data can also be:**

**A) Descriptive (aka Qualitative):** measured without numbers, using only categories or textual data. It can be further divided into three types: Binary, Nominal and Ordinal Data.

**Nominal**: the data can be categorized, but the categories have no ranking.

**Example:** Gender (Male/Female) and Blood Group (A+, A-, O+, B+)

**Ordinal**: the data can be categorized and also ranked in an ordered series.

**Example:** Performance (Poor/Average/Good)

**Binary**: variables with exactly two possible values.

**Example:** Variables such as (Yes/No) or (True/False)

**B) Numerical (aka Quantitative)**: measured with numbers and can be Discrete or Continuous

**Interval**: the data can be categorized and ranked, and the values are evenly spaced. There is no true zero point, so values may be negative (as with temperatures in Celsius) and ratios between values are not meaningful.

**Example:** Test Scores (200 - 400), Credit Score (300 - 500)

**Ratio:** the data can be categorized and ranked, and the values are evenly spaced. There is a true zero point, so ratios between values are meaningful and values generally cannot fall below zero.

**Example:** Reaction Rate, Flow Rate, Pulse, Temperature in Kelvin (0.0)

Measures of Central Tendency form the first moment of business decision in statistical analysis. They comprise the Mean, Median, and Mode and are used to summarize the data.

**Mean**: It represents the average of all the data. For example, to see how a class performed in a particular month, we can calculate the average, or mean, of all data points. It is the sum of all values divided by the total number of values and is most representative when the distribution is roughly symmetric, such as a normal distribution.

**Example**: A chocobar costs ₹20, a vanilla ice cream costs ₹25, and a strawberry shake costs ₹45. What is the mean price of the items purchased?

**Answer:**

$$\text{Mean} = \frac{20 + 25 + 45}{3} = \frac{90}{3} = 30$$

The mean price is ₹30. We can use the mean if there are no outliers in our data. It uses every data point, so it is sensitive and may give a distorted result if outliers or erroneous data points are present. It is the most commonly used measure of central tendency for normally distributed data.

**Median**: If there are outliers in the data, the median is a better choice because it depends only on the middle of the ordered data rather than on the size of every value. The median is the middle value of a sorted list of numbers. First, we sort the list of numbers and then find the middle.

**Example 1:** Calculate the median of 15, 34, 98, 23, 11, 45

First, arrange the values into a sorted list (according to rank): 11, 15, 23, 34, 45, 98

It has an even number of elements, so the median is the average of the two middle numbers.

$$\text{Median} = \frac{23 + 34}{2} = 28.5$$

**Example 2:** Calculate the median of 10,4,3,5,6

First, arrange this into a sorted list (according to rank) : 3,4,5,6,10

It has an odd number of elements, so the median would be the middle number.

Median = 5

Thus, we can observe that the median remains unaffected by extreme values.

**Mode**: The mode is the most frequently occurring element in the data. In Data Science, it is mostly preferred for categorical values, where the mean and median are not defined, though it applies to numerical values too. Data can have several modes or no mode at all.

**Example:** Find the mode of the dataset: 4, 6, 7, 4, 5, 2, 6, 2, 2, 5, 2, 7, 1, 3

**Answer:** Let’s arrange the numbers according to their frequencies:

1 : 1 time, 2 : 4 times, 3 : 1 time, 4 : 2 times, 5 : 2 times, 6 : 2 times, 7 : 2 times

We can observe that 2 is the element that occurs the most. Therefore, Mode = 2.
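The worked examples for mean, median, and mode can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library `statistics` module; the variable names are chosen here for illustration and are not part of the original article.

```python
from statistics import mean, median, multimode

# Prices from the mean example (in rupees)
prices = [20, 25, 45]
mean_price = mean(prices)

# Lists from the two median examples
even_list = [15, 34, 98, 23, 11, 45]
odd_list = [10, 4, 3, 5, 6]
median_even = median(even_list)  # average of the two middle values
median_odd = median(odd_list)    # the single middle value

# Dataset from the mode example; multimode returns every most-frequent value
dataset = [4, 6, 7, 4, 5, 2, 6, 2, 2, 5, 2, 7, 1, 3]
modes = multimode(dataset)

print(mean_price)   # 30
print(median_even)  # 28.5
print(median_odd)   # 5
print(modes)        # [2]
```

`multimode` is used instead of `mode` because it returns a list, which makes it visible when a dataset has more than one mode.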

The first moment of business decision can be used to understand the data space by plotting the data on graphs. In Python, the pandas df.describe() function reports the mean and the median (the 50% row) of numeric columns, and df.mode() returns the mode. Check out the Data Science Training program and design your career in Data Science with experts and job placements.

The first moment of business decision tells us where the data is centered, but not how widely it is spread around that center. That is where the second moment of business decision, Measures of Dispersion, comes into the picture. It tells us how far our data deviates from the average (or mean) value.

**Variance**: It is the average squared deviation from the mean, calculated by dividing the sum of the squared differences between each value and the mean by the total number of values.

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

It is expressed in the squared units of the data.

**Standard Deviation**: It is a measure of the spread or dispersion of the data. It is the square root of the variance, which brings the measure back to the original units. The larger the standard deviation, the larger the spread. It is expressed in the same unit as the mean of the data.

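**Example:** Find the mean and variance of data: 10, 8, 10, 8, 8, 4

**Answer:**

$$\text{Mean} = \frac{10 + 8 + 10 + 8 + 8 + 4}{6} = \frac{48}{6} = 8$$

$$\text{Variance} = \frac{(10-8)^2 + (8-8)^2 + (10-8)^2 + (8-8)^2 + (8-8)^2 + (4-8)^2}{6} = \frac{24}{6} = 4$$

The standard deviation is therefore $\sqrt{4} = 2$.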
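As a quick check of the example data 10, 8, 10, 8, 8, 4, the sketch below computes the mean, variance, and standard deviation with Python's standard library. It assumes the population definition described above (dividing by the total number of values), which is what `pvariance` and `pstdev` implement; the sample versions divide by n - 1 instead and are shown for contrast.

```python
from statistics import mean, pstdev, pvariance, variance

data = [10, 8, 10, 8, 8, 4]

mu = mean(data)                         # sum of values / number of values
population_variance = pvariance(data)   # divides by N, as in the text
population_std = pstdev(data)           # square root of the population variance
sample_variance = variance(data)        # divides by N - 1 (sample estimate)

print(mu)                   # 8
print(population_variance)  # 4
print(population_std)       # 2.0
print(sample_variance)      # 4.8
```

Which divisor to use depends on whether the six values are treated as the whole population or as a sample from a larger one.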

Linear Regression helps us model the relationship between two variables: one independent variable and one dependent variable. It could be used, for instance, to describe the relationship between height and weight, which tend to increase together in a roughly linear way. The strength of the linear relationship between the two variables is measured by the correlation coefficient. So, how is a regression equation represented?

A linear regression equation is given by

$$\hat{y} = \beta_0 + \beta_1 x$$

where $\beta_0$ is the constant (the y-intercept), $\beta_1$ is the regression coefficient (the slope of the regression line), $x$ is the independent variable, and $\hat{y}$ is the predicted value of the dependent variable.

$y$ is called the response variable. We use $x$ to predict or estimate $y$.

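**Example:** Consider the values (1,3), (2,4), (3,8), and (4,9). Find the estimated regression line.

**Answer:** The means are $\bar{x} = 2.5$ and $\bar{y} = 6$. The slope is

$$\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{11}{5} = 2.2$$

and the intercept is

$$\beta_0 = \bar{y} - \beta_1 \bar{x} = 6 - (2.2)(2.5) = 0.5$$

**Final regression line:** $\hat{y} = 0.5 + 2.2x$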
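The regression line for the four points in this example can be recomputed with ordinary least squares. The sketch below uses plain Python rather than a library so each step is visible; the variable names (`s_xy`, `s_xx`, and so on) are illustrative choices, not notation from the article.

```python
# Data points from the example: (x, y) pairs
points = [(1, 3), (2, 4), (3, 8), (4, 9)]
xs = [x for x, _ in points]
ys = [y for _, y in points]
n = len(points)

x_mean = sum(xs) / n  # 2.5
y_mean = sum(ys) / n  # 6.0

# Sum of cross-deviations and sum of squared deviations of x
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in points)
s_xx = sum((x - x_mean) ** 2 for x in xs)

slope = s_xy / s_xx                    # regression coefficient (beta_1)
intercept = y_mean - slope * x_mean    # constant term (beta_0)

print(f"y = {intercept:.1f} + {slope:.1f}x")  # y = 0.5 + 2.2x
```

In practice a library routine such as NumPy's `polyfit` would usually be preferred; the manual version is only meant to mirror the arithmetic.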

In Data Science, Linear Regression is used for forecasting, time-series analysis, and exploring possible cause-and-effect relationships between variables.

In resampling, random samples are drawn repeatedly from the same dataset so that a sampling distribution can be built up empirically from the data itself. No particular sample size is prescribed, but the more data there is, the better the resampled distribution reflects the underlying one.

When the data is large enough and its distribution is known, standard tools such as the t-distribution and chi-square tests can suffice to derive sampling distributions. However, when the sample is small or the distribution is unknown, resampling tests are recommended. Bootstrapping, Monte Carlo methods, and Cross Validation are some methods of resampling.

**A) Bootstrapping:** This method involves estimating quantities or characteristics of a population by averaging estimates from many small samples that are drawn from the data with replacement, so the same observation can be reused more than once.

The method is as follows :

- Choose a number of bootstrap samples to perform.
- Choose a sample size.
- For each sample, draw a sample with replacement, fit the data and estimate the result on the out of bag (not included data) sample.
- Calculate the mean obtained on the prediction of an out-of-bag data sample.
- Each of the samples has its own mean. Plot the means on a histogram and use them to compute a confidence interval.

The bootstrap method can be used to estimate the uncertainty of the mean, median, mode, standard deviation, variance, correlation, regression coefficients, odds ratios, and multivariate statistics.
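The sketch below illustrates a simpler variant of the procedure above: rather than fitting a model and scoring it on out-of-bag data, it bootstraps the mean of a small dataset and reads a percentile confidence interval off the resampled means. The function name `bootstrap_mean_ci`, its parameters, and the sample data are assumptions made for this illustration.

```python
import random
from statistics import mean


def bootstrap_mean_ci(data, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(data)
    # Each bootstrap sample is drawn with replacement and has the same size as the data
    boot_means = sorted(mean(rng.choices(data, k=n)) for _ in range(n_boot))
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper, boot_means


data = [10, 8, 10, 8, 8, 4]  # illustrative data
lower, upper, boot_means = bootstrap_mean_ci(data)
print(f"95% bootstrap CI for the mean: ({lower:.2f}, {upper:.2f})")
```

Plotting `boot_means` as a histogram gives the bootstrap distribution of the mean; the interval endpoints are its 2.5th and 97.5th percentiles.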

**B) Cross Validation:** In this method, a portion of the data is kept aside (hold-out data), the training is done on the rest of the data, and the hold-out data is used to test the model. The method is computationally expensive, but it works really well with small datasets.

For example, if you want to train using the K-fold algorithm with k = 5, the steps are:

- Set aside 1/k = ⅕ = 20% of the data points, chosen randomly, as the hold-out sample.
- Perform training using the remaining 80% of the dataset.
- Score the model on the hold-out sample.
- Repeat with the next 20% of the data as the hold-out sample until every data point has been used for scoring once.
- Find the mean of the model metrics.
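The K-fold steps above can be sketched in plain Python. The "model" here is deliberately trivial (it predicts the mean of the training values) so that the focus stays on the splitting and scoring loop; the function names, the toy data, and the choice of mean squared error as the metric are all assumptions for illustration.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the indices and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(values, k=5):
    """Return the mean hold-out MSE of a predict-the-mean model, plus the folds."""
    folds = k_fold_indices(len(values), k)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [v for i, v in enumerate(values) if i not in held]
        test = [values[i] for i in held_out]
        prediction = mean(train)  # toy model fitted on the training part only
        scores.append(mean((v - prediction) ** 2 for v in test))
    return mean(scores), folds


values = [float(v) for v in range(1, 21)]  # 20 illustrative data points
cv_score, folds = cross_validate(values, k=5)
print(f"Mean MSE over {len(folds)} folds: {cv_score:.2f}")
```

With 20 points and k = 5, each fold holds 4 points (20%), and every point is scored exactly once.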

**C) Tree-based methods:** Tree-based models use decision trees and an if-then approach to generate predictions. We can predict numerical values (Regression) or categorical values (Classification) with the help of a tree-based approach. Decision tree models are fundamental to all tree-based methods. A combination of many decision trees built in parallel is used in the Random Forest method. Similarly, a combination of many decision trees built in sequence is used in Gradient Boosting methods.

A decision tree consists of a root node, decision nodes, and leaf nodes. At each consecutive step, we evaluate a decision, answer it with yes or no, and move down to the next level.

One rule is created for each path from the root to leaf. The leaf holds the class prediction.

For Example: If Risk == Low, predict on-time payment of LOAN.
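To make the if-then structure concrete, the sketch below hand-writes a two-level decision tree as a Python function. Only the first rule (low risk implies on-time payment) comes from the example above; the second split on `has_stable_income` and the "late" leaf are hypothetical, added purely to show how each root-to-leaf path becomes one rule.

```python
def predict_loan_payment(risk: str, has_stable_income: bool) -> str:
    """Hand-written decision tree: each return statement is a leaf."""
    # Root node: rule taken from the example in the text
    if risk == "Low":
        return "on-time"
    # Decision node (hypothetical split, for illustration only)
    if has_stable_income:
        return "on-time"
    # Remaining path ends in a different leaf
    return "late"


print(predict_loan_payment("Low", has_stable_income=False))   # on-time
print(predict_loan_payment("High", has_stable_income=True))   # on-time
print(predict_loan_payment("High", has_stable_income=False))  # late
```

A trained decision tree learns such splits from data instead of having them written by hand.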

Probability is the likelihood of an event happening. An uncertain future outcome can take one of several values, each associated with a probability. Probability is expressed as a number between 0 (can never happen) and 1 (will always happen). In real life, probability is used to predict cricket scores and weather conditions. Probability for Statistical Analysis is a good read for someone who wants to understand the purpose of statistical analysis in real-life applications.

In some real-life situations, like predicting the stock market, genetics, and the weather of the week, the current or previous outcomes have an impact on the prediction of future probability.

P(A|B) denotes the probability of event A, given that event B has already occurred. Let A and B be two events such that P(B) > 0. Then the conditional probability is given by:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

**Example**: Let us roll a die. Let A and B be two events defined as follows:

- Event A - The outcome is an odd number.
- Event B - The outcome is less than or equal to 3.

What is the conditional probability of A given B has already occurred?

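**Answer**: The sample space of rolling a die has 6 outcomes = {1,2,3,4,5,6}. Here A = {1,3,5}, B = {1,2,3}, and A ∩ B = {1,3}, so P(A ∩ B) = 2/6 and P(B) = 3/6. Therefore, P(A|B) = (2/6) / (3/6) = 2/3.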
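The die example can be checked by enumerating the sample space directly. This is a small sketch that assumes a fair six-sided die, so every outcome is equally likely; it uses `fractions.Fraction` to keep the result exact.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
event_a = {x for x in sample_space if x % 2 == 1}  # outcome is odd
event_b = {x for x in sample_space if x <= 3}      # outcome is <= 3


def probability(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))


# P(A|B) = P(A and B) / P(B)
p_a_given_b = probability(event_a & event_b) / probability(event_b)
print(p_a_given_b)  # 2/3
```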

A random variable assigns a numerical value to each outcome of a random experiment.

There are two types of random variables, namely:

- Discrete Random Variable: It takes values from a finite or countable set. For example, X ∈ {0, 1, 2}.
- Continuous Random Variable: It can take any value within an interval. For example, 0 ≤ X < 3.

Suppose, we have an experiment of tossing a coin and our possible outcomes are Head, Tail.

Then, Random Variable X = 1 ( the outcome is head)

Random Variable X = 0 ( the outcome is tail)

Random variables give us a numerical way to represent the outcomes of an experiment. Unlike ordinary algebraic variables, their values are determined by chance.

The probability distribution for a random variable is an assignment of probability to each of the possible values of the variable. It is a mathematical function that relates each value of the variable to the probability of that value occurring in the population.

A continuous random variable X has a continuous probability distribution. The best-known example is the normal distribution, with its bell-shaped curve.

As a Data Scientist, you’ll encounter different categories of probability distributions, so it helps to brush up on and revise these basic concepts.

Bayes' theorem is a method to determine the probability of an event based on prior knowledge of related events. It updates the probability of a hypothesis in light of new evidence. It can provide insights into the performance of diagnostic tests. When we go to a doctor to get tested, we want to know the probability of being sick, given that the test is positive.

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

- **Likelihood:** Probability of “B” being True, given “A” is True. Denoted by P(B|A).
- **Prior:** Probability of “A” being True. This is our prior knowledge. Denoted by P(A).
- **Posterior:** Probability of “A” being True, given “B” is True. Denoted by P(A|B).
- **Marginalization:** The probability of “B” being True. Denoted by P(B).
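The diagnostic-test scenario can be worked through numerically. In the sketch below, A is "the patient is sick" and B is "the test is positive". The prevalence (1%), the test's sensitivity (95%), and its false positive rate (5%) are hypothetical numbers chosen only to illustrate the calculation; they do not come from the article.

```python
from fractions import Fraction


def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) via Bayes' theorem, with P(B) expanded over A and not-A."""
    # Marginal P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    marginal = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / marginal


p_sick = Fraction(1, 100)               # prior P(A), hypothetical
p_pos_given_sick = Fraction(95, 100)    # likelihood P(B|A), hypothetical
p_pos_given_healthy = Fraction(5, 100)  # P(B|not A), hypothetical

p_sick_given_pos = posterior(p_sick, p_pos_given_sick, p_pos_given_healthy)
print(p_sick_given_pos)                   # 19/118
print(round(float(p_sick_given_pos), 3))  # 0.161
```

Under these assumed numbers, a positive result raises the probability of being sick from 1% to only about 16%, because false positives among the many healthy patients outnumber true positives.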

A hypothesis is an assumption made on the basis of some available data. A p-value is a statistic used to test a hypothesis against observed data. The null hypothesis states that there is no significant relationship between the two variables. The alternative hypothesis says there is a significant relationship between the two variables. The p-value is the probability of obtaining a result at least as extreme as the observed one, assuming that the null hypothesis is true.

In order to test a hypothesis about a population, the p-value is computed and used to decide whether or not to reject the null hypothesis.

Scientists typically use confidence levels of 90% - 99%, which correspond to significance levels of 0.10 - 0.01, in order to get a more robust statistical test.

A 0.01 significance level is stricter than a 0.05 level, so a result that passes it is considered more significant.

- P value < significance level - Reject the Null Hypothesis
- P value >= significance level - Fail to reject the Null Hypothesis

**Example:**

Null Hypothesis: People of the city will like the chocolate cake

Alternative Hypothesis: People of the city will not like the chocolate cake

Mean = 330, Standard Deviation = 154, Sample Size = 25, Null hypothesis value = 260

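To calculate the P value, one can use the t-test, given by the formula:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{330 - 260}{154 / \sqrt{25}} = \frac{70}{30.8} \approx 2.27$$

**Where,**

$\bar{x}$ = sample mean

µ = null hypothesis value

s = sample standard deviation

n = sample size

From the t table, on the 24th row (n - 1 = 24 degrees of freedom), t ≈ 2.27 lies between 2.064 and 2.492, the critical values for two-tailed significance levels of 0.05 and 0.02. The P value is therefore between 0.02 and 0.05, which is statistically significant at the 0.05 level.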
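The t statistic for this example can be recomputed from the summary numbers given above (mean 330, standard deviation 154, sample size 25, null value 260). The sketch below uses only the standard library and compares the result with the two table values quoted in the text instead of computing an exact p-value, which would need a statistics package such as SciPy.

```python
import math

sample_mean = 330
sample_std = 154
n = 25
null_value = 260

# Standard error of the mean: s / sqrt(n)
standard_error = sample_std / math.sqrt(n)

# One-sample t statistic
t_stat = (sample_mean - null_value) / standard_error
degrees_of_freedom = n - 1

# Table values quoted in the text for 24 degrees of freedom
lower_critical, upper_critical = 2.064, 2.492

print(f"t = {t_stat:.2f} with {degrees_of_freedom} degrees of freedom")
print(lower_critical < t_stat < upper_critical)  # True
```

The statistic comes out at roughly 2.27, which falls between the two table values.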

Statistics and Probability in Data Science involve the collecting, organizing, and analyzing of data with the intent of deriving meaning, which can then be actioned. An explosion of data has been produced by the regular use of the internet and apps on phones, computers, and fitness trackers. These data sets can be combined to provide insights through statistical analysis.

Probability is one of the most frequently used foundations of statistical testing for analyzing data. In a variety of situations, from working out how a self-driving car should behave in a collision to spotting the warning signals of an impending stock market meltdown, the ability to forecast the chance of something happening is crucial. Forecasting the weather is a typical application of probability in predictive modelling, a discipline that has evolved since it first emerged in the 19th century. Probability can be used by data-driven businesses like Spotify or Netflix to forecast the type of music or movie you might like to play next.

In order to learn Probability and Statistics for Data Science, there are a few mathematical topics that you need to master, for example:

- Statistics and Probability Theory
- Probability Distributions
- Hypothesis Testing
- Statistical Modelling and Fitting
- Machine Learning
- Regression Analysis
- Bayesian Thinking and Modelling
- Markov Chains

- Mathematical Optimization (most machine learning involves optimization)
- Real Analysis and Probability
- Linear Algebra (prefer the abstract, coordinate-free kind)

- Stats theory (Classical optimality, derivation of distributions, hypothesis testing)
- Applied statistics (regression, generalized linear models, discriminant models)
- Graphical models
- Dimensionality Reduction (PCA, kernel PCA)

These are a few resources that could be used to master probability and statistics for data science.

- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition
- Statistical Inference, George Casella and Roger L. Berger
- Bayesian Data Analysis
- Mining of Massive Datasets
- All of Statistics: A Concise Course in Statistical Inference, Larry Wasserman

Data Science can be used to answer various questions in the field of research and science. However, one needs to learn probability and statistics for data science. One could work on statistical problems like :

- Identifying the tissue sample from classes
- Form an inference from statistical and location-wise demographics of an area.
- Identify numbers in a license plate from a variety of digits.

One needs to know Time Series Analysis, Bayesian Inferencing, Markov Chain Models, Monte Carlo Methods and Clustering. The below methods and algorithms incorporate advanced statistics for data science:

- Linear Regression and Logistic Regression
- Classification
- Principal Component Analysis
- Subset Selection
- Support Vector Machines
- Dimension Reduction
- Non-Linear Models

According to the problem statement, one could apply different methods on the basis of the dependent and independent variables and the prediction criteria. For example, Linear Regression is used to find the best-fit line relating the independent and dependent variables.

For any data scientist, probability and advanced statistics are the two most important skills needed in order to evaluate the models and get accurate results. One should start by building fundamental math and statistics concepts, such as

- Data Types and Business Understanding
- Descriptive and Inferential Statistics
- Probability Distributions
- Bayesian Statistics
- Univariate, Bivariate and Multivariable analysis
- Regression Techniques
- Data Visualization and Inference from Graphs
- P values and Bias Reduction
- Likelihood Ratio Tests
- Stochastic Processes.

Some important topics to study are measures of Central Tendency, Measures of Dispersion, Skewness, Kurtosis, Percentile, Conditional, and Joint Probability, Regression, Population and Sampling, Covariance and Correlation, Hypothesis Testing and Statistical Significance.

Statistics and Probability have a wide variety of industry use cases. Below mentioned are some books that are advisable to read if one wants to go into deep mathematics and have some hands-on problem-solving for Statistics and Probability in Data Science.

- The Elements of Statistical Learning
- Introduction to Probability for Data Science
- Mathematics for Machine Learning
- Practical Statistics for Data Science
- Naked Statistics: Stripping the Dread from the Data
- Bayesian Methods for Hackers
- Hands-on Mathematics for Deep Learning
- The Number Bias
- Head First Statistics

Probability and Statistics are non-negotiable skills for a Data Scientist. The key takeaways from this article included concepts of Probability and Statistics, along with some core concepts such as the role of mathematics in Data Science, hypothesis testing, regression, resampling methods, data types, and Bayes’ theorem, which would be beneficial for a data scientist. You can start interpreting machine learning models with a probabilistic perspective and see them in terms of the Bayes rule, posterior/prior probabilities, and distributions to get clarity on your data.

1. What statistics are required for data science?

To get better at basic statistics for data science, you should learn about concepts like parameter estimation, hypothesis testing, Bayesian analysis, linear regression, time series analysis, bootstrapping, sampling processes, and generalized linear models.

2. Where can I learn statistics for data science?

To learn mathematics and statistics for data science, including courses in probability and statistics, check out the links and courses at the KnowledgeHut Data Science Bootcamp job placement program.

3. Should I study probability or statistics first?

You could study either first. A common path is to start with the basics of Probability and then move on to Statistics. Together, they will come in handy in building ML models and working with the equation of a regression line or any other data science application.

4. What are three important reasons for studying statistics for Data Science?

In Data Science, statistics helps us test our predictions about a population using inferences from a given sample. It is used to gain insights and find patterns in the data with the help of graphs, tests, and visualizations. Most importantly, it helps us quantify data so that companies can make better decisions backed by statistical evidence.
