- Statistics Interview Questions and Answers for 2024

- 4.7 Rating
- 80 Question(s)
- 40 Mins of Read
- 6578 Reader(s)

The article contains statistics interview questions for freshers and experienced to get started with your preparation for the interview. The basic statistics questions for the interview include probability and statistics interview questions, data cleaning and visualization topics for data analyst profiles. The intermediate and advanced sections contain data science and machine learning statistics interview questions. This guide will help you to cover most of the statistics interview questions. Statistics is an important field for data analysts, machine learning and data science professionals. Therefore, this article also covers interview questions on statistics for data science and data analytics.


Numerical data represents numbers or figures, for example, sales amount, height of students, salary, etc. A numerical variable is further divided into two subsets, discrete and continuous. The number of students in a class or the results of a test are two examples of discrete data, which can be counted. Continuous data cannot be counted because it can take any of infinitely many values within a range; it is measured instead. For instance, continuous variables such as a person's weight or a region's area can vary by arbitrarily small quantities.

Categorical data represents categories, including things like gender, email type, colour, and more. Categorical data is further divided into nominal and ordinal variables. Ordinal categorical variables can be displayed in a certain order, such as when a product is rated as either awful, satisfactory, good, or excellent. Nominal variables can never be arranged in a hierarchy. For instance, a person's gender.

A bar chart uses rectangular vertical or horizontal bars to represent the given data. Each bar's length is proportional to the value it corresponds to. Bar charts are used to compare values across categories. They use two axes: the categories are shown on one axis and the discrete values on the other. There are a number of different bar charts available for visualizing data, but the four major types are the vertical, horizontal, stacked, and grouped bar chart.

**Vertical bar chart**

The most popular type of bar chart is the vertical bar chart. A vertical bar chart is one in which the given data is displayed on the graph using vertical bars. The measure of the data is represented by these vertical rectangular bars. The categories are listed along the x-axis, and the height of each bar along the y-axis represents the value for its category.

**Horizontal bar chart**

Charts that show the given data as horizontal bars are referred to as horizontal bar charts. The measures of the provided data are displayed in these horizontal, rectangular bars. In this style, the data categories are listed along the y-axis, and the length of each bar along the x-axis represents the value for its category.

**Stacked bar chart**

In a stacked bar chart, each bar of a normal bar chart is divided into sub-bars that are stacked on top of one another, and each sub-bar represents a level of a second categorical variable. A 100% stacked bar chart shows each sub-bar as a percentage of the total for its category, in contrast to a regular stacked bar chart, which depicts the given values directly.

**Grouped bar chart**

A grouped bar chart makes it easier to compare data from multiple categories. For levels of a single categorical variable, bars are grouped by position, with colour often designating the secondary category level within each group.

Scatter plot is a very important graph when it comes to understanding the relationship between two numerical variables. For example, consider the following table which provides the percentage marks scored and total attendance of ten students of a class.

Student | Attendance | Percentage |
---|---|---|
Student 1 | 78 | 84 |
Student 2 | 91 | 96 |
Student 3 | 66 | 70 |
Student 4 | 42 | 85 |
Student 5 | 90 | 92 |
Student 6 | 59 | 62 |
Student 7 | 83 | 75 |
Student 8 | 72 | 75 |
Student 9 | 94 | 96 |
Student 10 | 88 | 67 |

Each student's attendance is represented on the x-axis, while the percentage of marks scored is represented on the y-axis. The scatter plot could therefore help us comprehend the relationship between the two variables. We may argue that when students attend class more frequently, they tend to perform better academically. We can also spot instances that are the exception rather than the rule, like Student 4, who scored high marks despite low attendance.

Frequency distribution is a series when a number of observations with similar or closely related values are put in separate bunches or groups, each group being in order of magnitude in a series. The data are simply organised into classes in a table, and the number of cases that fall into each class is noted. It displays the frequency with which various values of a single phenomenon occur. In order to estimate frequencies of the unknown population distribution from the distribution of sample data, a frequency distribution is created.

Take a survey of 50 households in a society as an example. The number of children in each family was recorded, and the results are shown in the following frequency distribution table

No. of children | Frequency |
---|---|
0 | 12 |
1 | 24 |
2 | 13 |
3 | 0 |
4 | 1 |

As a result, frequency in the table refers to how frequently an observation occurs. The number of observations is always equal to the sum of the frequencies. We can evaluate the data's underlying distribution and base judgements on it with the aid of frequency distribution.
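As a rough sketch, the table above can be rebuilt in Python with `collections.Counter`. The raw list `children_per_household` is an assumption reconstructed from the table's counts, and `frequency_table` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Hypothetical raw survey answers, reconstructed from the table:
# 12 households with 0 children, 24 with 1, 13 with 2, none with 3, 1 with 4.
children_per_household = [0] * 12 + [1] * 24 + [2] * 13 + [4] * 1

frequency = Counter(children_per_household)


def frequency_table(counts, values):
    """Return (value, frequency) rows, including values that never occur."""
    return [(value, counts.get(value, 0)) for value in values]


table = frequency_table(frequency, range(0, 5))
for value, freq in table:
    print(f"{value} | {freq}")

# The frequencies always add up to the number of observations.
total = sum(freq for _, freq in table)
print("Total observations:", total)
```

The final sum illustrates the point that the frequencies add up to the 50 surveyed households.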

The three measures of central tendency are mean, median, and mode.

The mean, also known as the simple average, is denoted by the Greek letter µ for a population and by x̄ for a sample. By adding up each observation of a dataset and then dividing the result by the total number of observations, we may determine the dataset's mean. This is the most common measure of central tendency.

The median of an ordered set of data is its middle number. As a result, it divides the data into two halves: the higher and lower halves. The median of the first nine natural numbers, for instance, is five.

Mode is the value that occurs most often. Although it can be applied to both numerical and categorical data, it is typically used with categorical data. For instance, if 60% of the observations for a gender variable are male, then male will be the mode value, signifying the value of maximum occurrence.

To calculate the median, we first arrange the collection of numbers in ascending order. The median is then the observation located in the middle of this sorted list. For an odd number of observations, the median is the element at position (n+1)/2, where 'n' is the total number of observations. For an even number of observations, the median is the simple average of the middle two elements, located at positions n/2 and (n/2)+1.
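The procedure can be sketched in a few lines of Python. The function name `median` is chosen here for illustration (the standard library's `statistics.median` does the same job); positions in the comments are 1-based, while Python indices are 0-based.

```python
def median(values):
    """Return the middle value of the data after sorting it."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the element at 1-based position (n + 1) / 2.
        return ordered[mid]
    # Even n: average of the elements at 1-based positions n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median(range(1, 10)))   # first nine natural numbers
print(median([7, 1, 3, 5]))   # even number of observations
```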

Quantiles are values used to segment the distribution so that a specific percentage of data falls below each quantile. The median, for instance, is a quantile. It is the 0.5 quantile (50th percentile), the point where half the observations are greater than or equal to it and half are less than or equal to it. Similarly, the 25th and 75th percentiles have 25% and 75% of the observations below them respectively. If we consider a data set of the first hundred natural numbers, then the 25th, 50th, and 75th percentiles will be approximately 25, 50, and 75 respectively (the exact values depend on the interpolation method used). When the quantiles divide the data into four equal parts, they are referred to as quartiles.
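A minimal sketch of the hundred-numbers example, using `statistics.quantiles` from the Python standard library (Python 3.8+). Different tools interpolate between data points differently, so the quartiles come out close to, but not necessarily exactly, 25, 50, and 75.

```python
import statistics

data = list(range(1, 101))  # the first hundred natural numbers

# n=4 asks for the three cut points that split the data into four equal parts.
q1, q2, q3 = statistics.quantiles(data, n=4)

print("First quartile :", q1)
print("Median         :", q2)
print("Third quartile :", q3)

# Roughly a quarter of the observations lie below the first quartile.
share_below_q1 = sum(1 for x in data if x < q1) / len(data)
print("Share below Q1 :", share_below_q1)
```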

The most significant measures of dispersion for a single variable are the standard deviation and coefficient of variation, which are frequently employed in statistical formulas.
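As an illustration, both measures can be computed with the standard library. The dataset below is made up for the example, and the population formulas (`pstdev`) are used; for a sample, `statistics.stdev` would be the usual choice.

```python
import statistics

# Hypothetical observations, chosen only to keep the arithmetic simple.
observations = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.fmean(observations)
std_dev = statistics.pstdev(observations)  # population standard deviation

# The coefficient of variation expresses the spread relative to the mean,
# which makes datasets measured in different units comparable.
coefficient_of_variation = std_dev / mean

print("Mean                    :", mean)
print("Standard deviation      :", std_dev)
print("Coefficient of variation:", coefficient_of_variation)
```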

In statistics, a distribution is a function that displays the range of potential values for a variable along with their frequency. The probability for each individual observation in the sample space can be determined using a parameterized mathematical function. We utilise a statistical distribution to assess the likelihood of a specific value. The most common distributions are –

- Binomial distribution – It is a discrete distribution expressing the probability of a set of dichotomous alternatives i.e., success or failure repeated for a finite number of times.
- Poisson distribution – It is a limiting case of Binomial distribution where the number of trials is very large and probability of success is very small.
- Gaussian distribution – It is the most important continuous distribution, also known as the normal distribution which follows a symmetrical bell-shaped curve.
- Uniform distribution – All possible outcomes of a uniform distribution are equally likely. For example, when you roll a fair die, the outcomes are equally likely.
- Exponential distribution – It follows the exponential functions and is widely used for survival analysis from the expected life of a machine to the expected life of a human.
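To make the relationship between the first two distributions concrete, here is a small sketch comparing binomial and Poisson probabilities when the number of trials is large and the success probability is small. The values `n = 1000` and `p = 0.002` are assumptions picked for the illustration.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


n, p = 1000, 0.002   # many trials, rare success (assumed values)
lam = n * p          # matching Poisson rate

for k in range(5):
    print(k, round(binomial_pmf(k, n, p), 4), round(poisson_pmf(k, lam), 4))
```

The two columns should agree closely, which is what "limiting case" means in practice.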

The estimators with the lowest bias and highest efficiency are the most accurate. Without surveying the full population, you can never be entirely confident. We want to be as precise as possible. Most of the time, a confidence interval will produce reliable results. A point estimate, however, will nearly always be inaccurate but is easier to comprehend and convey.

We might be interested in predicting the value of one variable given the value of other variables after we understand the link between two or more variables. The term "target" or "dependent" or "explained" refers to the variable that is predicted based on other variables, and "independent" or "predicting" refers to the other variables that aid in estimating the target variable. The prediction is based on an average association that regression analysis has statistically determined. The formula, whether linear or not, is known as the regression equation or the explanatory equation. Real numbers are used as the output or target values for regression operations.

Think about estimating the cost of a house, for instance. In this scenario, the house price serves as your target variable. Some potential independent variables that may aid in estimating this price are the area, the year the house was built, the number of bedrooms and bathrooms, the neighbourhood, etc. Other instances of regression include predicting retail sales based on the season or agricultural output based on rainfall.

Regression analysis operates under three different categories:

- Simple and Multiple – In the case of a simple relationship only two variables are considered, for example, the influence of advertising expenditure on sales turnover. In the case of a multiple relationship, more than two variables are involved. Here, one variable is the dependent variable while the remaining variables are independent ones. For example, the turnover may depend on advertising expenditure and the income of the people.
- Linear and Non-linear – The equation of the straight-line trend, on which linear relationships are based, has no power higher than one. Thus, it results in a straight line. Curved trend lines are created when there is a non-linear relationship. These equations can, for example, have parabolic forms.
- Total and Partial – All relevant factors are taken into account while analysing total relationships. They typically are made up of multiple associations. One or more factors are taken into account in the case of a partial relationship, but not all of them, hence removing the influence of those not thought to be pertinent for a specified task.

Normal distribution is a symmetrical bell-shaped curve representing frequencies of different classes from the data. Some of the characteristics of normal distribution include:

- The mean, median and mode of the distribution coincide.
- The curve of the distribution is bell-shaped and symmetrical about the line x = mean value. This means that exactly half of the values are to the left of the centre and the other half to the right.
- The total area under the curve is 1.
- It is a limiting form of the binomial distribution where the number of trials is indefinitely large (tends to infinity) and neither the probability of success nor the probability of failure is indefinitely small.

Normal distribution is one of the most significant probability distributions in the study of statistics. This is so because a number of natural events approximately fit the normal distribution. For instance, the normal distribution is approximately observed for heights and weights of an age group, test scores, blood pressure, and the totals obtained from many rolls of a die or tosses of a coin. The normal distribution provides a good approximation when the sample size is large.

The graph reshapes when the standard deviation changes while the mean remains constant. When the standard deviation is lower, more data are seen in the centre and the tails are thinner. With a larger standard deviation, the graph flattens out, with fewer points in the middle and more points at the ends, that is, fatter tails.
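This effect can be checked numerically with `statistics.NormalDist` (Python 3.8+). The two standard deviations below are arbitrary values chosen for the comparison.

```python
from statistics import NormalDist

# Same mean, different standard deviations (assumed example values).
narrow = NormalDist(mu=0, sigma=1)
wide = NormalDist(mu=0, sigma=2)

# Height of the curve at the centre: lower sigma gives a taller peak.
peak_narrow = narrow.pdf(0)
peak_wide = wide.pdf(0)

# Probability of landing more than 3 units above the mean: higher sigma
# puts more mass in the tails.
tail_narrow = 1 - narrow.cdf(3)
tail_wide = 1 - wide.cdf(3)

print("Peak height:", peak_narrow, "vs", peak_wide)
print("Upper tail :", tail_narrow, "vs", tail_wide)

# Share of values within one standard deviation of the mean.
within_one_sd = narrow.cdf(1) - narrow.cdf(-1)
print("Within one standard deviation:", within_one_sd)
```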

Skewness is a measure of asymmetry that indicates whether the data is concentrated on one side. It allows us to get a complete understanding of the distribution of data. Based on the type, skewness is classified into three different types.

**Positive skewness or right skew**

Outliers at the top end of the range of values cause positive skewness. Extremely high numbers will cause the graph to skew to the right, showing that there are outliers present. The higher numbers slightly raise the mean above the median in this instance, meaning that the mean is higher than the median.

**No skewness or zero skew**

This is the case where no skewness is present. It denotes a distribution that is symmetric around the mean. As a result, the three values, mean, median, and mode, all coincide.

**Negative skewness or left skew**

Outliers near the lower end of the values cause negative skewness. Extremely low numbers will cause the graph to skew to the left, indicating that there are outliers present. In this instance, the mean is smaller than the median because the lower values pull the mean down from the central value.

In probability theory and statistics, a central moment is a moment of a probability distribution of a random variable about the random variable's mean.

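- The zeroth central moment is the total probability, i.e., equal to one.
- The first central moment is always equal to zero, because deviations from the mean average out (the first raw moment is the expected value or mean).
- The second central moment is the variance.
- The third central moment, once standardized by the cube of the standard deviation, gives the skewness.
- The fourth central moment, once standardized by the fourth power of the standard deviation, gives the kurtosis.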
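The list above can be illustrated with a short sketch. `central_moment` is a hypothetical helper written for this example, and the dataset is made up so that one large value produces a right skew.

```python
import statistics


def central_moment(values, order):
    """Average of (x - mean) ** order over all observations."""
    mean = statistics.fmean(values)
    return sum((x - mean) ** order for x in values) / len(values)


data = [1, 2, 3, 4, 10]  # hypothetical data with one high outlier

variance = central_moment(data, 2)
std_dev = variance ** 0.5

# Standardizing the third and fourth central moments gives the usual
# skewness and kurtosis coefficients.
skewness = central_moment(data, 3) / std_dev**3
kurtosis = central_moment(data, 4) / std_dev**4

print("Zeroth moment:", central_moment(data, 0))
print("First moment :", central_moment(data, 1))
print("Variance     :", variance)
print("Skewness     :", skewness)
print("Kurtosis     :", kurtosis)
print("Mean vs median:", statistics.fmean(data), statistics.median(data))
```

Because the outlier sits at the high end, the skewness is positive and the mean exceeds the median, matching the description of right skew.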

For univariate analysis of a numerical variable, the must-use visualizations are histograms and the box and whisker plot (or box plot). Scatter plots are used to analyse the relationship between two numerical variables.

**Histograms**

A histogram is a graphic representation of the distribution of data that has been grouped into classes. It is a type of frequency chart that is made up of a number of rectangles. Each piece of data is sorted, then each value is assigned to the proper class interval. The frequency of each class interval is determined by the number of data values that fall within it. A specific class of data is represented by each rectangle in the histogram. The width of the rectangle represents the width of the class. It is commonly used to determine

**Box and whisker plot (box plot)**

A box plot shows the maximum and minimum values, the first and third quartiles, and the median value, which is a measure of central tendency. In addition to these quantities, it also explains the symmetry and variability of the data distribution. Outliers in the dataset are frequently visualised using this visualization.

**Scatter plot**

The scatterplot is a very helpful and effective tool that is frequently used in regression analysis. A pair of observed values for the dependent and independent variables are represented by each point. Before selecting a suitable model, it enables graphically determining whether a relationship between two variables exists. These scatterplots are also very helpful for residual analysis because they let you check whether the model is a good fit or not.

Covariance and the correlation coefficient reveal the direction and the strength of the relationship between two variables.

Covariance is a measure of how two random variables in a data set change jointly. When two variables move in the same direction, this is referred to as positive covariance. A negative covariance denotes an inverse relationship between the variables, or movement in opposite directions. For instance, a student's performance on a particular examination improving with increased attendance is a positive relationship, whereas a decrease in demand caused by a rise in the price of an item is a negative relationship. When the covariance value is zero, the variables have no linear relationship with one another (this alone does not guarantee that they are independent). If the covariance value is higher than 0, it means that the variables are positively correlated and move in the same direction. The variables are negatively correlated and move in opposite directions when the covariance has a negative value.

Covariance Value | Effect on Variables |
---|---|
Greater than 0 | Positive Correlation (X & Y variables move together) |
Equal to 0 | No Linear Correlation (X & Y show no linear relationship) |
Less than 0 | Negative Correlation (X & Y variables move in opposite directions) |

Similar information is given by the correlation coefficient and the covariance. The advantage of the correlation coefficient over covariance is that it always takes a value between negative one and one. A perfect positive correlation exists between the variables under study when the correlation coefficient is 1. In other words, as one moves, the other follows suit proportionally in the same direction. A less than perfect positive correlation is present if the correlation coefficient is less than one but still larger than zero. The correlation between the two variables is stronger as the correlation coefficient approaches one. There is no observable linear relationship between the variables when the correlation coefficient is zero. That means it is difficult to predict the movement of the other variable if one variable moves. The variables are perfectly negatively or inversely correlated if the correlation coefficient is negative one. One variable will drop proportionally in response to an increase in the other, so the variables move in opposite directions. If the correlation coefficient is between negative one and zero, it means that the negative correlation is not perfect. The negative correlation strengthens as the coefficient gets closer to negative one.
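As a sketch, both quantities can be computed by hand for the attendance and marks data of the ten students shown earlier in this article. The helper names `covariance` and `correlation` are chosen for this example, and the population formulas (dividing by n) are an assumption; dividing by n - 1 changes the covariance but not the correlation.

```python
import math

attendance = [78, 91, 66, 42, 90, 59, 83, 72, 94, 88]
percentage = [84, 96, 70, 85, 92, 62, 75, 75, 96, 67]


def covariance(xs, ys):
    """Average product of the deviations of xs and ys from their means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Covariance rescaled by both standard deviations, so it lies in [-1, 1]."""
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)


print("Covariance :", covariance(attendance, percentage))
print("Correlation:", correlation(attendance, percentage))
```

The covariance depends on the units of both variables, while the correlation is unit-free, which is why the latter is easier to interpret.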


When we repeatedly draw samples from a population and compute the mean of each sample, the sample means form a fresh dataset. There is a certain distribution of these values. The phrase "sampling distribution" is used to describe a distribution made out of samples. We are dealing with a sampling distribution of the mean in this instance. These values are distinct when we look at them closely, but they are centred on one particular value.

Every sample mean in this analysis approximates the population mean. The value they centre on may provide a very accurate indication of the population mean. In fact, we anticipate getting a pretty accurate approximation of the population mean if we take the average of those sample means. We see a normal distribution when we visualise the distribution of the sample means, and the Central Limit Theorem confirms that. Provided the sample size is sufficiently large, the sampling distribution of the mean will resemble a normal distribution regardless of the underlying population distribution, whether it be binomial, exponential, or another type.

As a result, even when the population is not normally distributed, we can still conduct tests, work through issues, and draw conclusions using the normal distribution according to the central limit theorem.
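A small simulation can make this tangible. The sketch below draws samples from an exponential population (which is strongly skewed) and looks at the distribution of the sample means. The sample size, number of samples, and seed are arbitrary assumptions.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50     # observations per sample (assumed)
NUM_SAMPLES = 2000   # number of samples drawn (assumed)

# Exponential population with rate 1: mean 1 and standard deviation 1.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)

# For a normal distribution, about 68% of values lie within one standard
# deviation of the centre.
within_one_sd = sum(
    1 for m in sample_means if abs(m - mean_of_means) <= spread_of_means
) / NUM_SAMPLES

print("Average of sample means:", mean_of_means)   # close to population mean 1
print("Spread of sample means :", spread_of_means) # close to 1 / sqrt(50)
print("Share within one SD    :", within_one_sd)
```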

There are various performance measures or metrics that can help to evaluate the performance of a classification model. However, the right choice depends on the kind of problem we are dealing with. At times, accuracy might not be a good measure for evaluation and we need to focus on certain aspects of the results rather than the accuracy as a whole. The most common metrics used for the purpose are –

**Confusion matrix**

A confusion matrix is one of the evaluation techniques for machine learning models in which you compare the results of all the predicted and actual values. Confusion matrix helps us to derive several different metrics for evaluation purpose such as accuracy, precision, recall, and F1 score which are widely used across different classification use cases.

**ROC AUC curve**

The Receiver Operating Characteristic (ROC) curve is a probability curve that plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different threshold values, separating the signal from the noise. A classifier's capacity to distinguish between classes is measured by the Area Under the Curve (AUC). A higher AUC indicates that the model separates the positive and negative classes better across thresholds. The classifier can correctly discriminate between all Positive and Negative class points when AUC is equal to 1. The classifier would be predicting all negatives as positives and vice versa when AUC is equal to 0.

**Jaccard index**

The Jaccard index is also known as the Jaccard similarity coefficient. If y is the set of actual labels and ŷ is the set of predicted labels, then we can define the Jaccard index as the size of the intersection divided by the size of the union of the two labelled sets, that is, |y ∩ ŷ| / |y ∪ ŷ|.

Suppose you have a total of 50 observations, out of which your model predicts 41 correctly. Then the Jaccard index is given as 41 / (50 + 50 - 41) = 0.69. A Jaccard index of 0.69 means that the predicted and actual label sets overlap with a similarity of 69%. A Jaccard index ranges from 0 to 1, where an index value of 1 implies perfect agreement between the predictions and the actual labels.
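The arithmetic of the example above can be reproduced with a few lines of Python. `jaccard_index` is a hypothetical helper that works from counts rather than from the label sets themselves.

```python
def jaccard_index(n_correct, n_actual, n_predicted):
    """Intersection size divided by union size of actual and predicted labels."""
    union = n_actual + n_predicted - n_correct
    return n_correct / union


# 50 actual labels, 50 predicted labels, 41 of them in agreement.
score = jaccard_index(n_correct=41, n_actual=50, n_predicted=50)
print(round(score, 2))
```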

**Log loss**

Log loss or logarithmic loss measures the performance of a classifier where the predicted output is a probability value between 0 and 1. We can calculate the log loss using the log loss equation, which measures how far each predicted probability is from the actual label. An ideal classifier has a log loss of 0, so a classifier with a lower log loss performs better.
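A minimal sketch of the binary log loss, assuming labels coded as 0 or 1 and predicted probabilities for class 1. The clipping constant `eps` is an implementation detail added here to avoid taking the logarithm of zero; the example labels and probabilities are made up.

```python
import math


def log_loss(actual, predicted_prob, eps=1e-15):
    """Average negative log-likelihood of the actual binary labels."""
    total = 0.0
    for y, p in zip(actual, predicted_prob):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0 and 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(actual)


labels = [1, 0, 1, 1, 0]                   # hypothetical actual labels
confident = [0.9, 0.1, 0.8, 0.95, 0.2]     # mostly right and sure of it
hesitant = [0.6, 0.4, 0.55, 0.6, 0.45]     # right, but barely

print("Confident model:", log_loss(labels, confident))
print("Hesitant model :", log_loss(labels, hesitant))
```

Both hypothetical models classify every observation correctly at a 0.5 threshold, yet the confident one earns the lower log loss because its probabilities sit closer to the true labels.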

Confusion matrix is one of the evaluation methods for machine learning models that compares the outcomes of all the expected and actual values.

The figure representing confusion matrix has four different cases:

- There are five instances where the predicted value and the actual value are both true. This is referred to as a True Positive case, where True denotes that the values are identical (true and true) and Positive denotes that the situation is true. Example: A diabetes test is positive for a diabetic patient.
- There are four instances where both the predicted value and the actual value are false. This is referred to as a True Negative situation, where True denotes identical numbers (false and false) and Negative denotes a negative outcome. Example: A diabetes test is negative for a non-diabetic patient.
- In three instances, the projected value is true, but the actual value is false. False denotes that the values are different (false and true), while Positive means that the predicted value is positive. This is referred to as a False Positive event. Example: A diabetes test is positive for a non-diabetic patient.
- There are two situations where the projected value is false, whereas the actual value is true. This situation is known as a False Negative Case, where False denotes that the values (true and false) are different, and Negative denotes that the predicted value is negative. Example: A diabetes test is negative for a diabetic patient.

In this matrix, the values in green are correctly identified by the model and the values in red are wrongly identified by the model. Confusion matrix can also be used for non-binary target variables.
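The four cases can be counted programmatically. In the sketch below, the label lists are constructed by hand (an assumption) so that they reproduce the counts described above: five true positives, four true negatives, three false positives, and two false negatives.

```python
def confusion_counts(actual, predicted):
    """Count the four confusion-matrix cells for boolean labels."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["TP"] += 1   # predicted positive, actually positive
        elif not a and not p:
            counts["TN"] += 1   # predicted negative, actually negative
        elif p and not a:
            counts["FP"] += 1   # predicted positive, actually negative
        else:
            counts["FN"] += 1   # predicted negative, actually positive
    return counts


# Hypothetical diabetes-test results arranged to match the four cases.
actual = [True] * 5 + [False] * 4 + [False] * 3 + [True] * 2
predicted = [True] * 5 + [False] * 4 + [True] * 3 + [False] * 2

counts = confusion_counts(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)

print(counts)
print("Accuracy:", accuracy)
```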

A Type 1 Error, also known as a False Positive event, occurs when the predicted value is positive but the actual value is negative. When the predicted value is negative but the actual value is positive, this is known as a False Negative event and results in a Type 2 Error. For instance, if we consider rain to be a positive event, then your device predicting that it would rain today when it did not is a type 1 error, while your device predicting that it would not rain today when it actually did is a type 2 error.

Also known as the F1-Score or the F-Score, the F-Measure is a numerical value. It evaluates how accurate a test is. In a perfect scenario, both the precision and recall values would be high. However, there is usually a trade-off between recall and precision, and we often have to prioritise one over the other. The two components of the F1 score are precision and recall. The F1 score aims to combine the precision and recall measures into a single metric. This F-score is what we use to compare two models. In terms of the formula, the F1 score is the harmonic mean of precision and recall and is given by –

**F1 = 2 × (Precision × Recall) / (Precision + Recall)**

The value of F1 score ranges between 0 and 1. An F1 score of 1 is regarded as ideal, whereas a score of 0 indicates that the model is a complete failure.

The sum of the squared differences between the predicted values and the dependent variable's mean is known as the sum of squares due to regression (SSR). It describes how much of the variability in the data is explained by our regression line. If this value is the same as the SST, our regression model perfectly captures the observed variability.

The sum of the squared differences between the actual values and the predicted values is known as the Sum of Squared Error (SSE). Usually, we wish to reduce the error. The regression's estimating power increases with decreasing error.

The total sum of squares (SST) is the sum of the squared differences between the actual values and their mean. It measures the overall variability of the data set and is equal to the sum of the variability described by the regression line, or SSR, and the unexplained variability, or SSE.
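The identity SST = SSR + SSE can be verified on a small example. The sketch below fits a simple least-squares line by hand; the five data points are made up, and the identity holds here because the line is an ordinary least squares fit with an intercept.

```python
# Hypothetical observations for one predictor x and one target y.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Ordinary least squares estimates for a straight line y = intercept + slope * x.
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * xi for xi in x]

sst = sum((yi - mean_y) ** 2 for yi in y)                    # total variability
ssr = sum((pi - mean_y) ** 2 for pi in predicted)            # explained part
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))    # unexplained part

print("SST:", sst)
print("SSR:", ssr)
print("SSE:", sse)
print("R squared (SSR / SST):", ssr / sst)
```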

One common way to identify outliers is through the inner limits, or whiskers, derived from the interquartile range. The interquartile range covers the middle 50% of the observations, with the median located inside this range. When there are extreme minimum or maximum values, we can define cut-off values to determine the outliers using the formula –

- Lower limit = First quartile – (1.5 x interquartile range)
- Upper limit = Third quartile + (1.5 x interquartile range)

The most extreme observations that still lie within these two limits are known as adjacent points. Any observation outside the interval between the lower limit and the upper limit can be termed an outlier in the dataset.
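The cut-off rule can be sketched with Python's standard library as follows. The sample values are hypothetical, and the upper limit is taken as the third quartile plus 1.5 times the interquartile range. Note that `statistics.quantiles` uses one particular quartile convention (its default "exclusive" method), so other tools may give slightly different limits.

```python
from statistics import quantiles

# Hypothetical sample; 95 is an obviously extreme value.
data = [10, 12, 13, 14, 15, 16, 18, 95]

# quantiles(..., n=4) returns the three quartile cut points.
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [v for v in data if v < lower_limit or v > upper_limit]
print(lower_limit, upper_limit, outliers)
```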

Precision is the ratio of the correctly identified positive classes to the sum of the predicted positive classes. The predicted positive classes are the ones which are predicted positive irrespective of the actual value being positive or negative, that is, true positive and false positive classes. This ratio tells us, out of all the observations we predicted as positive, how many are actually positive.

**Precision = TP / (TP + FP)**

Recall is the ratio of the correctly identified positive classes to the sum of the actual positive classes. The actual positive classes can be predicted as positive or negative, that is, true positive and false negative. This ratio tells us, out of all the actually positive observations, how many we predicted correctly.

**Recall = TP / (TP + FN)**
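A minimal sketch of these two ratios, together with the F1 score as their harmonic mean, is shown below. The function name and the counts (5 true positives, 3 false positives, 2 false negatives) are illustrative assumptions, and the sketch assumes the denominators are non-zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=5, fp=3, fn=2)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```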

Linear regression estimated with OLS rests on five assumptions:

**Linearity**

The regression assumes that the relationship between the predictors and the target is linear. If the relationship involves higher-degree terms of the variables, linear regression will not produce good predictions.

**No endogeneity**

The issue of endogeneity arises when a variable that is related to both the target and the predictors is not included in the model. Its effect is absorbed into the error term, so endogeneity is a situation in which a predictor in a linear regression model is correlated with the error term. We call such predictors endogenous variables.

**Normality and homoscedasticity**

This assumes that the error term is normally distributed, and the expected value of error is 0, meaning that we expect to have no error on average. Homoscedasticity assumes that the variance is constant for the error term.

**No autocorrelation**

This assumes that the covariance between any two error terms is zero.

**No multicollinearity**

When two or more variables in our regression are strongly correlated, this situation is referred to as multicollinearity. The OLS assumptions assume that there are no strongly correlated variables in our analysis.

One can standardise any distribution. The process of standardisation involves transforming a variable into one with a mean of zero and a standard deviation of one.

Standardisation is also possible for normal distributions. The result is known as a standard normal distribution, which is represented by the letter Z. The standardised variable is referred to as the Z-score, and the Z-score formula defines how variables are standardised. We first determine a variable's mean and standard deviation. The mean is then subtracted from each observed value of the variable, and the result is divided by the standard deviation.
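The steps above can be sketched in a few lines. The sample values are hypothetical, and the population standard deviation (`pstdev`) is used here as an assumption; with a sample standard deviation the result would differ slightly.

```python
from statistics import mean, pstdev

values = [4, 8, 15, 16, 23, 42]  # hypothetical observations

mu = mean(values)
sigma = pstdev(values)  # population standard deviation

# Subtract the mean from each value, then divide by the standard deviation.
z_scores = [(v - mu) / sigma for v in values]

print([round(z, 3) for z in z_scores])
```

After the transformation the values have a mean of zero and a standard deviation of one, whatever the original scale was.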

Skewed data refers to data whose distribution is asymmetric, often because it contains outliers. Outliers are known to have a negative influence on the model’s predictions and thus need to be dealt with. However, it is not always advisable to remove the outliers, so we can handle them through certain transformations instead. The common transformations applied to the data are the log transformation, the square root transformation, and the Box-Cox transformation.

**Log transformation**

The logarithmic transformation is one of the most helpful and popular transformations. In fact, it might be a good idea to use the dependent variable's logarithm as a replacement before doing a linear regression. Such an operation can stabilise the target variable's variance and bring the transformed variable's distribution closer to normal.

**Square root transformation**

If there are any outlier values that are exceptionally large, you might consider using the square root transformation. The transformation can help scale them down to a much lower value in comparison. A limitation of this transformation is that the square root of a negative number is not a real number.

**Box-Cox transformation**

The Box-Cox transformation is yet another method that can help to transform skewed data towards normality. It has a controlling parameter lambda, whose value is commonly searched in the range from -5 to 5. Initially, it was only used in presence of positive values, but modifications have been made to the transformation to take care of negative values as well.
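To make the three transformations concrete, the sketch below applies them to a small, made-up right-skewed sample and compares a simple moment-based skewness measure before and after. The data, the helper names, and the fixed lambda are assumptions; the `box_cox` helper implements the standard one-parameter form for positive values and does not search for the best lambda.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def box_cox(x, lam):
    """One-parameter Box-Cox transform for a positive value x."""
    if x <= 0:
        raise ValueError("this form of Box-Cox needs positive values")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam


raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]  # hypothetical right-skewed data
log_values = [math.log(v) for v in raw]
sqrt_values = [math.sqrt(v) for v in raw]
box_cox_values = [box_cox(v, 0.25) for v in raw]  # lambda picked arbitrarily

for name, vals in [("raw", raw), ("sqrt", sqrt_values),
                   ("log", log_values), ("box-cox", box_cox_values)]:
    print(f"{name:8s} skewness = {skewness(vals):.3f}")
```

On this sample the log transformation reduces the skewness more than the square root does, although neither removes it entirely.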

A one-sided or one-tailed test on a population parameter is a hypothesis test in which the values for which we can reject the null hypothesis are located exclusively in one tail of the probability distribution. For instance, if "The mean height of men in India is equal to or less than 5 feet 6 inches" is the null hypothesis, then the alternative hypothesis would be "The mean height of men in India is higher than 5 feet 6 inches." This is a one-sided test because the alternate hypothesis, i.e., higher than 5 feet 6 inches, only considers one end of the distribution.

A two-sided test for a population is a hypothesis test used when comparing an estimate of a parameter to a given value versus the alternative hypothesis that the parameter is not equal to the stated value. If the null hypothesis is, for instance, "The mean height of men in India is equal to 5 feet 6 inches," then the alternative hypothesis would be, "The mean height of men in India is either less than or greater than 5 feet 6 inches but not equal." The alternate hypothesis, greater than or less than 5 feet 6 inches, deals with both extremes of the distribution, making this a two-tailed test.

A resampling technique called cross-validation uses several data subsets to evaluate and train a model across a number of iterations. It is typically applied in situations where the objective is prediction, and one wishes to evaluate how well a predictive model will function in real-world situations. Due to sampling variability between training and test set, our model gives better prediction on training data but fails to generalize on test data. This leads to low training error rate and high test error rate. When we split the dataset into training, validation and test set, we only use a subset of data. To overcome these issues, we can adopt various cross validation approaches, namely, K-fold cross validation, stratified k-fold cross validation, leave one out cross validation, stratified shuffle split, etc.

**Leave One Out Cross Validation (LOOCV)**

A dataset with n observations is divided into n-1 observations as the training data and 1 observation as test data. The process is iterated for each data point. Therefore, the execution is expensive. Also, for an outlier in test data, variability in MSE is much higher.

**K-Fold Cross Validation**

K-fold cross validation randomly divides the data into k groups, or folds, of equal size. The first fold is kept for testing and the model is trained on the remaining k-1 folds. The process is repeated k times, and each time a different fold is used for validation. Typically, k in k-fold is 5 or 10. LOOCV is a variant of k-fold where k = n. K-fold is less computationally expensive than LOOCV, although the model still has to be trained k times.

**Stratified K-fold Cross Validation**

Each fold preserves approximately the same class proportions as the full dataset, so every fold contains instances of each class. This approach ensures that one class of data is not over-represented, especially when the target variable is unbalanced.
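The splitting logic behind plain k-fold cross validation can be sketched without any library, as below. The function name `k_fold_indices` and the seed are hypothetical; the sketch only produces index splits and leaves model training out. Setting `k` equal to the number of samples gives the leave-one-out scheme. Stratification is not implemented here.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment to folds
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", sorted(train_idx), "test:", sorted(test_idx))
```

Each observation appears in exactly one test fold, so every data point is used for both training and validation across the k iterations.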

Non-parametric models do not assume a fixed functional form for the data. Instead, they are free to learn the structure from the data, controlled by some hyperparameters. These models often have low bias and high variance, which makes them prone to overfitting. Decision trees are an example of non-parametric models.

Bayes' theorem provides a formula for the probability of an event given that a related condition has occurred. So, it is possible to think of Bayes' theorem as a formula for the conditional probability of an occurrence. For instance, let us consider there are 10 bags containing different coloured marbles. Bayes’ theorem helps to determine the probability that a marble was drawn from a particular bag, given the condition that the marble is red in colour. If A is the event of drawing the marble from a particular bag and B is the event of drawing a red marble, then the formula for Bayes’ theorem is given by:

**P(A|B) = P(B|A) ∙ P(A) / P(B)**

Where P(A|B) is the probability of event A to occur given the condition that event B has already occurred.
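A scaled-down version of the marble example can be worked through numerically. The sketch below assumes only three bags, each equally likely to be chosen, with made-up proportions of red marbles; none of these numbers come from the text above.

```python
# Hypothetical setup: three bags, each chosen with equal probability.
priors = [1 / 3, 1 / 3, 1 / 3]    # P(A_i): probability of picking bag i
likelihood_red = [0.2, 0.5, 0.8]  # P(B | A_i): share of red marbles in bag i

# P(B): total probability of drawing a red marble.
p_red = sum(p * l for p, l in zip(priors, likelihood_red))

# P(A_i | B) = P(B | A_i) * P(A_i) / P(B)
posteriors = [p * l / p_red for p, l in zip(priors, likelihood_red)]

for i, post in enumerate(posteriors, start=1):
    print(f"P(bag {i} | red) = {post:.3f}")
```

Seeing a red marble shifts the probability towards the bag with the largest share of red marbles, even though all bags were equally likely beforehand.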

The following procedures are typically used when evaluating hypotheses about a sample:

- Formulate the null hypothesis (H0) and alternate hypothesis (H1).
- Determine the significance level, alpha for the hypothesis test.
- Check for one-tailed or two-tailed test based on the alternate hypothesis.
- Compute the test statistic from the sample and the critical value for the chosen significance level.
- Determine the acceptance and rejection regions in accordance with the critical value.
- Compare the test statistic with the critical value: if it falls in the acceptance region then do not reject the null hypothesis, else reject it.
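The procedure can be sketched for one simple case: a two-tailed z-test on a mean with a known population standard deviation. All the numbers below (hypothesised mean of 66 inches, sample mean of 67, standard deviation of 3, sample size of 36) are assumptions made up for the illustration.

```python
from math import sqrt
from statistics import NormalDist

# H0: mean height = 66 inches; H1: mean height != 66 inches (two-tailed).
mu_0 = 66.0         # hypothesised population mean (assumed)
sample_mean = 67.0  # observed sample mean (assumed)
sigma = 3.0         # known population standard deviation (assumed)
n = 36              # sample size (assumed)
alpha = 0.05        # significance level

# Test statistic.
z = (sample_mean - mu_0) / (sigma / sqrt(n))

# Critical value for a two-tailed test: alpha is split across both tails.
critical = NormalDist().inv_cdf(1 - alpha / 2)

# Rejection region: |z| greater than the critical value.
reject_null = abs(z) > critical
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z={z:.2f} critical={critical:.3f} p={p_value:.4f} reject={reject_null}")
```

When the population standard deviation is unknown and the sample is small, a t-test would normally be used instead of the z-test shown here.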

Time-series data is data that is collected at different points in time. Autocorrelation, seasonality, and stationarity are three main characteristics of a time series.

**Autocorrelation**

Autocorrelation refers to the similarity between observations as a function of the time lag between them. A peak in the autocorrelation plot at a particular lag indicates the period after which the pattern in the series repeats.
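As a small illustration, the sketch below computes the sample autocorrelation of a made-up series that repeats every four steps. The helper name and the series are assumptions; the estimator is the usual one that divides the lagged cross-products by the total sum of squared deviations.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    m = sum(series) / n
    denom = sum((v - m) ** 2 for v in series)
    num = sum((series[t] - m) * (series[t + lag] - m) for t in range(n - lag))
    return num / denom


# Hypothetical series with a repeating pattern of length 4.
series = [1, 2, 3, 4] * 6

for lag in range(1, 9):
    print(lag, round(autocorrelation(series, lag), 3))
```

The autocorrelation is highest at lags 4 and 8, which matches the length of the repeating pattern, and negative at lag 2, where the series is half a cycle out of step with itself.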

**Seasonality**

Seasonality refers to periodic fluctuations. Period can give the length of the season. For instance, the amount of electricity consumed varies greatly from summer to winter, and online sales peak around Diwali before dipping again.

**Stationarity**

Stationarity means that the statistical properties do not change over time, that is, the mean and variance are constant and the covariance is independent of time. For example, stock prices are not a stationary process. For modelling, we would prefer to have a stationary time series. When a series is not stationary, transformations such as differencing can be applied to make it stationary.

The Kaplan-Meier curve is an estimate and graphical representation of the survival function. It displays the probability that a subject will survive beyond time t. Plotting the estimated survival function against time produces the curve.

Numerical data represents numbers or figures, for example, sales amount, height of students, salary, etc. A numerical variable is further divided into two subsets, discrete and continuous. The number of students in a class or the results of a test are two examples of discrete data that can typically be counted in a finite way. Continuous data cannot be counted because it can take any value within a range. For instance, continuous variables such as a person's weight or a region's area can vary by arbitrarily small quantities.

Categorical data represents categories, including things like gender, email type, colour, and more. Categorical data is further divided into nominal and ordinal variables. Ordinal categorical variables can be displayed in a certain order, such as when a product is rated as either awful, satisfactory, good, or excellent. Nominal variables can never be arranged in a hierarchy. For instance, a person's gender.

A bar chart uses rectangular vertical or horizontal bars to represent the given data. Each bar's length is proportionate to the value it corresponds to. Bar charts are used to compare values across categories. With the use of two axes, a bar chart depicts the values on one axis and the categories on the other. There are a number of different bar charts available for visualizing data, but the four major types are the vertical, horizontal, stacked, and grouped bar chart.

**Vertical bar chart**

The most popular type of bar chart is the vertical bar chart, in which the given data is displayed on the graph using vertical rectangular bars. The categories are listed along the x-axis, and the height of each bar along the y-axis represents the measure of the corresponding category.

**Horizontal bar chart**

Charts that show the given data as horizontal bars are referred to as horizontal bar charts. The measures of the provided data are displayed in these horizontal, rectangular bars. In this style, the data categories are listed along the y-axis, and the length of each bar along the x-axis represents its value.

**Stacked bar chart**

In a stacked bar chart, each bar of a normal bar chart is divided into sub-bars that are stacked on top of one another, and each sub-bar represents a level of a second categorical variable. A 100% stacked bar chart represents the given data as the percentage that each level contributes to the total of its category, in contrast to a stacked bar chart that directly depicts the given data.

**Grouped bar chart**

A grouped bar chart makes it easier to compare data from multiple categories. For levels of a single categorical variable, bars are grouped by position, with colour often designating the secondary category level within each group.

Scatter plot is a very important graph when it comes to understanding the relationship between two numerical variables. For example, consider the following table which provides the percentage marks scored and total attendance of ten students of a class.

Student | Attendance | Percentage |
---|---|---|
Student 1 | 78 | 84 |
Student 2 | 91 | 96 |
Student 3 | 66 | 70 |
Student 4 | 42 | 85 |
Student 5 | 90 | 92 |
Student 6 | 59 | 62 |
Student 7 | 83 | 75 |
Student 8 | 72 | 75 |
Student 9 | 94 | 96 |
Student 10 | 88 | 67 |

The attendance of each student is represented on the x-axis, while the percentage of marks scored is represented on the y-axis. The scatter plot could therefore help us comprehend the relationship between the two variables. We may argue that when students attend class more frequently, they tend to perform better academically. We can also spot instances that are the exception rather than the rule, like Student 4, who scored 85% despite an attendance of only 42.

A frequency distribution is a series in which observations with similar or closely related values are put into separate groups, with the groups arranged in order of magnitude. The data are simply organised into classes in a table, and the number of cases that fall into each class is noted. It displays the frequency with which various values of a single phenomenon occur. A frequency distribution is created in order to estimate the frequencies of the unknown population distribution from the distribution of the sample data.

Take a survey of 50 households in a society as an example. The number of children in each family was recorded, and the results are shown in the following frequency distribution table:

No. of children | Frequency |
---|---|
0 | 12 |
1 | 24 |
2 | 13 |
3 | 0 |
4 | 1 |

As a result, frequency in the table refers to how frequently an observation occurs. The number of observations is always equal to the sum of the frequencies. We can evaluate the data's underlying distribution and base judgements on it with the aid of frequency distribution.

The three measures of central tendency are mean, median, and mode.

The mean, also known as the simple average, is denoted by the Greek letter µ for a population and by x̄ for a sample. By adding up each observation of a dataset and then dividing the result by the total number of observations, we may determine the dataset's mean. This is the most common measure of central tendency.

The median of an ordered set of data is its middle number. As a result, it divides the data into two halves: the higher and lower halves. The median of the first nine natural numbers, for instance, is five.

Mode is the value that occurs most often. Although it can be applied to both numerical and categorical data, it is most commonly used for categorical data. For instance, if 60% of the observations for a gender variable are male, then male will be the mode value, signifying the value of maximum occurrence.

To calculate the median, we first arrange the collection of numbers in ascending order and then locate the observation in the middle of this sorted list. For an odd number of total observations, the median is the element at position (n+1)/2, where 'n' is the total number of observations. If the total number of observations is even, the median is the simple average of the middle two elements, located at positions n/2 and (n/2)+1.
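The three measures can be checked on the household survey from the frequency table above. The sketch below expands the table into individual observations and also includes a hand-written median, `median_by_position` (a hypothetical helper name), that follows the position rule just described.

```python
from statistics import mean, median, mode

# Frequency table from the household survey: number of children -> households.
frequency = {0: 12, 1: 24, 2: 13, 3: 0, 4: 1}
observations = [value for value, count in frequency.items() for _ in range(count)]


def median_by_position(values):
    """Median via the (n+1)/2 rule for odd n and the middle pair for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]  # 1-based position (n+1)/2
    # 1-based positions n/2 and n/2 + 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


survey_mean = mean(observations)
survey_median = median_by_position(observations)
survey_mode = mode(observations)

print(survey_mean, survey_median, survey_mode)
```

For this survey the mean is 1.08 children per household, while the median and the mode are both 1.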

Quantiles are values that segment the distribution so that a specific percentage of the data falls below each quantile. The median, for instance, is a quantile. It can also be referred to as the 50th percentile, the point where half the observations are less than or equal to it and half are greater than or equal to it. Similarly, the 25th and 75th percentiles have 25% and 75% of the observations below them respectively. If we consider a data set of the first hundred natural numbers, then the 25th, 50th, and 75th percentiles will be approximately 25, 50, and 75 respectively. Quantiles that divide the data into four equal parts are referred to as quartiles.

The most significant measures of dispersion for a single variable are the standard deviation and coefficient of variation, which are frequently employed in statistical formulas.

In statistics, a distribution is a function that displays the range of potential values for a variable along with their frequency. The probability for each individual observation in the sample space can be determined using a parameterized mathematical function. We utilise a statistical distribution to assess the likelihood of a specific value. The most common distributions are –

- Binomial distribution – It is a discrete distribution expressing the probability of a set of dichotomous alternatives i.e., success or failure repeated for a finite number of times.
- Poisson distribution – It is a limiting case of Binomial distribution where the number of trials is very large and probability of success is very small.
- Gaussian distribution – It is the most important continuous distribution, also known as the normal distribution which follows a symmetrical bell-shaped curve.
- Uniform distribution – All possible outcomes of a uniform distribution are equally likely. For example, when you roll a fair die, the outcomes are equally likely.
- Exponential distribution – It follows the exponential functions and is widely used for survival analysis from the expected life of a machine to the expected life of a human.

The estimators with the lowest bias and highest efficiency are the most accurate. Without surveying the full population, you can never be entirely confident. We want to be as precise as possible. Most of the time, a confidence interval will produce reliable results. A point estimate, however, will nearly always be inaccurate but is easier to comprehend and convey.

We might be interested in predicting the value of one variable given the value of other variables after we understand the link between two or more variables. The term "target" or "dependent" or "explained" refers to the variable that is predicted based on other variables, and "independent" or "predicting" refers to the other variables that aid in estimating the target variable. The prediction is based on an average association that regression analysis has statistically determined. The formula, whether linear or not, is known as the regression equation or the explanatory equation. Real numbers are used as the output or target values for regression operations.

Think about estimating the cost of a house, for instance. In this scenario, the house price serves as your target variable. Some potential independent variables that may aid in estimating this price are the area, the year the house was built, the number of bedrooms and bathrooms, the neighbourhood, etc. Other instances of regression include predicting retail sales based on the season or agricultural output based on rainfall.

Regression analysis operates under three different categories:

- Simple and Multiple – In the case of a simple relationship only two variables are considered, for example, the influence of advertising expenditure on sales turnover. In the case of a multiple relationship, more than two variables are involved. Here, one variable is the dependent variable while the remaining variables are independent ones. For example, the turnover may depend on advertising expenditure and the income of the people.
- Linear and Non-linear – The equation of the straight-line trend, on which the linear relationships are based, has no power higher than one. Thus, they result in a straight line. Curved trend lines are created when there is a non-linear relationship. These equations can take parabolic or other curved forms.
- Total and Partial – All relevant factors are taken into account while analysing total relationships. They typically are made up of multiple associations. One or more factors are taken into account in the case of a partial relationship, but not all of them, hence removing the influence of those not thought to be pertinent for a specified task.

Normal distribution is a symmetrical bell-shaped curve representing frequencies of different classes from the data. Some of the characteristics of normal distribution include:

- The mean, median and mode of the distribution coincide.
- The curve of the distribution is bell-shaped and symmetrical about the line x = mean value. This means that exactly half of the values are to the left of the centre and the other half to the right.
- The total area under the curve is 1.
- It is a limiting form of the binomial distribution where the number of trials is indefinitely large (tends to infinity) and neither the probability of success nor the probability of failure is indefinitely small.

Normal distribution is one of the most significant probability distributions in the study of statistics. This is so because a number of natural events approximately fit the normal distribution. For instance, the normal distribution is observed for heights and weights of an age group, test scores, and blood pressure. By contrast, a single roll of a die follows a uniform distribution and incomes are typically right-skewed, although sums or averages of many such values tend towards the normal distribution. The normal distribution provides a good approximation when the sample size is large.

The graph reshapes when the standard deviation changes while the mean remains constant. When the standard deviation is lower, more data are concentrated in the centre and the tails are thinner. A larger standard deviation flattens the graph, with fewer points in the middle and more points at the ends, that is, fatter tails.

Skewness is a measure of asymmetry that indicates whether the data is concentrated on one side. It allows us to get a complete understanding of the distribution of data. Based on the type, skewness is classified into three different types.

**Positive skewness or right skew**

Outliers at the top end of the range of values cause positive skewness. Extremely high numbers will cause the graph to skew to the right, showing that there are outliers present. The higher numbers slightly raise the mean above the median in this instance, meaning that the mean is higher than the median.

**No skewness or zero skew**

This is the classic case in which skewness is absent. It denotes a distribution that is symmetric around the mean. As a result, the three values, mean, median, and mode, all coincide.

**Negative skewness or left skew**

Outliers near the lower end of the values cause negative skewness. Extremely low numbers will cause the graph to skew to the left, indicating that there are outliers present. In this instance, the mean is smaller than the median because the lower values pull the mean down from the central value.

In probability theory and statistics, a central moment is a moment of a probability distribution of a random variable about the random variable's mean.

- The zeroth central moment is the total probability i.e., equal to one.
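- The first central moment is the expected deviation from the mean and is always equal to zero.
- The second central moment is the variance.
- The third central moment measures skewness (after it is standardised by the cube of the standard deviation).
- The fourth central moment measures kurtosis (after it is standardised by the fourth power of the standard deviation).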
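The list above can be verified numerically. The sketch below computes the first few central moments of a small, hypothetical sample and derives skewness and kurtosis by standardising the third and fourth moments; the helper name `central_moment` is an assumption.

```python
def central_moment(data, k):
    """k-th central moment: average of (x - mean) ** k over the data."""
    m = sum(data) / len(data)
    return sum((x - m) ** k for x in data) / len(data)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample with mean 5

m0 = central_moment(data, 0)  # total probability mass: 1
m1 = central_moment(data, 1)  # always 0
m2 = central_moment(data, 2)  # variance (population form)
m3 = central_moment(data, 3)
m4 = central_moment(data, 4)

skewness = m3 / m2 ** 1.5  # standardised third moment
kurtosis = m4 / m2 ** 2    # standardised fourth moment

print(m0, m1, m2, skewness, kurtosis)
```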

For univariate analysis of a numerical variable, the must-use visualizations are histograms and the box and whisker plot (or box plot). Scatter plots are used to perform bivariate analysis of numerical variables.

**Histograms**

A histogram is a graphic representation of the distribution of data that has been grouped into classes. It is a type of frequency chart that is made up of a number of rectangles. Each piece of data is sorted, then each value is assigned to the proper class interval. The frequency of each class interval is determined by the number of data values that fall within it. A specific class of data is represented by each rectangle in the histogram. The width of the rectangle represents the width of the class. It is commonly used to determine

**Box and whisker plot (box plot)**

A box plot shows the maximum and minimum values, the first and third quartiles, and the median value, which is a measure of central tendency. In addition to these quantities, it also explains the symmetry and variability of the data distribution. Outliers in the dataset are frequently visualised using this visualization.

**Scatter plot**

The scatterplot is a very helpful and effective tool that is frequently used in regression analysis. A pair of observed values for the dependent and independent variables are represented by each point. Before selecting a suitable model, it enables graphically determining whether a relationship between two variables exists. These scatterplots are also very helpful for residual analysis because they let you check whether the model is a good fit or not.

The covariance and the correlation coefficient reveal the direction and the strength of the relationship between two variables.

Covariance is a measure of how two random variables in a data set change jointly. When two variables are positively correlated and move in the same direction, this is referred to as positive covariance. A negative covariance denotes an inverse relationship between the variables, or a movement in opposite directions. For instance, a student's performance on a particular examination improves with increased attendance, which is a positive correlation, whereas a decrease in demand caused by a rise in the price of an item is a negative correlation. When the covariance value is zero, there is no linear relationship between the variables, although this alone does not guarantee that they are independent. If the covariance value is higher than 0, it means that the variables are positively correlated and move in the same direction. The variables are negatively correlated and move in the opposite direction when the covariance has a negative value.

Covariance Value | Effect on Variables |
---|---|
Greater than 0 | Positive Correlation (X & Y variables move together) |
Equal to 0 | No Correlation (no linear relationship between X & Y) |
Less than 0 | Negative Correlation (X & Y variables move in opposite direction) |

Similar information is given by the correlation coefficient and the covariance. The benefit of the correlation coefficient over covariance is that it always takes a value between negative one and one. A perfect positive correlation exists between the variables under study when the correlation coefficient is 1. In other words, as one moves, the other follows suit proportionally in the same direction. A less than perfect positive correlation is present if the correlation coefficient is less than one but still larger than zero. The correlation between the two variables is stronger as the correlation coefficient approaches one. There is no observable linear relationship between the variables when the correlation coefficient is zero. That means it is difficult to predict the movement of the other variable if one variable moves. The variables are perfectly negatively, or inversely, related if the correlation coefficient is negative one. One variable will drop proportionally in response to an increase in the other, so the variables move in opposing directions. If the correlation coefficient lies between negative one and zero, it means that the negative correlation is not perfect. The negative correlation becomes stronger as the coefficient gets closer to negative one.
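To see the difference between the two measures in practice, the sketch below computes the sample covariance and the Pearson correlation coefficient for the attendance and marks figures from the scatter plot table earlier in this article. The helper functions are hand-written for the illustration, and the rescaling step is an assumption added to show that covariance depends on the units while correlation does not.

```python
from math import sqrt

# Attendance and percentage marks of the ten students in the table above.
attendance = [78, 91, 66, 42, 90, 59, 83, 72, 94, 88]
marks = [84, 96, 70, 85, 92, 62, 75, 75, 96, 67]


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


cov_xy = covariance(attendance, marks)
r_xy = correlation(attendance, marks)

# Express attendance as a fraction instead of a percentage.
attendance_scaled = [a / 100 for a in attendance]
cov_scaled = covariance(attendance_scaled, marks)
r_scaled = correlation(attendance_scaled, marks)

print(f"covariance={cov_xy:.2f}  correlation={r_xy:.3f}")
print(f"after rescaling: covariance={cov_scaled:.4f}  correlation={r_scaled:.3f}")
```

Both measures are positive for this data, but only the covariance changes when attendance is expressed in different units.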


A fresh dataset of sample means is produced by the new samples that were collected. There is a certain distribution of these values. The phrase "sampling distribution" is used to describe a distribution made out of samples. We are dealing with a sampling distribution of the mean in this instance. These values are distinct when we look at them closely, but they are centred on one particular value.

Every sample mean in this analysis approximates the population mean. The value they centre on may provide a very accurate indication of the population mean. In fact, we anticipate getting a pretty accurate approximation of the population mean if we take the average of those sample means. We see a normal distribution when we visualise the distribution of the sample means, and the Central Limit Theorem confirms that. Provided the sample size is sufficiently large, the sampling distribution of the mean will resemble a normal distribution regardless of the underlying population distribution, whether it be binomial, exponential, or another type.

As a result, even when the population is not normally distributed, we can still conduct tests, work through issues, and draw conclusions using the normal distribution according to the central limit theorem.
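A small simulation can make this tangible. The sketch below repeatedly draws samples from an exponential population (which is strongly skewed) and looks at the distribution of the sample means. The seed, sample size, and number of samples are arbitrary assumptions.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)  # fixed seed so the run is repeatable

sample_size = 50  # observations per sample
n_samples = 2000  # number of samples drawn

# Exponential population with rate 1: mean 1 and standard deviation 1.
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(sample_size)])
    for _ in range(n_samples)
]

print("average of sample means:", round(mean(sample_means), 3))
print("spread of sample means: ", round(stdev(sample_means), 3))
print("theoretical spread:     ", round(1 / sample_size ** 0.5, 3))
```

The sample means cluster around the population mean of 1, and their spread is close to the population standard deviation divided by the square root of the sample size, as the theorem suggests.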

There are various performance measures or metrics that can help to evaluate the performance of a classification model. However, the right choice depends on the kind of problem we are dealing with. At times, accuracy might not be a good measure for evaluation, and we need to focus on certain aspects of the results rather than the accuracy as a whole. The most common metrics used for the purpose are –

**Confusion matrix**

A confusion matrix is one of the evaluation techniques for machine learning models in which you compare the results of all the predicted and actual values. Confusion matrix helps us to derive several different metrics for evaluation purpose such as accuracy, precision, recall, and F1 score which are widely used across different classification use cases.

**ROC AUC curve**

The Receiver Operator Characteristic (ROC) is a probability curve that separates the signal from the noise by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at different threshold values. A classifier's capacity to distinguish between classes is measured by the Area Under the Curve (AUC). A higher AUC indicates better performance of the model across the various thresholds between positive and negative classes. The classifier can correctly discriminate between all Positive and Negative class points when AUC is equal to 1. The classifier would be predicting all negatives as positives and vice versa when AUC is equal to 0.
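One way to build intuition for AUC is through its ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes AUC this way by comparing every positive-negative pair. The function name and the example scores are assumptions, and the pairwise approach is only practical for small datasets.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
perfect = roc_auc(labels, [0.1, 0.2, 0.8, 0.9])   # positives always score higher
inverted = roc_auc(labels, [0.9, 0.8, 0.2, 0.1])  # positives always score lower
mixed = roc_auc(labels, [0.1, 0.4, 0.35, 0.8])    # one pair ranked wrongly

print(perfect, inverted, mixed)
```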

**Jaccard index**

The Jaccard index is also known as the Jaccard similarity coefficient. If y is the set of actual labels and ŷ is the set of predicted labels, then the Jaccard index is defined as the size of the intersection divided by the size of the union of the two labelled sets.

Suppose you have a total of 50 observations, out of which your model predicts 41 correctly. The Jaccard index is then 41 / (50 + 50 - 41) = 0.69, which indicates a 69% overlap between the predicted and the actual label sets. The Jaccard index ranges from 0 to 1, where a value of 1 implies that the predictions match the actual labels exactly.
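The arithmetic in this example can be reproduced directly. The sketch below uses two hypothetical helpers: one that works from counts, as in the example above, and one that works on Python sets to show the intersection-over-union definition.

```python
def jaccard_from_counts(n_actual, n_predicted, n_matching):
    """Intersection over union, using |A ∪ B| = |A| + |B| - |A ∩ B|."""
    return n_matching / (n_actual + n_predicted - n_matching)


def jaccard_from_sets(a, b):
    """Intersection over union for two sets."""
    return len(a & b) / len(a | b)


# 50 actual labels, 50 predicted labels, 41 of them matching.
jaccard = jaccard_from_counts(50, 50, 41)
print(round(jaccard, 2))

# Small set-based illustration with made-up items.
print(jaccard_from_sets({1, 2, 3}, {2, 3, 4}))
```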

**Log loss**

Log loss, or logarithmic loss, measures the performance of a classifier whose predicted output is a probability value between 0 and 1. The log loss equation measures how far each predicted probability is from the actual label. An ideal classifier has a log loss of 0, so among classifiers the one with the lower log loss makes the better predictions.
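For a binary classifier, log loss is commonly computed as the negative average of y·log(p) + (1 − y)·log(1 − p), where y is the actual label and p is the predicted probability of the positive class. The sketch below implements that form; the function name, the clipping constant `eps`, and the example probabilities are assumptions.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


y_true = [1, 0, 1, 0]
confident_right = log_loss(y_true, [0.9, 0.1, 0.9, 0.1])
uncertain = log_loss(y_true, [0.5, 0.5, 0.5, 0.5])
confident_wrong = log_loss(y_true, [0.1, 0.9, 0.1, 0.9])

print(round(confident_right, 4), round(uncertain, 4), round(confident_wrong, 4))
```

Confident and correct predictions give a loss close to zero, while confident but wrong predictions are penalised heavily.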

A confusion matrix is one of the evaluation methods for machine learning models that compares the outcomes of all the predicted and actual values.

The figure representing confusion matrix has four different cases:

- There are five instances where the predicted value and the actual value are both true. This is referred to as a True Positive case, where True denotes that the values are identical (true and true) and Positive denotes that the predicted value is positive. Example: A diabetes test is positive for a diabetic patient.
- There are four instances where both the predicted value and the actual value are false. This is referred to as a True Negative case, where True denotes that the values are identical (false and false) and Negative denotes that the predicted value is negative. Example: A diabetes test is negative for a non-diabetic patient.
- In three instances, the predicted value is true, but the actual value is false. This is referred to as a False Positive case, where False denotes that the values are different (true and false) and Positive denotes that the predicted value is positive. Example: A diabetes test is positive for a non-diabetic patient.
- There are two instances where the predicted value is false, whereas the actual value is true. This is referred to as a False Negative case, where False denotes that the values are different (false and true) and Negative denotes that the predicted value is negative. Example: A diabetes test is negative for a diabetic patient.

In this matrix, the values in green are correctly identified by the model and the values in red are wrongly identified by the model. Confusion matrix can also be used for non-binary target variables.
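The four cases can be counted directly from paired labels. In the sketch below, the label lists are constructed by hand (an assumption, since the original figure is not reproduced here) so that the counts match the five, four, three, and two instances described above.

```python
def confusion_counts(y_true, y_pred):
    """Count the four confusion-matrix cells for binary labels (1 = positive)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            counts["TP"] += 1
        elif predicted == 0 and actual == 0:
            counts["TN"] += 1
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts


# Hand-built labels: 5 TP, 4 TN, 3 FP, 2 FN
y_true = [1] * 5 + [0] * 4 + [0] * 3 + [1] * 2
y_pred = [1] * 5 + [0] * 4 + [1] * 3 + [0] * 2

cells = confusion_counts(y_true, y_pred)
accuracy = (cells["TP"] + cells["TN"]) / len(y_true)
print(cells)               # {'TP': 5, 'TN': 4, 'FP': 3, 'FN': 2}
print(round(accuracy, 3))  # 0.643
```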

A Type 1 Error, also known as a False Positive, occurs when the predicted value is positive but the actual value is negative. A Type 2 Error, also known as a False Negative, occurs when the predicted value is negative but the actual value is positive. For instance, if we consider rain to be a positive event, then your device predicting rain on a day when it does not rain is a Type 1 Error, while your device predicting no rain on a day when it does rain is a Type 2 Error.

The F-Measure, also known as the F1-Score or the F-Score, is a numerical value that evaluates how accurate a test is. In a perfect scenario, both the precision and recall values would be high. In practice, however, there is usually a trade-off between recall and precision, and improving one tends to lower the other. The F1 score combines the precision and recall measures into a single metric, which makes it convenient for comparing two models. In terms of the formula, the F1 score is the harmonic mean of precision and recall:

**F1 = 2 × (Precision × Recall) / (Precision + Recall)**

The value of F1 score ranges between 0 and 1. An F1 score of 1 is regarded as ideal, whereas a score of 0 indicates that the model is a complete failure.
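As a quick check of the harmonic-mean definition, the sketch below computes the F1 score from raw counts. The function name is hypothetical, and the counts (5 true positives, 3 false positives, 2 false negatives) are borrowed from the confusion-matrix example for illustration.

```python
def f1_score(tp, fp, fn):
    """F1 score as the harmonic mean of precision and recall."""
    if tp == 0:
        # With no true positives, precision and recall are zero or undefined;
        # this sketch returns 0.0 in that case.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(tp=5, fp=3, fn=2), 3))  # 0.667
```

Here precision is 5 / 8 = 0.625 and recall is 5 / 7 ≈ 0.714, so the F1 score sits between the two, closer to the smaller value.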

The sum of the squared differences between the predicted values and the dependent variable's mean is known as the sum of squares due to regression (SSR). It describes how well our regression line fits the data. If this value is the same as the total sum of squares (SST), our regression model perfectly captures the observed variability.

The sum of the squared differences between the actual values and the predicted values is known as the Sum of Squared Error (SSE). Usually, we wish to reduce this error. The regression's estimating power increases as the error decreases.

The overall variability of the data set provided by SST is equal to the sum of the variability described by the regression line, or SSR, and the unexplained variability, or SSE.
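The decomposition SST = SSR + SSE can be verified numerically for a simple least-squares line with an intercept. The sketch below fits the line by hand; the function names and the five data points are assumptions chosen for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sums_of_squares(x, y):
    """Return (SST, SSR, SSE) for the fitted least-squares line."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    predicted = [intercept + slope * xi for xi in x]
    sst = sum((yi - mean_y) ** 2 for yi in y)
    ssr = sum((pi - mean_y) ** 2 for pi in predicted)
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    return sst, ssr, sse


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
sst, ssr, sse = sums_of_squares(x, y)
print(round(sst, 2), round(ssr, 2), round(sse, 2))  # 6.0 3.6 2.4
```

The ratio SSR / SST (0.6 here) is the familiar R-squared value.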

One common way to identify outliers is through the inner limits, or whiskers, derived from the interquartile range. The interquartile range covers the middle 50% of the observations, and the median is located within this range. When there are extreme minimum or maximum values, we can define the cut-off values that determine the outliers using the formulas:

- Lower limit = First quartile - (1.5 x interquartile range)
- Upper limit = Third quartile + (1.5 x interquartile range)

The most extreme observations that still lie within these two limits are known as adjacent points. Any observations outside the interval between the lower limit and the upper limit can be termed outliers in the dataset.
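A minimal sketch of this rule using Python's `statistics` module is shown below, taking the lower limit as Q1 - 1.5 x IQR and the upper limit as Q3 + 1.5 x IQR. Quartile conventions differ slightly between tools, so the exact limits may vary; the function names and data are assumptions for illustration.

```python
import statistics


def iqr_limits(data):
    """Lower and upper outlier limits based on 1.5 times the interquartile range."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data):
    """Return the observations that fall outside the IQR-based limits."""
    lower, upper = iqr_limits(data)
    return [x for x in data if x < lower or x > upper]


values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(find_outliers(values))  # [102]
```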

Precision is the ratio of the correctly identified positive classes to the sum of the predicted positive classes. The predicted positive classes are the ones which are predicted positive irrespective of the actual value being positive or negative, that is, the true positive and false positive classes. This ratio tells us, out of all the classes we have predicted as positive, how many are actually positive.

**Precision = TP / (TP + FP)**

Recall is the ratio of the correctly identified positive classes to the sum of the actual positive classes. The actual positive classes can be predicted as positive or negative, that is, true positive and false negative. This ratio tells us, out of all the actual positive classes, how many we predicted correctly.

**Recall = TP / (TP + FN)**

The OLS assumptions for a linear regression can be grouped into five assumptions:

**Linearity**

The regression assumes that the relationship between the predictors and the target is linear. If the true relationship involves higher-degree terms, a plain linear regression will not produce good predictions.

**No endogeneity**

The issue of endogeneity arises when a variable that is related to both the target and the predictors is not included in the model. Endogeneity is therefore a situation in which a predictor in a linear regression model is correlated with the error term. We call such predictors endogenous variables.

**Normality and homoscedasticity**

This assumes that the error term is normally distributed, and the expected value of error is 0, meaning that we expect to have no error on average. Homoscedasticity assumes that the variance is constant for the error term.

**No autocorrelation**

This assumes that the covariance between any two error terms is zero.

**No multicollinearity**

When two or more variables in our regression are strongly correlated, this situation is referred to as multicollinearity. The OLS assumptions assume that there are no strongly correlated variables in our analysis.

One can standardise any distribution. The process of standardisation involves transforming a variable into one with a mean of zero and a standard deviation of one.

Standardisation is also possible for normal distributions, and the result is known as a standard normal distribution. A standard normal variable is represented by the letter Z, and the standardised value is referred to as the Z-score. To standardise a variable, we first determine its mean and standard deviation. The mean is then subtracted from each observed value of the variable, and the result is divided by the standard deviation.
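The standardisation steps described above can be sketched in a few lines. The function name and data are assumptions, and the population standard deviation is used here; a sample standard deviation would be an equally reasonable choice depending on context.

```python
import statistics


def standardise(values):
    """Convert values to Z-scores: subtract the mean, divide by the std. dev."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mean) / std for v in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
print(standardise(data))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After the transformation, the values have a mean of zero and a standard deviation of one.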

Skewed data refers to data whose distribution is asymmetric, often because it contains outliers in one tail. Outliers are known to have a negative influence on the model’s predictions and may therefore need to be handled. However, it is not always advisable to remove the outliers, so we often handle them through certain transformations instead. The common transformations applied to the data are the log transformation, the square root transformation, and the Box-Cox transformation.

**Log transformation**

The logarithmic transformation is one of the most helpful and popular transformations. In fact, it might be a good idea to use the dependent variable's logarithm as a replacement before doing a linear regression. Such an operation can stabilise the target variable's variance and bring the transformed variable's distribution closer to normal.

**Square root transformation**

If there are any outlier values that are exceptionally large, you might consider using the square root transformation. The transformation can help scale them down to a much lower value in comparison. A limitation of this transformation is that the square root of a negative number is not a real number.

**Box-Cox transformation**

The Box-Cox transformation is yet another method that can help to transform skewed data towards a normal distribution. It has a controlling parameter, lambda, which is typically searched over values between -5 and 5. Initially, it was only used in the presence of positive values, but modifications have been made to the transformation to take care of negative values as well.
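The three transformations can be compared side by side with a small sketch. The Box-Cox form used here is the standard one-parameter version, (x^λ - 1) / λ for λ ≠ 0 and log(x) for λ = 0, applied to strictly positive values; choosing λ by maximum likelihood is left out, and the function names and data are assumptions.

```python
import math


def log_transform(values):
    """Natural log of each (strictly positive) value."""
    return [math.log(v) for v in values]


def sqrt_transform(values):
    """Square root of each (non-negative) value."""
    return [math.sqrt(v) for v in values]


def box_cox(values, lam):
    """One-parameter Box-Cox transform for strictly positive values."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


skewed = [1, 2, 3, 4, 100]  # one exceptionally large value
print([round(v, 2) for v in log_transform(skewed)])
print([round(v, 2) for v in sqrt_transform(skewed)])
print([round(v, 2) for v in box_cox(skewed, 0.5)])
```

In each case the gap between the largest value and the rest shrinks considerably compared with the raw data.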

A one-sided or one-tailed test on a population parameter is a hypothesis test in which the values for which we can reject the null hypothesis (H0) are located exclusively in one tail of the probability distribution. For instance, if the null hypothesis is "The mean height of men in India is equal to or less than 5 feet 6 inches," then the alternative hypothesis would be "The mean height of men in India is higher than 5 feet 6 inches." This is a one-sided test because the alternative hypothesis, i.e., higher than 5 feet 6 inches, only considers one end of the distribution.

A two-sided test on a population parameter is a hypothesis test that compares an estimate of the parameter to a given value, against the alternative hypothesis that the parameter is not equal to the stated value. If the null hypothesis is, for instance, "The mean height of men in India is equal to 5 feet 6 inches," then the alternative hypothesis would be, "The mean height of men in India is either less than or greater than 5 feet 6 inches." The alternative hypothesis, greater than or less than 5 feet 6 inches, deals with both extremes of the distribution, making this a two-tailed test.

Cross-validation is a resampling technique that uses several data subsets to train and evaluate a model across a number of iterations. It is typically applied in situations where the objective is prediction, and one wishes to evaluate how well a predictive model will function in real-world situations. Due to sampling variability between the training and test sets, a model may give good predictions on training data but fail to generalise to test data, which leads to a low training error rate and a high test error rate. In addition, when we split the dataset into training, validation and test sets, the model is trained on only a subset of the data. To overcome these issues, we can adopt various cross-validation approaches, namely K-fold cross-validation, stratified K-fold cross-validation, leave-one-out cross-validation, stratified shuffle split, etc.

**Leave One Out Cross Validation (LOOCV)**

A dataset with n observations is divided into n-1 observations as the training data and 1 observation as the test data. The process is repeated for each data point, so the model is trained n times and the execution is expensive. Also, when the held-out test observation is an outlier, the variability in MSE is much higher.

**K-Fold Cross Validation**

This approach randomly divides the data into k groups, or folds, of roughly equal size. One fold is kept for testing and the model is trained on the remaining k-1 folds. The process is repeated k times, and each time a different fold is used for validation. Typically, k is 5 or 10. LOOCV is a variant of k-fold where k = n. Although k-fold is less computationally expensive than LOOCV, it still requires training the model k times.
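The splitting logic can be sketched without any library. The generator below is a hypothetical helper that shuffles the row indices, deals them into k folds, and yields one train/test split per fold; setting k equal to the number of samples gives the leave-one-out case.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment to folds
    # Deal the shuffled indices into k folds of (nearly) equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


for train_idx, test_idx in k_fold_indices(n_samples=10, k=5):
    print(len(train_idx), len(test_idx))  # 8 2 on every iteration
```

Each observation appears in exactly one test fold, so every data point is used for validation once and for training k-1 times.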

**Stratified K-fold Cross Validation**

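Each fold in the dataset preserves approximately the same proportion of each class as the full dataset, so every fold contains instances of each class. This approach ensures that one class of data is not over-represented in any fold, especially when the target variable is unbalanced.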

Non-parametric models do not assume a fixed functional form for the data. Instead, they are free to learn the structure from the data, controlled by some hyperparameters. These models often have low bias and high variance, which makes them prone to overfitting. Decision trees are an example of non-parametric models.

Bayes' theorem provides a formula for the probability that an event occurred under a particular condition, given that we have observed some related outcome. So, it is possible to think of Bayes' theorem as a formula for the conditional probability of an event. For instance, let us consider there are 10 bags containing different coloured marbles. Bayes’ theorem helps to determine the probability that a marble was drawn from a particular bag, given the condition that the marble is red in colour. If A is the event of drawing the marble from a particular bag and B is the event of drawing a red marble, then the formula for Bayes’ theorem is given by:

**P(A|B) = P(B|A) ∙ P(A) / P(B)**

Where P(A|B) is the probability of event A occurring given that event B has already occurred.
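To make the marble example concrete, the sketch below uses two bags rather than ten to keep the numbers small. The bag contents and the equal chance of picking either bag are made-up assumptions for illustration.

```python
def bayes_posterior(prior, likelihood, evidence):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


# Hypothetical setup: each bag is chosen with probability 0.5.
# Bag 1 holds 3 red marbles out of 10, bag 2 holds 6 red out of 10.
p_bag1 = 0.5
p_red_given_bag1 = 3 / 10
p_red_given_bag2 = 6 / 10

# Total probability of drawing a red marble, P(B)
p_red = p_bag1 * p_red_given_bag1 + (1 - p_bag1) * p_red_given_bag2

# Probability that the marble came from bag 1, given that it is red
print(round(bayes_posterior(p_bag1, p_red_given_bag1, p_red), 3))  # 0.333
```

Observing a red marble lowers the probability of bag 1 from 0.5 to about 0.33, because red marbles are twice as common in bag 2.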

The following steps are typically used when testing hypotheses about a sample:

- Formulate the null hypothesis (H0) and alternate hypothesis (H1).
- Determine the significance level, alpha, for the hypothesis test.
- Decide between a one-tailed and a two-tailed test based on the alternate hypothesis.
- Compute the test statistic from the sample and the critical value for the chosen significance level.
- Determine the acceptance and rejection regions in accordance with the critical value.
- Compare the test statistic with the critical value: if it falls in the acceptance region, fail to reject the null hypothesis; otherwise reject it.
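The steps above can be sketched for one specific case: a z-test on a mean with a known population standard deviation. The function name, the choice of a z-test, and the height figures (in inches) are all assumptions for illustration; other tests follow the same pattern with a different statistic and distribution.

```python
import math
from statistics import NormalDist


def z_test(sample_mean, hypothesised_mean, sigma, n, alpha=0.05, two_tailed=True):
    """Return (z statistic, critical value, reject H0?) for a z-test on a mean.

    The one-tailed branch tests the upper tail only (H1: mean > hypothesised).
    """
    z = (sample_mean - hypothesised_mean) / (sigma / math.sqrt(n))
    tail_area = alpha / 2 if two_tailed else alpha
    critical = NormalDist().inv_cdf(1 - tail_area)
    statistic = abs(z) if two_tailed else z
    return z, critical, statistic > critical


# H0: mean height is 66 inches; hypothetical sample of 36 men averaging 67 inches,
# with an assumed known population standard deviation of 3 inches
z, critical, reject = z_test(sample_mean=67, hypothesised_mean=66, sigma=3, n=36)
print(round(z, 2), round(critical, 2), reject)  # 2.0 1.96 True
```

Because the statistic of 2.0 lies beyond the two-tailed critical value of about 1.96, the null hypothesis is rejected at the 5% level in this made-up example.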

Time-series data is data that is collected at different points in time. Autocorrelation, seasonality, and stationarity are three main characteristics of a time series.

**Autocorrelation**

Autocorrelation refers to the similarity between observations as a function of the time lag between them. A peak in the autocorrelation plot at a particular lag indicates the period after which the series tends to repeat itself.

**Seasonality**

Seasonality refers to periodic fluctuations. Period can give the length of the season. For instance, the amount of electricity consumed varies greatly from summer to winter, and online sales peak around Diwali before dipping again.
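Autocorrelation and seasonality are closely linked: a seasonal series correlates strongly with itself at a lag equal to the season length. The sketch below computes the sample autocorrelation at a given lag for a made-up series that repeats every four steps; the function name and data are assumptions.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator


# Hypothetical seasonal series with a period of 4 time steps
seasonal = [10, 20, 30, 20] * 6

print(round(autocorrelation(seasonal, 4), 2))  # 0.83, strong at the period
print(round(autocorrelation(seasonal, 2), 2))  # -0.92, half a period out of phase
```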

**Stationarity**

Stationarity means that the statistical properties of the series do not change over time, that is, the mean and variance are constant and the covariance is independent of time. For example, stock prices are not a stationary process. For modelling, we would prefer to have a stationary time series. When a series is not stationary, there are transformations, such as differencing, that we can apply to make it stationary.

The Kaplan-Meier curve is a graphical representation of the estimated survival function. It displays the probability that a subject will survive beyond time t, and the curve is formed by plotting the estimated survival function against time.
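The estimate behind the curve can be sketched as a running product: at each time where an event occurs, the survival probability is multiplied by the fraction of at-risk subjects who did not experience the event. The function name and toy data below are assumptions; an `event_observed` value of 0 marks a censored subject.

```python
def kaplan_meier(durations, event_observed):
    """Return [(time, survival probability)] at each observed event time."""
    data = sorted(zip(durations, event_observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Group all subjects whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            events += data[i][1]
            leaving += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored subjects leave the risk set
    return curve


# Four subjects, all with an observed event (no censoring)
print(kaplan_meier([1, 2, 3, 4], [1, 1, 1, 1]))

# Subject with duration 2 is censored: no drop at t=2, but a smaller risk set after
print(kaplan_meier([1, 2, 3], [1, 0, 1]))
```

In the first call the survival probability steps down through 0.75, 0.5, 0.25 and 0. In the second, the censored subject causes no step at time 2 but still reduces the number at risk for the later event.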

- In order to aid hiring managers in determining a candidate's competency and expertise, the interview may include several questions covering a wide range of topics and problems as discussed in this interview questions and answers series.
- Preparing for the various questions with the help of this guide, which also contains basic statistics interview questions for freshers looking for a role as data analysts, will improve your overall knowledge.
- It is highly recommended to start by learning probability, followed by descriptive statistics, and then inferential statistics for a smooth learning journey.
- Statistics is a field that requires hands-on practice with the concepts you learn and an understanding of their implications in real life.
- Your profile should be able to demonstrate your expertise with data by mentioning in your portfolio certain analytical reports or findings that you have created.
- Always include the correct terminologies and equations wherever possible to support your knowledge and understanding of the subject. You can join the Data Science Bootcamp training to prepare for your data science statistics interviews.

Preparing yourself through mock interviews is always a great place to start. Apart from this:

- You can prepare a short note listing all the key terminologies related to statistics.
- For quick reference, you can always return to this article to prepare for statistics interview questions for data science.
- For further reading, you can go through the books “Statistics without Tears” by Derek Rowntree and “The Concise Encyclopaedia of Statistics” by Yadolah Dodge to gain additional in-depth knowledge of the subject.
- KnowledgeHut’s Data Science course catalogue contains 20+ courses on data science specially curated for your requirements. You can sign up for a data science course to further enhance your skills to crack the interviews of top companies in India.

- In interviews for jobs in statistics, most interviewers want to know how you approached a problem. This demonstrates to them your competence in working with data and its implications, as well as your ability for problem-solving, analytical and logical thinking.
- A statistician might also be required to fulfil the role of Data Analyst. For this purpose, we have mentioned statistics interview questions for data analysts as well.
- A good understanding of core concepts of statistics is expected from an individual.
- Apart from core concepts, you can also be asked to determine probabilities, test hypotheses, etc.
- To help with your interviews for Data Science and Machine Learning profile, we have also included some statistics interview questions for data science as well as machine learning.

The job roles that you can look for after completing this course include:

- Data Scientist
- Data Analyst
- Statistician
- Business Analyst
- Machine Learning Engineer

If you are able to crack these statistics questions for data science interview then you can expect to join the top data science companies including:

- Findability Sciences
- LatentView Analytics
- Fractal Analytics
- Tiger Analytics
- Convergytics Solutions
- eClerx Services
- Wipro
- Tata Insights and Quants

In today's age of computing and huge data handling, statistics is a very interesting field that has a significant impact. Many businesses are pouring billions of dollars into analytics and statistics to leverage their existing data. This opens the door to numerous jobs in this industry. These Probability and Statistics interview questions will help you brush up on the fundamentals of Statistics as you get ready for employment involving data science and machine learning. With our online Bootcamps, you can learn how to manage enormous data sets and get ready to accept lucrative job offers. Our Data Science certification courses will help you become knowledgeable in both fundamental and advanced subjects. To start or advance a successful data career, develop skills in a variety of programming languages and technologies, such as Python, R, MongoDB, TensorFlow, Keras, Tableau, Hadoop, Spark, and more. You should also gain expertise in data manipulation, visualisation, predictive analytics, data science, machine learning, and AI. With 400,000+ professionals trained by 650+ expert trainers in 100+ countries, this is the right choice to achieve high growth in your career and journey as a data scientist or ML engineer.
