This article contains statistics interview questions for freshers and experienced candidates to kick-start your interview preparation. The basic section covers probability and statistics questions, along with data cleaning and visualisation topics for data analyst profiles. The intermediate and advanced sections cover statistics interview questions for data science and machine learning. Statistics is a foundational skill for data analysts, data scientists, and machine learning professionals, so this guide also covers interview questions on statistics for data science and data analytics.
The study of statistics focuses on gathering, organising, analysing, interpreting, and presenting data.
A data analyst works within a particular business vertical, while a statistician is responsible for working with data irrespective of the industry vertical.
Statistics is used to perform exploratory data analysis: understanding the data, the relationships within it, and its distribution, as well as making predictions based on those relationships.
Statistics forms the base of machine learning. The predictive modelling in machine learning takes into account the concepts from inferential statistics.
No, it is highly advised to know the basics of statistics before jumping to data science. Statistics is the key to understanding data and therefore, one must be aware of descriptive statistics to learn data science.
The study of data analysis, visualisation, presentation, and interpretation is a key component of statistics. Descriptive statistics and inferential statistics are the two subcategories of statistics. In descriptive statistics, we primarily use numerical metrics, graphs, plots, tables, etc. to organise and summarise data. For instance, a bar graph with specific numbers can be used to summarise the sales results for the financial year. Inferential statistics uses sample data to estimate or infer characteristics of the population. For instance, using past data analysis to forecast the anticipated sales statistics for the upcoming few quarters.
Analysis is the process of examining smaller subsets of a larger dataset that has already been collected. We conduct analysis to understand how and/or why something happened. Analytics, on the other hand, typically looks to the future rather than interpreting the past. Analytics is the application of logical and computational reasoning to the individual pieces obtained from an analysis, in order to look for patterns and explore what we can do with them in the future.
Data, in any form, is a fact that must be processed in order to give it meaning, because it is raw, unprocessed, and disorganised. Data can take the form of a number, picture, word, graph, etc. Information is processed data formed by manipulating raw data; it has context, meaning, and purpose. Data can be thought of as the raw material for producing information. For instance, data may be the volume of sales occurring at a store, and the information deduced from this data could be the average sales.
Any statistical analysis you conduct begins by determining whether you are working with a population or a sample of data. A population, which is typically represented by an uppercase 'N,' is the totality of the things relevant to our study. A sample, represented by a lowercase 'n,' is a subset of the population. Let's think about a voting example. Who wins the nomination is determined by the final vote results. The population is the basis for these findings: the people who show up to vote make up the population in this case. However, before the results are released, numerous parties conduct surveys to predict the winner. Such a survey is conducted on a small sample, say, 20% of the population. This portion of the population is referred to as a sample. The field of statistics deals with sample data.
Different types of variables require different types of statistical and visualization approaches. Based on the type of variable, it is further divided into numerical and categorical data.
Numerical data represents numbers or figures, for example, sales amount, height of students, salary, etc. A numerical variable is further divided into two subsets, discrete and continuous. The number of students in a class or the results of a test are two examples of discrete data that can typically be counted in a finite way. Continuous variable data cannot be counted since it is infinite. For instance, continuous variables such as a person's weight or a region's area can vary by arbitrarily small quantities.
Categorical data represents categories, including things like gender, email type, colour, and more. Categorical data is further divided into nominal and ordinal variables. Ordinal categorical variables can be displayed in a certain order, such as when a product is rated as either awful, satisfactory, good, or excellent. Nominal variables can never be arranged in a hierarchy. For instance, a person's gender.
A bar chart uses rectangular vertical and horizontal bars to statistically represent the given data. Each bar's length is proportionate to the value it corresponds to. The values among various categories are compared using bar charts. With the use of two axes, bar charts illustrate the relationship: one axis depicts the discrete values while the other represents the categories. There are a number of different bar charts available for visualizing data, but the four major categories into which we can distinguish them are vertical, horizontal, stacked, and grouped bar charts.
Vertical bar chart
The most popular type of bar chart is the vertical bar chart, in which the given data is displayed using vertical bars. These vertical rectangular bars represent the measure of the data. The categories are listed along the x-axis, and the height of each bar against the y-axis represents the value for that category.
Horizontal bar chart
Charts that show the given data as horizontal bars are referred to as horizontal bar charts. These horizontal rectangular bars display the measures of the provided data. In this style, the data categories are labelled on the y-axis, and the length of each bar along the x-axis represents its value.
Stacked bar chart
In a stacked bar chart, each bar of a normal bar chart is divided into sub-bars, one for each level of a second categorical variable, stacked on top of one another. A 100% stacked bar chart shows each sub-bar as the percentage it contributes to the bar's total, in contrast to a regular stacked bar chart, which depicts the given values directly.
Grouped bar chart
A grouped bar chart makes it easier to compare data from multiple categories. For levels of a single categorical variable, bars are grouped by position, with colour often designating the secondary category level within each group.
Scatter plot is a very important graph when it comes to understanding the relationship between two numerical variables. For example, consider the following table which provides the percentage marks scored and total attendance of ten students of a class.
The percentage of students in attendance is represented on the x-axis, while the percentage of marks scored is represented on the y-axis. The scatter plot could therefore help us comprehend the relationship between the two variables. We may argue that when students attend class more frequently, they tend to perform better academically. We can also spot instances that are the exception rather than the rule, like Student 4.
A frequency distribution is a series in which observations with similar or closely related values are placed in separate groups, each group ordered by magnitude. The data is simply organised into classes in a table, and the number of cases that fall into each class is noted. It displays the frequency with which various values of a single phenomenon occur. A frequency distribution is created in order to estimate frequencies of the unknown population distribution from the distribution of sample data.
Take a survey of 50 households in a society as an example. The number of children in each family was recorded, and the results are shown in the following frequency distribution table
| No. of children | Frequency |
As a result, frequency in the table refers to how frequently an observation occurs. The number of observations is always equal to the sum of the frequencies. We can evaluate the data's underlying distribution and base judgements on it with the aid of frequency distribution.
The three measures of central tendency are mean, median, and mode.
Mean, also known as the simple average, is denoted by the Greek letter µ for a population and x̄ (x-bar) for a sample. By adding up each observation of a dataset and then dividing the result by the total number of observations, we may determine the dataset's mean. This is the most common measure of central tendency.
The median of an ordered set of data is its middle number. As a result, it divides the data into two halves: the higher and lower halves. The median of the first nine natural numbers, for instance, is five.
Mode is the value that occurs most often. Although it can be applied to both numerical and categorical data, categorical data are typically preferred. For instance, if 60% of the observations for a gender variable are male, then male will be the mode value, signifying the value of maximum occurrence.
The dataset's midpoint can be estimated using both the mean and the median. Depending on the type of data, the mean or the median may be a better choice for describing the dataset's midpoint. When the data is equally distributed (symmetrical) and follows a distribution that is close to normal, the mean is typically used. If the data is skewed, which indicates the presence of outliers, it is preferable to use the median to identify the central value. Let's take nine data scientists as an example, whose salaries (in LPA) are 12, 14, 9, 10.5, 17, 11, 8, 14, and 65. The mean salary is about 17.8, while the median salary is 12. The extreme figure of 65 LPA, which can be viewed as an anomaly because the hired individual may be from a prestigious university or located on-site, pulls the mean upward. The median, however, is unaffected. We can infer that the median in this instance represents the central value more accurately than the mean.
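This effect is easy to check in Python's standard library; the salary figures below are the hypothetical values from the example above:

```python
from statistics import mean, median

# Hypothetical salaries (in LPA); the 65 LPA entry is an outlier.
salaries = [12, 14, 9, 10.5, 17, 11, 8, 14, 65]

print(mean(salaries))    # pulled upward by the outlier
print(median(salaries))  # robust to the outlier
```

The mean lands well above the median, which is why the median is the safer centre for skewed data.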
What is a categorical variable's mode when several values share the maximum number of occurrences?
The most frequent value in a data set is referred to as the mode. A set of data may have one mode, multiple modes, or none at all. Multimodal refers to a set of numbers with more than one mode. Bimodal data is defined as having two modes, which means that two values equal the dataset's maximum occurrence. Similar to this, a group of numbers with three modes is referred to as trimodal. Datasets without repeated values, on the other hand, would indicate that there is no mode in the data.
We initially arrange the collection of numbers in ascending order before calculating the median value of the data. The observation is then located in the middle of this sorted list. The element at position (n+1)/2, where 'n' is the total number of observations, will be the median for an odd number of observations. If the total number of observations is even, however, the median will be the simple average of the middle two elements at positions n/2 and (n/2)+1.
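A minimal sketch of this positional rule (the sample lists are hypothetical):

```python
def median_value(values):
    """Median via the positional rule: sort, then pick the middle element(s)."""
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]          # 1-indexed position (n+1)/2
    return (s[n // 2 - 1] + s[n // 2]) / 2  # average of positions n/2 and n/2 + 1

print(median_value([3, 1, 2]))     # odd count: middle element
print(median_value([4, 1, 3, 2]))  # even count: average of the middle two
```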
The quantiles are values used to segment the distribution so that a specific percentage of data falls below each quantile. The median, for instance, is a quantile: it can be referred to as the 50th quantile, the point where half the observations are greater than or equal to it and half are less than or equal to it. Similarly, the 25th and 75th quantiles mark the points below which 25% and 75% of the observations fall, respectively. If we consider a dataset of the first hundred natural numbers, the 25th, 50th, and 75th quantiles will be approximately 25, 50, and 75. When the distribution is divided into four equal parts in this way, the cut points are referred to as quartiles.
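This can be checked with the standard-library `statistics.quantiles`; note that the exact cut points depend on the interpolation method (the default 'exclusive' method gives values slightly off the round figures):

```python
from statistics import quantiles

data = list(range(1, 101))         # the first hundred natural numbers
q1, q2, q3 = quantiles(data, n=4)  # n=4 yields the three quartile cut points
print(q1, q2, q3)                  # close to 25, 50, and 75
```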
The examination of one, two, and more than two variables is referred to as univariate, bivariate, and multivariate analysis respectively. Since only one variable is involved, univariate analysis can take the form of a frequency distribution table or the computation of minimum, maximum, average value, etc. Univariate analysis, for instance, includes salary analysis of employees inside an organisation. In many situations, the simultaneous study of two variables becomes necessary. For instance, we may wish to relate a group of people's income and spending patterns, or their attendance and grades. Examples of bivariate analysis include scatter plots and bivariate frequency distribution charts. Multivariate analysis is used when more than two variables are observed at once.
The central tendency measure helps identify the distribution's centre, but it does not show how the items are distributed on either side of the centre. Dispersion is the term used to describe this property of a frequency distribution. The items in a series are not all equal. The values vary or differ from one another. Different measurements of dispersion are used to assess the level of variance. Large dispersion suggests less uniformity, while small dispersion indicates good homogeneity of the observations.
The most significant measures of dispersion for a single variable are the standard deviation and coefficient of variation, which are frequently employed in statistical formulas.
In statistics, a distribution is a function that displays the range of potential values for a variable along with their frequency. The probability for each individual observation in the sample space can be determined using a parameterized mathematical function. We utilise a statistical distribution to assess the likelihood of a specific value. The most common distributions are –
There are two types of estimators: point estimates and confidence interval estimates. While confidence interval estimates give a range, point estimates are simply a single number that indicates where you expect your population parameter to be. Since point estimates can be unreliable, confidence intervals describe reality far more accurately. The point estimate sits exactly at the centre of the confidence interval. As an illustration, stating that I spend 350 rupees per day on transportation uses a point estimate, but stating that I spend between 300 and 400 rupees per day uses a confidence interval estimate.
The estimators with the lowest bias and highest efficiency are the most accurate. Without surveying the full population, you can never be entirely confident. We want to be as precise as possible. Most of the time, a confidence interval will produce reliable results. A point estimate, however, will nearly always be inaccurate but is easier to comprehend and convey.
The range that you anticipate the population parameter to fall inside is known as a confidence interval. The margin of error is what we will add or subtract from our guess to create our confidence interval. For example, according to a poll, a particular candidate will likely win an election with 51% of the vote. The inaccuracy is 4%, and the degree of confidence is 95%. Let's assume that the survey was conducted again using the same methods. The pollsters would anticipate that 95% of the time, the results would be within 4% of the declared outcome. In other words, they would anticipate the outcomes to fall between 47% (51-4) and 55% (51+4). Margin of error can be calculated using either the standard deviation or the standard error.
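The margin-of-error calculation can be sketched as follows; the spend figures are hypothetical, and z = 1.96 is the standard critical value for 95% confidence:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical daily transport spends (rupees); 95% CI via mean ± z * s / sqrt(n).
spends = [300, 320, 350, 310, 340, 360, 330, 345, 355, 335]
m, s, n = mean(spends), stdev(spends), len(spends)

margin = 1.96 * s / sqrt(n)     # margin of error from the standard error
print((m - margin, m + margin)  # the confidence interval, centred on the mean
      )
```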
We might be interested in predicting the value of one variable given the value of other variables after we understand the link between two or more variables. The term "target" or "dependent" or "explained" refers to the variable that is predicted based on other variables, and "independent" or "predicting" refers to the other variables that aid in estimating the target variable. The prediction is based on an average association that regression analysis has statistically determined. The formula, whether linear or not, is known as the regression equation or the explanatory equation. Real numbers are used as the output or target values for regression operations.
Think about estimating the cost of a house, for instance. In this scenario, the house price serves as your target variable. Some potential independent variables that may aid in estimating this price are the area, the year the house was built, the number of bedrooms and bathrooms, the neighbourhood, etc. Other instances of regression include predicting retail sales based on the season or agricultural output based on rainfall.
Regression analysis operates under three different categories:
As statisticians, we are more interested in patterns and trends than in single points. Outliers are specific points that do not fit the pattern or trend that was discovered. These points may exist as a result of different measurement thresholds, extraordinary circumstances, or even experiment logging errors. Think about the collection of information about the height of pupils in a given grade. An outlier can be a data point that represents a measurement that was taken in a different unit or a pupil who is noticeably taller or shorter than their peers.
Metrics are intended to evaluate business performance. For example, average sales per customer is a metric which is a useful measure having business meaning. Comparative analysis makes great use of metrics. Key performance indicators, or KPIs, are a group of metrics that are in line with a certain business goal. The key reflects our primary business objective, and performance indicators show how well we have accomplished over the course of a given period of time. For instance, KPIs will identify the traffic generated just from users who have clicked on a link provided in our ad campaign, while metrics will describe the traffic of the page from our website that was visited by any sort of users.
A sample is referred to as a subset of a population. These samples are drawn from a population and need to be a good representative of the actual population. For example, consider that we are collecting feedback about a university from a group of students. We notice that there are students present in the cafeteria or the library from whom we can gather feedback. But this feedback might come with a bias. To get a true representation of the population, the feedback should also include students who are attending lectures, or even bunking them. When the sample contains data points only from a specific section of the population, it is said to be a biased sample.
According to the Pareto principle, 20% of causes account for about 80% of the consequences for most outcomes. This theory describes an unbalanced link between inputs and outputs: most things in life are not distributed evenly, with some contributing more than others. This is an observation, not a rule. For example, we can state that the maximum revenue of an organisation comes from a handful of its overseas clients.
The events "A occurs" and "A does not occur" are complementary to one another. An event and its complement are mutually exclusive. For instance, when rolling a die, getting an odd number (1, 3, or 5) and getting an even number (2, 4, or 6) cannot occur together and are complementary to one another.
There will be a 5/12 chance. When we throw two dice, there are a total of 36 potential outcomes. There are 15 scenarios out of these 36 possible outcomes where the sum is more than 7. The result is 15/36 or 5/12 when the number of favourable outcomes is divided by the total outcomes.
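The count of favourable outcomes can be verified by enumerating all 36 equally likely rolls:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))

# Count the rolls whose sum exceeds 7.
favourable = sum(1 for a, b in outcomes if a + b > 7)
print(favourable, Fraction(favourable, len(outcomes)))  # 15 and 5/12
```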
Normal distribution is a symmetrical bell-shaped curve representing frequencies of different classes from the data. Some of the characteristics of normal distribution include:
Normal distribution is one of the most significant probability distributions in the study of statistics. This is because a number of natural phenomena fit the normal distribution. For instance, the normal distribution is observed for the heights and weights of an age group, test scores, blood pressure, and the income of individuals; even the totals from repeatedly rolling dice or tossing coins approach a normal distribution as the number of trials grows. The normal distribution provides a good approximation when the sample size is large.
The distribution moves to either side of the horizontal axis if we adjust the mean while maintaining the same standard deviation. The graph is shifted to the right by a higher mean value and to the left by a lower mean value.
The graph reshapes when the standard deviation changes while the mean remains constant. When the standard deviation is lower, more data are seen in the centre and have thinner tails. The graph will flatten out with more points at the ends or better tails and fewer points in the middle as a result of a larger standard deviation.
Outlier is an observation which is well separated from the rest of the data. The interpretation of an outlier takes into account the purported underlying distribution. Outliers can be dealt with primarily in two ways: first, by adapting techniques that can handle the existence of outliers in the sample, and second, by attempting to remove the outliers. We know that outliers have a significant impact on our estimation. Instead of following the sample or population, these observations have an impact on the predictions. The removal of an outlier from our sample is frequently not the best option, therefore we either employ techniques to mitigate their negative effects or use estimators that are insensitive to outliers.
Skewness is a measure of asymmetry that indicates whether the data is concentrated on one side. It allows us to get a complete understanding of the distribution of data. Based on the type, skewness is classified into three different types.
Positive skewness or right skew
Outliers at the top end of the range of values cause positive skewness. Extremely high numbers skew the graph to the right, showing that outliers are present. In this case, the higher numbers pull the mean above the median.
No skewness or zero skew
This is the case where skewness is absent. It denotes a distribution that is symmetric around the mean. As a result, the three values (mean, median, and mode) all coincide.
Negative skewness or left skew
Outliers near the lower end of the values cause negative skewness. Extremely low numbers skew the graph to the left, indicating that outliers are present. In this case, the mean is smaller than the median because the lower values pull the mean below the central value.
In probability theory and statistics, a central moment is a moment of a probability distribution of a random variable about the random variable's mean.
For univariate analysis of a numerical variable, the must-use visualizations are the histogram and the box and whisker plot (or box plot). Scatter plots are used to analyse the relationship between two numerical variables.
A histogram is a graphic representation of the distribution of data that has been grouped into classes. It is a type of frequency chart made up of a number of rectangles. Each piece of data is sorted, then each value is assigned to the proper class interval. The frequency of each class interval is determined by the number of data values that fall within it. Each rectangle in the histogram represents a specific class of data, and the width of the rectangle represents the width of the class. It is commonly used to determine the shape and spread of the underlying distribution.
Box and whisker plot (box plot)
A box plot shows the maximum and minimum values, the first and third quartiles, and the median value, which is a measure of central tendency. In addition to these quantities, it also explains the symmetry and variability of the data distribution. Outliers in the dataset are frequently visualised using this visualization.
The scatterplot is a very helpful and effective tool that is frequently used in regression analysis. A pair of observed values for the dependent and independent variables are represented by each point. Before selecting a suitable model, it enables graphically determining whether a relationship between two variables exists. These scatterplots are also very helpful for residual analysis because they let you check whether the model is a good fit or not.
Covariance and the correlation coefficient reveal the relationship between two variables and the strength of that relationship.
Covariance is a measure of how two random variables in a dataset change jointly. A positive covariance means the variables are positively correlated and move in the same direction. A negative covariance denotes an inverse relationship, with the variables moving in opposite directions. For instance, a student's performance on an examination improving with increased attendance is a positive relationship, whereas demand for an item falling as its price rises is a negative one. When the covariance is zero, the variables are said to be independent of one another and have no influence on each other.
| Covariance Value | Effect on Variables |
| --- | --- |
| Cov(X, Y) > 0 | Positive correlation (X and Y move together) |
| Cov(X, Y) = 0 | No correlation (X and Y are independent) |
| Cov(X, Y) < 0 | Negative correlation (X and Y move in opposite directions) |
The correlation coefficient conveys similar information to the covariance. Its benefit over covariance is that it always takes a value between negative one and one. A correlation coefficient of 1 indicates a perfect positive correlation between the variables under study: as one moves, the other moves proportionally in the same direction. A coefficient between zero and one indicates a positive but less than perfect correlation, which grows stronger as the coefficient approaches one. A coefficient of zero means there is no observable relationship between the variables: the movement of one tells us little about the movement of the other. A coefficient of negative one indicates a perfect negative (inverse) correlation: one variable drops proportionally as the other rises, so the variables move in opposite directions. A coefficient between negative one and zero indicates an imperfect negative correlation, which grows stronger as it approaches negative one.
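Both quantities can be computed from first principles; the attendance and marks values below are hypothetical:

```python
from statistics import mean, stdev

def covariance(x, y):
    """Sample covariance: average co-deviation from the means (n - 1 denominator)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def pearson_r(x, y):
    """Correlation coefficient: covariance rescaled to lie in [-1, 1]."""
    return covariance(x, y) / (stdev(x) * stdev(y))

attendance = [60, 70, 75, 80, 90]  # hypothetical attendance percentages
marks      = [55, 65, 70, 78, 88]  # hypothetical marks

print(covariance(attendance, marks))  # positive: the variables move together
print(pearson_r(attendance, marks))   # close to +1: a strong positive relationship
```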
Think about a sample taken from a population as a whole with a mean value. It's possible that we'll obtain an entirely different mean if we take another sample from the same population. Let's say you gathered ten distinct samples. You'll observe that the sample mean is influenced by the members belonging to their own sample. Hence, using just one value is not the best course of action.
A fresh dataset of sample means is produced by the new samples that were collected. There is a certain distribution of these values. The phrase "sampling distribution" is used to describe a distribution made out of samples. We are dealing with a sampling distribution of the mean in this instance. These values are distinct when we look at them closely, but they are centred on one particular value.
Every sample mean in this analysis approximates the population mean. The value they centre on may provide a very accurate indication of the population mean. In fact, we anticipate getting a pretty accurate approximation of the population mean if we take the average of those sample means. When we visualise the distribution of the sample means, we see a normal distribution, and the central limit theorem confirms that. The sampling distribution of the mean will resemble a normal distribution regardless of the underlying population distribution, whether it be binomial, exponential, or another type.
As a result, even when the population is not normally distributed, we can still conduct tests, work through issues, and draw conclusions using the normal distribution according to the central limit theorem.
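A quick simulation illustrates this: sample means drawn from a heavily right-skewed exponential population (whose population mean is 1) still centre on the population mean:

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the simulation is reproducible

# 1000 samples of size 50 from an exponential population with mean 1.
sample_means = [mean(random.expovariate(1.0) for _ in range(50))
                for _ in range(1000)]

# The sampling distribution of the mean centres on the population mean,
# even though the underlying population is far from normal.
print(mean(sample_means))
```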
There are various performance measures or metrics that can help evaluate the performance of a classification model. However, the choice depends on the kind of problem we are dealing with. At times, accuracy alone might not be a good evaluator, and we need to focus on certain aspects of the results rather than the accuracy as a whole. The most common metrics used for the purpose are –
A confusion matrix is one of the evaluation techniques for machine learning models in which you compare the results of all the predicted and actual values. Confusion matrix helps us to derive several different metrics for evaluation purpose such as accuracy, precision, recall, and F1 score which are widely used across different classification use cases.
ROC AUC curve
The Receiver Operating Characteristic (ROC) curve is a probability curve that separates the signal from the noise by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at different threshold values. The Area Under the Curve (AUC) measures a classifier's capacity to distinguish between classes: the higher the AUC, the better the model separates the positive and negative classes across thresholds. When AUC equals 1, the classifier discriminates perfectly between all positive and negative class points. When AUC equals 0, the classifier predicts all negatives as positives and vice versa.
Jaccard Index is also known as the Jaccard similarity coefficient. If y is the actual label set and ŷ is the predicted label set, then the Jaccard index is defined as the size of the intersection divided by the size of the union of the two labelled sets.
Consider that you have a total of 50 observations, out of which your model predicts 41 correctly. The Jaccard index is then 41 / (50 + 50 - 41) = 0.69, indicating 69% similarity between the predicted and actual label sets. The Jaccard index ranges from 0 to 1, where an index of 1 implies a perfect prediction.
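A minimal sketch of this computation, with hypothetical label vectors arranged so that 41 of 50 predictions agree:

```python
def jaccard_index(y_true, y_pred):
    """|intersection| / |union|, counting per-position label agreement."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    n = len(y_true)
    return correct / (n + n - correct)

# Hypothetical labels: 50 observations, 41 predicted correctly.
y_true = [1] * 50
y_pred = [1] * 41 + [0] * 9
print(round(jaccard_index(y_true, y_pred), 2))  # 0.69
```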
Log loss, or logarithmic loss, measures the performance of a classifier whose predicted output is a probability value between 0 and 1. We can calculate it using the log loss equation, which measures how far each predicted probability is from the actual label. An ideal classifier has a log loss close to 0, so the classifier with the lower log loss is making better-calibrated predictions.
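A hand-rolled version of the binary log loss formula (the label and probability vectors are hypothetical):

```python
from math import log

def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels; lower is better."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip probabilities to avoid log(0)
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)

confident = log_loss([1, 0, 1], [0.95, 0.05, 0.9])   # near-certain, correct
hesitant  = log_loss([1, 0, 1], [0.6, 0.4, 0.55])    # correct but unsure
print(confident, hesitant)  # the confident classifier scores lower
```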
Confusion matrix is one of the evaluation methods for machine learning models that compares the outcomes of all the expected and actual values.
A binary confusion matrix has four different cases:
In this matrix, the values in green are correctly identified by the model and the values in red are wrongly identified by the model. Confusion matrix can also be used for non-binary target variables.
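The four cases and the metrics derived from them can be computed directly; the label vectors below are hypothetical:

```python
# Hypothetical binary labels: 1 = positive, 0 = negative.
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# The four cells of the confusion matrix.
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives

# Metrics derived from the matrix.
accuracy  = (tp + tn) / len(actual)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
print(tp, tn, fp, fn, accuracy, precision, recall)
```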
A Type 1 Error, also known as a False Positive, occurs when the predicted value is positive but the actual value is negative. A Type 2 Error, also known as a False Negative, occurs when the predicted value is negative but the actual value is positive. For instance, if we consider rain to be the positive event, then your device predicting that it would rain today when it didn't actually rain is a Type 1 error, while your device predicting that it wouldn't rain today when it actually did is a Type 2 error.
The performance of a regression model is evaluated based on how close or far the predictions are from the actual values. Primarily, there are three metrics widely used to evaluate the performance of regression tasks, namely, Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). All three of these metrics evaluate the distance between the predictions and the actual values. MAPE is generally used in tasks where outliers are present, such as time-series data. MSE and RMSE are used when outliers are absent or relatively rare. RMSE is generally preferred over MSE as it expresses the metric in the same unit as the target variable, which makes comparison easier.
The R2 score, often known as the R-squared value, indicates how much of the overall variability is explained by the regression. It is a relative measure with values ranging from 0 to 1. An R2 score of 0 indicates that our regression model explains none of the data's variability. An R2 score of 1 indicates that our regression model completely accounts for the data's variability, which is uncommon or practically impossible. The complexity of the problem and the number of variables employed both affect the R2 score; in general, the higher the score, the better. However, the R2 score has a drawback: it is a monotonically non-decreasing function of the number of predictors. This implies that the value of the R2 score will increase each time we include a new variable in the regression model, giving the impression that the more variables we include, the better our model will be. This isn't always the case, because the additional variable might not have much of an impact on the model. To take care of this, the adjusted R2 score is preferred, which penalises the model for using insignificant variables. This ensures that the score is higher only if we have used significant variables and avoided the insignificant ones.
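A minimal sketch of the R2 score and its adjusted variant (the actual and predicted values are made-up illustration data; n is the number of observations and k the number of predictors):

```python
def r2_score(y, y_hat):
    # R2 = 1 - SSE / SST
    mean = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    sst = sum((a - mean) ** 2 for a in y)
    return 1 - sse / sst

def adjusted_r2(r2, n, k):
    # Penalises the score for each extra predictor.
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

y     = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.8, 5.1, 7.2, 8.9]
r2 = r2_score(y, y_hat)
print(round(r2, 3))  # 0.995
print(adjusted_r2(r2, n=4, k=1))  # slightly lower than the plain R2
```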
Also known as the F1-Score or the F-Score, the F-Measure is a numerical value. It evaluates how accurate a test is. In a perfect scenario, both the precision and recall values would be high. However, there is always a trade-off between recall and precision, and unfortunately, we must prioritise one over the other. The two components of the F1 score are precision and recall. The F1 score aims to combine the precision and recall measures into a single metric. This F-score is what we use to compare two models. In terms of the formula, the F1 score is the harmonic mean of precision and recall, given by F1 = 2 × (precision × recall) / (precision + recall).
The value of F1 score ranges between 0 and 1. An F1 score of 1 is regarded as ideal, whereas a score of 0 indicates that the model is a complete failure.
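The harmonic-mean behaviour can be sketched directly; note how one weak component drags the score down (the precision/recall values are made-up illustration data):

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall: dominated by the smaller value.
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.5, 0.5))            # 0.5 — balanced components
print(round(f1_score(0.9, 0.1), 2))  # 0.18 — one weak metric dominates
```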
The sum of the square differences between the observed dependent variable and its mean is known as the sum of square total (SST). It is a measurement of the dataset's overall variability.
The sum of the squared differences between the predicted values and the dependent variable's mean is known as the sum of squares due to regression (SSR). It explains how well the data fit our regression line. If this value is the same as the SST, our regression model perfectly captures the observed variability.
The sum of the squared differences between the actual values and the predicted values is known as the Sum of Squared Error (SSE). Usually, we wish to reduce this error. The regression's estimating power increases as the error decreases.
The overall variability of the data set provided by SST is equal to the sum of the variability described by the regression line, or SSR, and the unexplained variability, or SSE.
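The identity SST = SSR + SSE can be verified numerically for an ordinary least squares fit with an intercept (the tiny dataset and the closed-form simple regression below are an illustration):

```python
# Fit y = a + b*x by simple least squares, then check SST = SSR + SSE.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
    / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
y_hat = [a + b * xi for xi in x]

sst = sum((yi - my) ** 2 for yi in y)               # total variability
ssr = sum((yh - my) ** 2 for yh in y_hat)           # explained variability
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variability
print(abs(sst - (ssr + sse)) < 1e-9)  # True
```

The decomposition holds exactly for OLS with an intercept; it need not hold for arbitrary predictions.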
A regression model can be assessed using the MAPE, MSE, and RMSE metrics. In statistics, the Mean Absolute Percentage Error, or MAPE, is a measure of forecasting accuracy, typically expressed as a ratio. The Mean Absolute Error (MAE), also known as the average absolute deviation, is usually preferred when several outliers are present. Both the MAE and the RMSE measure the distance between two vectors — the vector of predictions and the vector of target values. The RMSE corresponds to the l2 norm and the MAE to the l1 norm. The RMSE works exceptionally well and is typically preferred when outliers are exceedingly rare, as with data following a bell-shaped curve.
Distance is a measurement of how far apart two objects are. Therefore, a distance is equivalent to a real number. If both objects are the same, it is zero; otherwise, it is positive. The distance measures that are most frequently used are the Euclidean, Manhattan, Minkowski, and Hamming distances. The Minkowski distance can be thought of as a generalisation of both the Manhattan distance and the Euclidean distance. The Minkowski distance for p = 1 is also known as the Manhattan distance, the L1 norm, or the absolute distance. For p = 2, it is the Euclidean distance.
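The Minkowski family can be sketched in a few lines, recovering the Manhattan and Euclidean special cases (the two points are made-up illustration data):

```python
def minkowski(u, v, p):
    # Minkowski distance: p = 1 gives Manhattan, p = 2 gives Euclidean.
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

u, v = (0, 0), (3, 4)
print(minkowski(u, v, 1))  # 7.0 — Manhattan distance
print(minkowski(u, v, 2))  # 5.0 — Euclidean distance
```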
One common way to identify outliers is through the internal limits or whiskers derived from the interquartile range. The interquartile range (IQR) represents the middle 50% of the observations, with the median located at its centre. When there are minimum or maximum values at the extreme ends, we can define the cut-off values that determine the outliers using the formula: lower limit = Q1 − 1.5 × IQR and upper limit = Q3 + 1.5 × IQR.
These two data values are known as adjacent points. If we find observations outside of the interval between the lower limit and the upper limit, then it can be termed as outliers in the dataset.
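The fences above can be sketched as follows (quartiles are estimated here with simple linear interpolation; the data list is made-up illustration data containing one obvious outlier):

```python
def iqr_fences(values, k=1.5):
    # Returns (lower, upper); observations outside this interval
    # are flagged as outliers.
    s = sorted(values)

    def quantile(q):
        pos = q * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
lower, upper = iqr_fences(data)
outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # [102]
```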
Precision is the ratio of the correctly identified positive classes to the sum of the predicted positive classes. The predicted positive classes are the ones predicted positive irrespective of the actual value being positive or negative, that is, the true positive and false positive classes. This ratio tells us, of all the classes we predicted as positive, how many are actually positive.
Recall is the ratio of the correctly identified positive classes to the sum of the actual positive classes. The actual positive classes can be predicted as positive or negative, that is, the true positive and false negative classes. This ratio tells us, of all the actual positive classes, how many we predicted correctly.
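The two ratios can be sketched from the confusion-matrix counts (the counts below are made-up illustration data):

```python
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of everything predicted positive, how much was right
    recall = tp / (tp + fn)     # of everything actually positive, how much we found
    return precision, recall

p, r = precision_recall(tp=8, fp=2, fn=4)
print(p)            # 0.8
print(round(r, 2))  # 0.67
```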
The OLS assumptions for a linear regression are divided into five assumptions:
Linearity
The regression assumes that the relationship between the variables is linear. For relationships of higher degree, linear regression will not produce good predictions.
No endogeneity
The issue of endogeneity arises when we have a variable that is related to the target and also to the predictors but is not included in the model. Therefore, endogeneity is a situation in which a predictor in a linear regression model is correlated with the error term. We call such predictors endogenous variables.
Normality and homoscedasticity
This assumes that the error term is normally distributed, and the expected value of error is 0, meaning that we expect to have no error on average. Homoscedasticity assumes that the variance is constant for the error term.
No autocorrelation
This assumes that the covariance between any two different error terms is zero.
No multicollinearity
When two or more variables in our regression are strongly correlated, this situation is referred to as multicollinearity. The OLS assumptions require that there are no strongly correlated variables in our analysis.
One can standardise any distribution. The process of standardisation involves transforming the variables to one with a mean of zero and a standard deviation of one.
Standardisation is also possible for normal distributions. The result is known as a standard normal distribution, represented by the letter Z. The standardised variable is referred to as the Z-score, and the formula for standardising variables is defined by the Z-score. We first determine the variable's mean and standard deviation. The mean is then subtracted from each observed value of the variable, and the result is divided by the standard deviation.
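The Z-score procedure can be sketched and checked directly — the standardised values should have mean 0 and standard deviation 1 (the input list is made-up illustration data):

```python
def standardise(values):
    # Z-score: subtract the mean, then divide by the standard deviation.
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

z = standardise([2, 4, 4, 4, 5, 5, 7, 9])
mean_z = sum(z) / len(z)
std_z = (sum(v ** 2 for v in z) / len(z)) ** 0.5
print(round(mean_z, 10), round(std_z, 10))  # 0.0 1.0
```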
Exploratory data analysis is a method of data analysis in which the properties and qualities of the data are examined without making any attempt to fit the data to a specific model. This method's guiding principle is to analyse the data before using a particular model. We use numerical and graphical processes in exploratory data analysis to better comprehend the data. The emphasis on visual data representation has been a key component in the development of exploratory data analysis.
Skewed data refers to data whose distribution is asymmetric and which often contains outliers. Outliers can have a negative influence on the model's predictions and may need to be eliminated. However, it is not always advisable to remove the outliers, so we can instead handle them through certain transformations. The common transformations applied to the data are the log transformation, square root transformation, and Box-Cox transformation.
The logarithmic transformation is one of the most helpful and popular transformations. In fact, it might be a good idea to use the dependent variable's logarithm as a replacement before doing a linear regression. A similar operation would stabilise the target variable's variance and bring the transformed variable's distribution closer to normal.
Square root transformation
If there are outlier values that are exceptionally large, you might consider the square root transformation, which can scale them down to much lower values in comparison. A limitation of this transformation is that the square root of a negative number is not a real number.
Box-cox transformation is yet another transformation method that can help to transform skewed data into normal. It has a controlling parameter lambda which ranges between -5 to 5. Initially, it was only used in presence of positive values, but modifications have been made to the transformation to take care of the negative values as well.
The likelihood of an event given that another event is known to have occurred is known as conditional probability. The expression "probability of A given that B has already occurred" refers to the conditional probability, which is written as P(A|B). Conditional probabilities are helpful when determining the likelihood of an event when a piece of information is already known, either entirely or partially. Examples include determining the probability of getting the number 5 on the second throw given that we got the number 6 on the first throw, or drawing a red ball given that the first two balls drawn were blue and green.
A bag contains 4 white, 5 red, and 6 blue balls. Three balls are drawn at random from the bag. The probability that all of them are red is?
In this case, we have total 15 balls. The sample space will be the number of ways in which we can draw three balls from the bag of 15 balls. This can be done in 15C3 ways which is equal to 455. Now, the event of getting three red balls implies that we are trying to draw three red balls out of the 5 red balls present in the bag. We can draw three red balls among 5 red balls in 5C3 ways, equal to 10. The final result will be the number of possibilities of drawing three red balls divided by the total possibilities, that is, 10/455 or 2/91.
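The counting argument above maps directly onto the standard library's combination function:

```python
from math import comb

# P(all three drawn balls are red) = C(5, 3) / C(15, 3)
total_ways = comb(15, 3)  # ways to draw any 3 balls from 15
red_ways = comb(5, 3)     # ways to draw 3 red balls from the 5 red ones
print(total_ways, red_ways)         # 455 10
print(red_ways / total_ways)        # 10/455 = 2/91 ≈ 0.022
```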
The strength of the linear relationship between two random variables is gauged by the simple correlation coefficient, commonly known as the Pearson correlation. Values of the correlation coefficient lie in the range [-1, 1]. The two extreme values of this interval represent a perfectly negative and a perfectly positive linear relationship between the variables, while zero indicates that there is no linear relationship. The Spearman rank correlation coefficient is a non-parametric measure of correlation, also used to establish the relationship between two variables. The Spearman rank correlation also works with monotonic or non-linear relationships, unlike the Pearson coefficient, which requires a linear relationship between the two variables.
The process of hypothesis testing enables us to either retain the null hypothesis, which serves as the starting point of our investigation, or to reject it in favour of the alternative hypothesis. A parametric test is a type of hypothesis test that assumes a specific form for the distributions of the underlying populations. A non-parametric test does not require the parametric form of the underlying population's distribution to be specified. The null hypothesis is the claim being tested, and the alternative hypothesis is the opposing claim; if the test results show that the null hypothesis cannot be supported, the alternative hypothesis is adopted. For example, if the null hypothesis states that "The mean height of men in India is 5 feet 6 inches", then the alternative hypothesis will state that "The mean height of men in India is not 5 feet 6 inches".
The interval that causes the null hypothesis to be rejected in a hypothesis test is known as the rejection region and is measured on the sampling distribution of the statistic under examination. The rejection region complements the acceptance region and is associated with a probability alpha, also known as the test's significance level or Type I error. It is a user-fixed parameter of the hypothesis test that establishes the likelihood of rejecting the null hypothesis.
A one-sided or one-tailed test on a population parameter is a type of hypothesis test in which the values for which we can reject the null hypothesis are located exclusively in one tail of the probability distribution. For instance, if "The mean height of men in India is equal to or less than 5 feet 6 inches" is the null hypothesis, then the alternative hypothesis would be "The mean height of men in India is greater than 5 feet 6 inches". This is a one-sided test because the alternative hypothesis considers only one tail of the distribution.
A two-sided test for a population is a hypothesis test used when comparing an estimate of a parameter to a given value versus the alternative hypothesis that the parameter is not equal to the stated value. If the null hypothesis is, for instance, "The mean height of men in India is equal to 5 feet 6 inches," then the alternative hypothesis would be, "The mean height of men in India is either less than or greater than 5 feet 6 inches but not equal." The alternate hypothesis, greater than or less than 5 feet 6 inches, deals with both extremes of the distribution, making this a two-tailed test.
The p-value is a probability computed under the null hypothesis. Suppose we are testing the null hypothesis at a certain significance level, alpha. The p-value is the smallest value of alpha for which we can reject the null hypothesis: if the p-value is smaller than alpha, we reject the null hypothesis; otherwise, we fail to reject it.
How do you calculate the confidence interval for a population mean with known and unknown variance?
If the sample size is large or the population variance is known, many statistical tests can conveniently be carried out as approximate Z-tests. The Student's t-test is more appropriate if the population variance is unknown (and must therefore be estimated from the sample itself) and the sample size is small (n < 30). The t-distribution depends on the sample size and approaches the z-distribution as the sample size increases. The t-statistic table becomes nearly identical to the z-statistic table after the 30th row, that is, after 30 degrees of freedom. Beyond that point, even though the population variance is unknown, we may still apply the z-distribution.
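A sketch of the known-variance case, using the familiar z critical value of 1.96 for a 95% interval (the sample mean, sigma, and n below are made-up illustration data; for unknown variance and small n, the z value would be replaced by the appropriate t critical value):

```python
import math

def ci_known_variance(mean, sigma, n, z=1.96):
    # 95% confidence interval for the population mean when the
    # population standard deviation sigma is known:
    #   mean ± z * sigma / sqrt(n)
    half = z * sigma / math.sqrt(n)
    return mean - half, mean + half

lo, hi = ci_known_variance(mean=100, sigma=15, n=36)
print(round(lo, 2), round(hi, 2))  # 95.1 104.9
```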
The F1 score is calculated as the harmonic mean of the precision and recall values. The arithmetic mean, or simple average, treats all values equally. The harmonic mean, on the other hand, gives more weight to low values. As a result, a classifier will only get a high F1 score if both recall and precision are high.
The Akaike information criterion (AIC), a refined method based on in-sample fit, is used to determine how likely it is for a model to estimate or predict future values. Another model selection criterion that assesses the trade-off between model fit and complexity is the Bayesian information criterion (BIC). We utilise either the AIC or the BIC, but not both concurrently and interchangeably, to compare models with one another. The model that has the lowest AIC or BIC of all the models is a good model.
When two or more variables in our regression are strongly correlated, this situation is referred to as multicollinearity. The effect of multicollinearity among our variables is measured by the variance inflation factor, or VIF score. It gauges how much an estimated regression coefficient's variance rises in the presence of correlation; if the variance of our coefficients rises, the model isn't performing well. A general rule of thumb frequently applied in practice is that high multicollinearity is present if the VIF score is greater than 10.
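For the special case of two predictors, the VIF of one predictor reduces to 1 / (1 − r²), where r is the Pearson correlation between the two — a simplification used here for illustration (the two predictor lists are made-up, nearly collinear data):

```python
def pearson_r(x, y):
    # Sample Pearson correlation coefficient.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    # With only two predictors, the R-squared from regressing one on
    # the other is the squared correlation, so VIF = 1 / (1 - r^2).
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]  # nearly a multiple of x1
print(vif_two_predictors(x1, x2) > 10)  # True — high multicollinearity
```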
Imbalanced classes are skewed classes where a single value, known as the majority class, makes up a significant portion of the dataset. Consider a credit card fraud dataset, where the percentage of fraud incidents in the overall data can be as low as 1%. In this situation, a model that blindly predicted every case to be legitimate would still be 99% accurate. To prevent this, we either undersample the non-fraudulent instances or oversample the fraudulent ones, allowing the minority class to make up a sizeable fraction of the training data.
The bias is defined as the difference between the actual and the predicted value of a variable; it shows how far the estimate is from the variable's actual value. Bias represents the assumptions made by the model to make the target function easier to learn. The variance measures how much the predictions would change if a different training dataset were used. Bias and variance are prediction errors of an algorithm. The bias-variance trade-off is the property of a learning algorithm which suggests that these two errors should be balanced to prevent the algorithm from overfitting or underfitting the training data. An ideal model will have low bias and low variance. Low variance and high bias suggest underfitting, while low bias and high variance mean that we are overfitting the data. Decreasing bias leads to an increase in variance and vice versa. Therefore, we need to find a balance where both these errors are at a minimum.
A resampling technique called cross-validation uses several data subsets to evaluate and train a model across a number of iterations. It is typically applied in situations where the objective is prediction, and one wishes to evaluate how well a predictive model will function in real-world situations. Due to sampling variability between training and test set, our model gives better prediction on training data but fails to generalize on test data. This leads to low training error rate and high test error rate. When we split the dataset into training, validation and test set, we only use a subset of data. To overcome these issues, we can adopt various cross validation approaches, namely, K-fold cross validation, stratified k-fold cross validation, leave one out cross validation, stratified shuffle split, etc.
Leave One Out Cross Validation (LOOCV)
A dataset with n observations is split so that n-1 observations form the training data and the remaining 1 observation forms the test data. The process is iterated for each data point, so the execution is expensive. Also, when an outlier lands in the test data, the variability in MSE is much higher.
K-Fold Cross Validation
Randomly divides the data into k groups or folds of equal size. The first fold is kept for testing and the model is trained on the remaining k-1 folds. The process is repeated k times, with a different fold used for validation each time. Typically, k in k-fold is 5 or 10. LOOCV is a variant of k-fold where k = n. K-fold is less computationally expensive than LOOCV, though it can still be costly for large datasets.
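The index bookkeeping behind k-fold can be sketched as follows (shuffling is omitted for clarity; in practice the indices would usually be permuted first):

```python
def k_fold_indices(n, k):
    # Split indices 0..n-1 into k folds; each fold serves once as the
    # validation set while the remaining folds form the training set.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, val))
    return splits

splits = k_fold_indices(n=6, k=3)
for train, val in splits:
    print(val)  # [0, 1] then [2, 3] then [4, 5]
```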
Stratified K-fold Cross Validation
Each fold in the dataset has at least m instances of each class. This approach ensures that one class of data is not over-represented especially when the target variable is unbalanced.
Parametric models make assumptions of the underlying distribution and consists of fixed learning parameters. These models often have high bias and low variance which makes them prone to underfitting. Linear regression is an example of a parametric model since it assumes the underlying data distribution to be linear.
Non-parametric models do not make any assumptions about the data. Instead, they are free to learn but are controlled by some hyperparameters. These models often have low bias and high variance, which makes them prone to overfitting. Decision trees are an example of non-parametric models.
Consider a coin that has a slightly biased chance of landing on its head (51%) and a slightly biased chance of landing on its tail (49%). If you toss it 100 times, you will tend to get roughly 51 heads and 49 tails — mostly heads — but for a small number of trials, this might not be the case. With a disproportionately large number of trials, the relative frequency of heads will approach 51%, and the approximation improves with each further coin toss. According to Bernoulli's theorem, which is a condensed version of the law of large numbers, as the number of Bernoulli trials approaches infinity, the relative frequency of success in a series of trials approaches the probability of success.
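The convergence can be sketched with a quick simulation (seeded for reproducibility; the toss counts are arbitrary illustration choices):

```python
import random

random.seed(0)

def head_frequency(p_head, n_tosses):
    # Simulate n_tosses biased coin flips and return the observed
    # relative frequency of heads.
    heads = sum(1 for _ in range(n_tosses) if random.random() < p_head)
    return heads / n_tosses

small = head_frequency(0.51, 100)
large = head_frequency(0.51, 1_000_000)
# With far more tosses, the observed frequency settles near the true 51%.
print(small, large)
```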
Autocorrelation is a measure of the correlation of a particular time series with the same time series delayed by k lags. It is calculated by dividing the autocovariance between the current values and the values lagged by k by the variance of the series. By calculating the autocorrelation for all values of k, we obtain the autocorrelation function. For a stationary time series, the autocorrelation function falls off towards zero as the lag increases.
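A minimal sketch of the lag-k autocorrelation calculation (the alternating series below is made-up illustration data, chosen because its sign pattern is easy to verify by hand):

```python
def autocorrelation(series, lag):
    # Lag-k autocorrelation: autocovariance of the series with its
    # lagged copy, divided by the variance of the series.
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return cov / var

# A perfectly alternating series is strongly negatively autocorrelated
# at lag 1 and positively autocorrelated at lag 2.
series = [1, -1, 1, -1, 1, -1, 1, -1]
print(autocorrelation(series, 1))  # -0.875
print(autocorrelation(series, 2))  # 0.75
```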
If a time series' mean, variance, and covariance remain constant across time, the data is said to be stationary. The Dickey-Fuller test is a typical stationarity test. If the data is non-stationary, the first step is to apply differencing to the data; we can keep differencing and re-testing until we reach stationarity. Each stage of the differencing process, however, results in the loss of one row of data. We can also apply a seasonal difference for data showing a seasonal pattern. For instance, if we had monthly data with yearly seasonality, we could difference by a lag of 12 rather than 1.
Bayes' theorem provides a formula for the likelihood that an occurrence is the direct outcome of a given condition if we take into account the set of conditions that an event occurs. So, it is possible to think of Bayes' theorem as a formula for the conditional probability of an occurrence. For instance, let us consider there are 10 bags containing different coloured marbles. Bayes’ theorem helps to determine the probability of drawing the marble from a particular bag, given the condition that the marble is red in colour. If A is the event of drawing the marble from a particular bag and B is the event of drawing a red marble, then the formula for Bayes’ theorem is given by –
P(A|B) = P(B|A) × P(A) / P(B), where P(A|B) is the probability of event A occurring given that event B has already occurred.
The following procedures are typically used when evaluating hypotheses about a sample:
We might also come across data that aren't normally distributed. There is a need for alternative techniques that are better suited to examining the discrepancies between expected and observed frequencies. A non-parametric statistical test is one in which no assumptions are made about particular parameter values. One of the easiest and most widely used non-parametric tests in statistical research is the chi-square test. The chi-square distribution is a continuous probability distribution that is positively skewed; it approaches a normal distribution as n, the number of degrees of freedom, approaches infinity. The goodness of fit test, which compares observed frequencies and hypothetical frequencies of particular classes, is one of many methods for assessing hypotheses that uses the chi-square distribution. Additionally, it is used to assess the independence of two variables and to compare the observed variance with the hypothetical variance of samples with normally distributed data.
Clustering is the division of a data collection into subsets or clusters so that, according to an established distance metric, the degree of association is strong between members of the same cluster and weak between members of different clusters. There are several ways to carry out cluster analysis, like partitional clustering, hierarchical clustering, etc. To perform cluster analysis on a set of n objects, we must establish a distance between the objects that need to be categorised; it is expected that the collection of objects contains some sort of structure. In the single linkage technique, the distance between two clusters is given by the Euclidean distance between their two closest members. In the complete linkage method, the distance between two clusters is given by the Euclidean distance between their two members that are furthest apart.
The analysis of variance, also known as ANOVA, is an effective statistical method for significance testing. Only the significance of the difference between two sample means may be tested using the t-distribution-based test of significance. The hypothesis that all the samples are taken from the same population, i.e., they have the same mean, needs to be tested using a different approach when we have three or more samples to take into account at once. Therefore, the basic goal of the analysis of variance is to examine the homogeneity of three or more means.
Time-series data is data that is collected at different points in time. Autocorrelation, seasonality, and stationarity are three main properties of a time series.
Autocorrelation refers to the similarity between observations as a function of the time lag between them. In an autocorrelation plot, peaks at regular lags reveal the period of the series.
Seasonality refers to periodic fluctuations. Period can give the length of the season. For instance, the amount of electricity consumed varies greatly from summer to winter, and online sales peak around Diwali before dipping again.
Stationary means that statistical properties do not change over time, that is, constant mean and variance, and covariance is independent of time. For example, stock prices are not a stationary process. For modelling, we would prefer to have a stationary time series. However, there are other transformations we can apply to make them stationary.
The Three Sigma Rule, sometimes known as the empirical rule, states that for a normal distribution, 68% of the data will lie within one standard deviation of the mean, 95% of the data within two standard deviations, and approximately 99.7% within three standard deviations. For example, an RMSE equal to 50 means that about 68% of the system's predictions fall within 50 of the actual value, about 95% of the predictions fall within 100 of the actual value, and about 99.7% of the predictions fall within 150 of the actual value.
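The three coverage percentages can be recovered from the standard normal CDF via the error function:

```python
import math

def within_k_sigma(k):
    # P(|Z| < k) for a standard normal variable Z, via the error function.
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(within_k_sigma(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```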
The statistical field of survival analysis examines how long it is likely to be before an event occurs. Survival analysis is also known as time-to-event analysis. In a survival study, the length of time it takes for an event to occur is a key factor. The event we are usually interested in is death or failure. For instance, determining when a person will pass away following a diagnosis of a sickness or the failure of an appliance.
The survival function is estimated by Kaplan Meier curves. The survival function is graphically represented by the Kaplan-Meier curve. It displays the likelihood that a subject will live until time t. Plotting the survival function against time leads to the formation of the curve.
Operations research is an area of applied mathematics that makes use of scientific techniques to offer a foundation for making decisions. In order to discover the optimum approach to accomplish a task, it is frequently applied to complicated issues involving the organisation of personnel and equipment. Simulation, optimization, linear programming, nonlinear mathematical programming, game theory, and other techniques are all included in operations research approaches.
When the correlation between two variables is measured while accounting for one or more additional variables, it is called partial correlation. The partial correlation between variables X and Y given a third variable Z measures the direct relationship between X and Y after removing the effects of their linear relationships with Z.