Question 1

How would you explain statistics as a field of study?

Accepted Answer

The study of statistics focuses on gathering, organising, analysing, interpreting, and presenting data.

Question 2

How are statisticians different from data analysts?

Accepted Answer

A data analyst typically works within a particular business vertical, while a statistician works with data irrespective of the industry vertical.

Question 3

What role do you expect to perform as a statistician?

Accepted Answer

To perform exploratory data analysis to understand the data, relationship and its distribution as well as provide predictions based on these relationships.

Question 4

Where can we include statistics in machine learning?

Accepted Answer

Statistics forms the base of machine learning. The predictive modelling in machine learning takes into account the concepts from inferential statistics.

Question 5

Can we understand data science without knowing statistics?

Accepted Answer

No, it is highly advised to know the basics of statistics before jumping to data science. Statistics is the key to understanding data and therefore, one must be aware of descriptive statistics to learn data science.

Question 6

What is the difference between inferential and descriptive statistics?

Accepted Answer

The study of data analysis, visualisation, presentation, and interpretation is a key component of statistics. Descriptive statistics and inferential statistics are the two subcategories of statistics. In descriptive statistics, we primarily use numerical metrics, graphs, plots, tables, etc. to organise and summarise data. For instance, a bar graph with specific numbers can be used to summarise the sales results for the financial year. Inferential statistics uses sample data to estimate or infer characteristics of the population. For instance, past sales data can be analysed to forecast the sales figures for the next few quarters.

Question 7

How is analysis different from analytics?

Accepted Answer

Analysis is the process of breaking a dataset that has already been collected into smaller parts and examining them. We conduct analysis to understand how and/or why something happened. Analytics, on the other hand, typically looks to the future rather than interpreting the past. Analytics is the application of logical and computational reasoning to the individual pieces obtained from an analysis in order to look for patterns and investigate what we may do with them in the future.

Question 8

Explain the difference between information and data?

Accepted Answer

Data is a raw, unprocessed, and unorganised fact that must be processed in order to give it meaning. Data can take the form of a number, picture, word, graph, etc. Information is processed data: it is produced by manipulating raw data and has context, meaning, and purpose. Data can be thought of as the raw material for producing information. For instance, the data may be the volume of sales occurring at a store, and the information deduced from this data could be the average sales.

Question 9

What is the difference between a sample and a population?

Accepted Answer

Any statistical analysis you conduct begins by determining whether you are working with a population or a sample of data. A population, whose size is typically denoted by an uppercase "N", is the totality of the things relevant to our study. A sample, whose size is denoted by a lowercase "n", is a subset of the population. Let's think about a voting example. Who wins the nomination is determined by the final vote results, and these results are based on the population. The people who show up to vote make up the population in this case. However, before the results are released, numerous parties conduct surveys to predict the winner. Such a survey is conducted on a small part of the population, say 20%. This portion of the population is referred to as a sample. The field of statistics deals with sample data.

Question 10

Explain the different types of variables and how they are broadly classified.

Accepted Answer

Different types of variables require different statistical and visualization approaches. Variables are broadly divided into numerical and categorical.

Numerical data represents numbers or figures, for example, sales amount, height of students, salary, etc. A numerical variable is further divided into two subsets, discrete and continuous. The number of students in a class or the results of a test are two examples of discrete data, which can typically be counted in a finite way. Continuous data cannot be counted because it can take any value within a range; it is measured instead. For instance, continuous variables such as a person's weight, a region's area, etc. can vary by arbitrarily small quantities.

Categorical data represents categories, including things like gender, email type, colour, and more. Categorical data is further divided into nominal and ordinal variables. Ordinal categorical variables can be displayed in a certain order, such as when a product is rated as either awful, satisfactory, good, or excellent. Nominal variables can never be arranged in a hierarchy. For instance, a person's gender.

Question 11

Explain different forms of a bar chart.

Accepted Answer

A bar chart uses rectangular vertical or horizontal bars to represent the given data. Each bar's length is proportional to the value it corresponds to. Bar charts are used to compare values across categories. They use two axes: the categories are shown on one axis and the discrete values on the other. There are many kinds of bar chart, but the four major types are the vertical, horizontal, stacked, and grouped bar chart.

Vertical bar chart

The most popular type of bar chart is the vertical bar chart, in which the given data is displayed using vertical rectangular bars. The categories are listed along the x-axis, and the height of each bar, read against the y-axis, represents the measure of the data for its category.

Horizontal bar chart

Charts that show the given data as horizontal bars are referred to as horizontal bar charts. The categories are listed along the y-axis, and the length of each horizontal rectangular bar, read against the x-axis, represents the measure of the data for its category.

Stacked bar chart

In a stacked bar chart, each bar of a normal bar chart is divided into sub-bars that are stacked on top of one another, with each sub-bar representing a level of a second categorical variable. A stacked bar chart depicts the given values directly, whereas a 100% stacked bar chart shows each sub-bar as a percentage of the total for its category.

Grouped bar chart

A grouped bar chart makes it easier to compare data from multiple categories. For levels of a single categorical variable, bars are grouped by position, with colour often designating the secondary category level within each group.

Question 12

What is a scatter plot and how is it useful?

Accepted Answer

A scatter plot is a very important graph for understanding the relationship between two numerical variables. For example, consider the following table, which gives the attendance and the percentage of marks scored for ten students of a class.

Student	Attendance	Percentage
Student 1	78	84
Student 2	91	96
Student 3	66	70
Student 4	42	85
Student 5	90	92
Student 6	59	62
Student 7	83	75
Student 8	72	75
Student 9	94	96
Student 10	88	67

Attendance is represented on the x-axis, while the percentage of marks scored is represented on the y-axis. The scatter plot can therefore help us comprehend the relationship between the two variables. We may argue that students who attend class more frequently tend to perform better academically. We can also spot instances that are the exception rather than the rule, like Student 4.
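The relationship visible in such a plot can also be summarised numerically. The sketch below is only an illustration: it copies the ten rows of the table and computes a Pearson correlation coefficient with and without Student 4. The helper name pearson_r is an assumption made for this example, not part of the original answer.

```python
from statistics import mean

# Attendance and percentage marks for Students 1-10, copied from the table above.
attendance = [78, 91, 66, 42, 90, 59, 83, 72, 94, 88]
marks = [84, 96, 70, 85, 92, 62, 75, 75, 96, 67]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


r_all = pearson_r(attendance, marks)

# Drop Student 4 (index 3), the exception called out in the text.
r_without_4 = pearson_r(attendance[:3] + attendance[4:], marks[:3] + marks[4:])

print(f"r with all ten students: {r_all:.2f}")
print(f"r without Student 4:     {r_without_4:.2f}")
```

The coefficient is positive in both cases and rises once Student 4 is left out, which matches what the eye picks up from the plot.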

Question 13

What is a frequency distribution table? What is the purpose of it?

Accepted Answer

A frequency distribution arranges observations with similar or closely related values into separate groups, or classes, ordered by magnitude. The data are simply organised into classes in a table, and the number of cases that fall into each class is noted. The table displays how frequently the various values of a single phenomenon occur. A frequency distribution of sample data can also be used to estimate the frequencies of the unknown population distribution.

Take a survey of 50 households in a society as an example. The number of children in each family was recorded, and the results are shown in the following frequency distribution table:

No. of children	Frequency
0	12
1	24
2	13
3	0
4	1

As a result, frequency in the table refers to how frequently an observation occurs. The number of observations is always equal to the sum of the frequencies. We can evaluate the data's underlying distribution and base judgements on it with the aid of frequency distribution.
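As a rough sketch of how such a table is built, the snippet below expands the household table back into 50 raw observations and counts them again with Python's standard Counter. The variable names are illustrative assumptions.

```python
from collections import Counter

# Frequencies copied from the table above: number of children -> households.
table = {0: 12, 1: 24, 2: 13, 3: 0, 4: 1}

# Rebuild the raw observations (one entry per household).
observations = [children for children, freq in table.items() for _ in range(freq)]

# Counting the raw observations recovers the frequency distribution.
frequencies = Counter(observations)
total = sum(frequencies.values())

for children in sorted(table):
    freq = frequencies[children]  # a Counter returns 0 for values never seen
    print(f"{children}\t{freq}\t{freq / total:.0%}")

print("Total observations:", total)
```

The final line confirms that the frequencies add up to the number of observations, 50.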

Question 14

Explain the measures of central tendency.

Accepted Answer

The three measures of central tendency are mean, median, and mode.

The mean, also known as the simple average, is denoted by the Greek letter µ for a population and by x̄ for a sample. We determine a dataset's mean by adding up all its observations and dividing the result by the total number of observations. This is the most common measure of central tendency.

The median of an ordered set of data is its middle number. As a result, it divides the data into two halves: the higher and lower halves. The median of the first nine natural numbers, for instance, is five.

The mode is the value that occurs most often. Although it can be applied to both numerical and categorical data, it is typically used with categorical data. For instance, if 60% of the observations for a gender variable are male, then male will be the mode value, signifying the value of maximum occurrence.
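The three measures can be checked quickly with Python's standard statistics module. This is a minimal sketch: the first list is the first nine natural numbers mentioned above, while the gender list is a hypothetical ten-row column in which 60% of the values are "male".

```python
from statistics import mean, median, mode

first_nine = list(range(1, 10))  # 1, 2, ..., 9
mean_value = mean(first_nine)
median_value = median(first_nine)

# Hypothetical categorical column: 6 of 10 observations are "male".
gender = ["male"] * 6 + ["female"] * 4
mode_value = mode(gender)

print("mean:", mean_value)
print("median:", median_value)  # 5, as stated above
print("mode:", mode_value)      # male
```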

Question 15

When do you use mean and median?

Accepted Answer

Both the mean and the median can be used to estimate the dataset's midpoint. Depending on the type of data, one or the other may be the better choice. When the data is evenly distributed (symmetrical) and follows a distribution that is close to normal, the mean is typically used. If the data is skewed, which often indicates the presence of outliers in the dataset, it is preferable to use the median to identify the central value. Let's take 10 data scientists as an example, whose salaries are (in LPA) 12, 14, 9, 10.5, 17, 11, 8, 14, 24.5, and 65. The median salary is 13, while the mean salary is 18.5. The mean is pulled upwards by the extreme figure of 65 LPA, which can be viewed as an anomaly because the hired individual may be a graduate of a prestigious university or be posted on-site. The median, however, is barely affected. We can infer that the median in this instance represents the central value more accurately than the mean.
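The sensitivity of the mean to a single extreme value can be seen in a few lines. The sketch below uses eight of the salaries quoted above (12, 14, 9, 10.5, 17, 11, 8 and 14 LPA) and then adds the 65 LPA figure; the variable names are assumptions for illustration.

```python
from statistics import mean, median

typical = [12, 14, 9, 10.5, 17, 11, 8, 14]  # salaries in LPA, without the extreme value
with_outlier = typical + [65]               # the same salaries plus the 65 LPA figure

mean_shift = mean(with_outlier) - mean(typical)
median_shift = median(with_outlier) - median(typical)

print(f"mean moves by   {mean_shift:.2f} LPA")
print(f"median moves by {median_shift:.2f} LPA")
```

One extreme salary moves the mean by several LPA, while the median moves by only 0.5 LPA.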

Question 16

What is a categorical variable's mode when there are several values present in the majority of instances?

Accepted Answer

The most frequent value in a data set is referred to as the mode. A set of data may have one mode, multiple modes, or none at all. Multimodal refers to a set of numbers with more than one mode. Bimodal data is defined as having two modes, which means that two values equal the dataset's maximum occurrence. Similar to this, a group of numbers with three modes is referred to as trimodal. Datasets without repeated values, on the other hand, would indicate that there is no mode in the data.

Question 17

Explain the steps to find the median of the data. Also, explain what do you mean by quantiles?

Accepted Answer

To calculate the median, we first arrange the collection of numbers in ascending order. The median is the observation located in the middle of this sorted list. For an odd number of observations, the median is the element at position (n+1)/2, where 'n' is the total number of observations. If the total number of observations is even, the median is the simple average of the middle two elements, located at positions n/2 and (n/2)+1.

Quantiles are values that segment the distribution so that a specific percentage of the data falls below each quantile. The median is a quantile, for instance. It can also be referred to as the 50th percentile, the point at which half the observations are less than or equal to it and half are greater than or equal to it. Similarly, the 25th and 75th percentiles are the points below which 25% and 75% of the observations fall respectively. If we consider a data set of the first hundred natural numbers, then the 25th, 50th, and 75th percentiles will be approximately 25, 50, and 75 respectively. Quantiles that divide the data into four equal parts are referred to as quartiles.
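The median procedure can be sketched directly in code. The function name median_by_steps is an assumption made for this example; the quartiles are obtained from the standard library's statistics.quantiles, whose exact values depend on the interpolation method it uses, so they land close to, rather than exactly on, 25, 50 and 75.

```python
from statistics import quantiles


def median_by_steps(values):
    """Sort the data, then pick the middle element or average the middle two."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n + 1) / 2 is 0-based index n // 2.
        return ordered[mid]
    # Even n: average the elements at 1-based positions n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median_by_steps(range(1, 10))   # first nine natural numbers
even_median = median_by_steps([4, 1, 3, 2])  # sorted: 1, 2, 3, 4

# Three cut points that split the first hundred natural numbers into four parts.
quartiles = quantiles(list(range(1, 101)), n=4)

print(odd_median, even_median)  # 5 2.5
print(quartiles)
```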

Question 18

What do you mean by univariate, bivariate and multivariate analysis?

Accepted Answer

The examination of one, two, and more than two variables is referred to as univariate, bivariate, and multivariate analysis respectively. Since only one variable is involved, univariate analysis includes tools such as a frequency distribution table and the computation of the minimum, maximum, average value, etc. Salary analysis of employees inside an organisation, for instance, is a univariate analysis. In many situations the simultaneous study of two variables becomes necessary. For instance, we may wish to study a group of people's income together with their spending patterns, or their attendance together with their grades. Examples of bivariate analysis include scatter plots and bivariate frequency distribution charts. Multivariate analysis is used when more than two variables are being observed at once.

Question 19

What do you mean by measure of dispersion? How to measure it for a single variable?

Accepted Answer

The central tendency measure helps identify the distribution's centre, but it does not show how the items are distributed on either side of the centre. Dispersion is the term used to describe this property of a frequency distribution. The items in a series are not all equal. The values vary or differ from one another. Different measurements of dispersion are used to assess the level of variance. Large dispersion suggests less uniformity, while small dispersion indicates good homogeneity of the observations.

The most significant measures of dispersion for a single variable are the standard deviation and coefficient of variation, which are frequently employed in statistical formulas.
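As a small illustration, the sketch below compares two hypothetical sets of scores that share the same mean but differ in dispersion. The data and the helper coefficient_of_variation are assumptions; the population standard deviation (pstdev) is used here, and the sample version (stdev) would be a reasonable alternative.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation expressed as a fraction of the mean."""
    return pstdev(values) / mean(values)


# Two hypothetical sets of scores with the same mean of 50.
tight_scores = [48, 50, 52, 49, 51]
spread_scores = [10, 30, 50, 70, 90]

cv_tight = coefficient_of_variation(tight_scores)
cv_spread = coefficient_of_variation(spread_scores)

print(f"tight:  sd={pstdev(tight_scores):.2f}  cv={cv_tight:.1%}")
print(f"spread: sd={pstdev(spread_scores):.2f}  cv={cv_spread:.1%}")
```

The central tendency is identical for both sets, so only the dispersion measures reveal that the second set is far less uniform.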

Question 20

What do you mean by distribution? Mention different types of distribution.

Accepted Answer

In statistics, a distribution is a function that displays the range of potential values for a variable along with their frequency. The probability for each individual observation in the sample space can be determined using a parameterized mathematical function. We utilise a statistical distribution to assess the likelihood of a specific value. The most common distributions are –

Binomial distribution – It is a discrete distribution expressing the probability of a set of dichotomous alternatives i.e., success or failure repeated for a finite number of times.
Poisson distribution – It is a limiting case of Binomial distribution where the number of trials is very large and probability of success is very small.
Gaussian distribution – It is the most important continuous distribution, also known as the normal distribution which follows a symmetrical bell-shaped curve.
Uniform distribution – All possible outcomes of a uniform distribution are equally likely. For example, when you roll a fair die, each of the six outcomes is equally likely.
Exponential distribution – It follows an exponential function and is widely used in survival analysis, from the expected life of a machine to the expected life of a human.
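The statement that the Poisson distribution is a limiting case of the Binomial can be checked numerically. The sketch below assumes hypothetical values n = 1000 and p = 0.003 (many trials, small success probability) and compares the two probability mass functions; the function names are illustrative.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """P(X = k) for k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return exp(-lam) * lam**k / factorial(k)


n, p = 1000, 0.003  # hypothetical: large n, small p
lam = n * p         # matching Poisson mean

max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(11))

for k in range(6):
    print(k, f"{binomial_pmf(k, n, p):.4f}", f"{poisson_pmf(k, lam):.4f}")
print("largest gap for k = 0..10:", max_gap)
```

For these values the two sets of probabilities agree closely, which is why the Poisson form is often used as a convenient approximation in this regime.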

Question 21

What is point estimator and confidence interval? Which one is preferred out of these two?

Accepted Answer

There are two types of estimates: point estimates and confidence interval estimates. While a confidence interval estimate gives a range, a point estimate is a single number that indicates where you expect your population parameter to be. Since point estimates can be unreliable, confidence intervals describe reality far more accurately. The point estimate lies exactly at the centre of the confidence interval. As an illustration, stating that I spend 350 rupees per day on transportation is a point estimate, but stating that I spend between 300 and 400 rupees per day on transportation is a confidence interval estimate.

The estimators with the lowest bias and highest efficiency are the most accurate. Without surveying the full population, you can never be entirely confident. We want to be as precise as possible. Most of the time, a confidence interval will produce reliable results. A point estimate, however, will nearly always be inaccurate but is easier to comprehend and convey.

Question 22

What do you mean by margin of error?

Accepted Answer

The range that you anticipate the population parameter to fall inside is known as a confidence interval. The margin of error is what we add to and subtract from our estimate to create our confidence interval. For example, according to a poll, a particular candidate will likely win an election with 51% of the vote. The margin of error is 4%, and the degree of confidence is 95%. Let's assume that the survey was conducted again using the same methods. The pollsters would anticipate that 95% of the time, the results would be within 4% of the declared outcome. In other words, they would anticipate the outcomes to fall between 47% (51-4) and 55% (51+4). Margin of error can be calculated using either the standard deviation or the standard error.
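One common way to obtain such a margin for a proportion is the normal-approximation formula z × sqrt(p(1 − p)/n). The sketch below applies it to the 51% poll figure; the sample size of 600 respondents is an assumption chosen for illustration (the answer does not state it), and margin_of_error is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(p_hat, n, confidence=0.95):
    """Normal-approximation margin of error for a sample proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    return z * standard_error


p_hat = 0.51       # poll result quoted above
sample_size = 600  # assumed number of respondents, not given in the text

moe = margin_of_error(p_hat, sample_size)
low, high = p_hat - moe, p_hat + moe

print(f"margin of error: {moe:.1%}")
print(f"interval: {low:.1%} to {high:.1%}")
```

With this assumed sample size the margin comes out at roughly 4%, in line with the poll example.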

Question 23

What is regression? Give examples.

Accepted Answer

We might be interested in predicting the value of one variable given the value of other variables after we understand the link between two or more variables. The term "target" or "dependent" or "explained" refers to the variable that is predicted based on other variables, and "independent" or "predicting" refers to the other variables that aid in estimating the target variable. The prediction is based on an average association that regression analysis has statistically determined. The formula, whether linear or not, is known as the regression equation or the explanatory equation. Real numbers are used as the output or target values for regression operations.

Think about estimating the cost of a house, for instance. In this scenario, the house price serves as your target variable. Some potential independent variables that may aid in estimating this price are the area, the year the house was built, the number of bedrooms and bathrooms, the neighbourhood, etc. Other instances of regression include predicting retail sales based on the season or agricultural output based on rainfall.
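A simple linear regression with a single independent variable can be sketched with ordinary least squares. Everything in the example is hypothetical: the areas, the prices, their units and the helper fit_line are assumptions made only to show the mechanics.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical data: house area (independent) and price (target), arbitrary units.
area = [50, 70, 90, 110, 130]
price = [31, 39, 50, 61, 69]

slope, intercept = fit_line(area, price)
predicted = slope * 100 + intercept  # predicted price for an area of 100

print(f"price = {slope:.2f} * area + {intercept:.2f}")
print(f"predicted price for area 100: {predicted:.1f}")
```

The fitted equation plays the role of the regression equation described above, and a real model would normally include several independent variables rather than area alone.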

Question 24

What are the types of regression?

Accepted Answer

Regression analysis operates under three different categories:

Simple and Multiple – In a simple relationship only two variables are considered, for example, the influence of advertising expenditure on sales turnover. In a multiple relationship, more than two variables are involved: one variable is the dependent variable and the remaining variables are independent ones. For example, the turnover may depend on advertising expenditure and the income of the people.
Linear and Non-linear – Linear relationships are based on the equation of a straight line, in which no variable has a power higher than one, so they result in a straight trend line. Curved trend lines arise when the relationship is non-linear, for example when the equation has a parabolic form.
Total and Partial – All relevant factors are taken into account when analysing total relationships, which are typically made up of multiple associations. In a partial relationship, one or more factors are taken into account, but not all of them, which removes the influence of those not thought to be pertinent for a specified task.

Question 25

What is an outlier?

Accepted Answer

As statisticians, we are more interested in patterns and trends than in single points. Outliers are specific points that do not fit the pattern or trend that was discovered. These points may exist as a result of different measurement thresholds, extraordinary circumstances, or even experiment logging errors. Think about the collection of information about the height of pupils in a given grade. An outlier can be a data point that represents a measurement that was taken in a different unit or a pupil who is noticeably taller or shorter than their peers.

Question 26

What is the meaning of KPI? Why is it important?

Accepted Answer

Metrics are intended to evaluate business performance. For example, average sales per customer is a metric, a useful measure with business meaning. Comparative analysis makes great use of metrics. Key performance indicators, or KPIs, are a group of metrics that are in line with a certain business goal. The "key" reflects our primary business objective, and the performance indicators show how well we have performed over a given period of time. For instance, a KPI will identify the traffic generated just from users who have clicked on a link provided in our ad campaign, while a metric will describe the traffic to a page of our website from any sort of user.

Question 27

What is sampling bias? How to avoid it?

Accepted Answer

A sample is a subset of a population. Samples are drawn from a population and need to be a good representation of the actual population. For example, suppose we are collecting feedback about a university from a group of students to prepare a sample. We notice that there are students present in the cafeteria or the library from whom we can gather the feedback. But this feedback might come with a bias. To represent the population properly, the feedback should also include students who are attending lectures, or even bunking them. When the sample contains data points only from a specific subgroup of the population, it suffers from sampling bias. Drawing the sample at random from the whole population helps to avoid it.

Question 28

What is the Pareto principle?

Accepted Answer

According to the Pareto principle, 20% of causes account for about 80% of the consequences for most outcomes. According to this theory, there is an unbalanced link between inputs and outputs. The Pareto Principle states that most things in life are not distributed evenly, with some contributing more than others. This is an observation, and not a rule. For example, we can state that most of the revenue of an organisation comes from a handful of its overseas clients.

Question 29

What is the complement of an event?

Accepted Answer

The events "A occurs" and "A does not occur" are complementary to one another. An event and its complement are mutually exclusive. For instance, when rolling a die, getting an odd number is represented by 1, 3, 5, and getting an even number by 2, 4, 6. These two events cannot occur together and are complementary to one another.

Question 30

In a simultaneous throw of a pair of dice, find the probability of getting a total more than 7.

Accepted Answer

There will be a 5/12 chance. When we throw two dice, there are a total of 36 potential outcomes. There are 15 scenarios out of these 36 possible outcomes where the sum is more than 7. The result is 15/36 or 5/12 when the number of favourable outcomes is divided by the total outcomes.
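The count of favourable outcomes can be verified by brute-force enumeration. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of throwing two six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes whose total is more than 7.
favourable = [roll for roll in outcomes if sum(roll) > 7]

probability = Fraction(len(favourable), len(outcomes))
print(len(outcomes), len(favourable), probability)  # 36 15 5/12
```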

Question 31

What is normal distribution? Why do we need it?

Accepted Answer

Normal distribution is a symmetrical bell-shaped curve representing frequencies of different classes from the data. Some of the characteristics of normal distribution include:

The mean, median and mode of the distribution coincide.
The curve of the distribution is bell-shaped and symmetrical about the line x = mean value. This means that exactly half of the values are to the left of the centre and the other half to the right.
The total area under the curve is 1.
It is a limiting form of the binomial distribution where the number of trials is indefinitely large (infinity) and neither the probability of success nor the probability of failure is indefinitely small.

Normal distribution is one of the most significant probability distributions in the study of statistics. This is so because a number of natural events fit the normal distribution. For instance, the normal distribution is approximately observed for heights and weights of an age group, test scores, blood pressure, and the totals obtained from many die rolls or coin tosses. The normal distribution provides a good approximation when the sample size is large.

Question 32

Explain the impact of mean and standard deviation on the normal distribution.

Accepted Answer

The distribution moves to either side of the horizontal axis if we adjust the mean while maintaining the same standard deviation. The graph is shifted to the right by a higher mean value and to the left by a lower mean value.

The graph reshapes when the standard deviation changes while the mean remains constant. When the standard deviation is lower, more data are concentrated in the centre and the tails are thinner. A larger standard deviation flattens the graph, with fewer points in the middle and more points at the ends, i.e., fatter tails.

Question 33

Explain outliers and their impact on your data.

Accepted Answer

An outlier is an observation which is well separated from the rest of the data. The interpretation of an outlier takes into account the purported underlying distribution. Outliers can be dealt with primarily in two ways: first, by adopting techniques that can handle the existence of outliers in the sample, and second, by attempting to remove the outliers. We know that outliers have a significant impact on our estimation. These observations pull predictions away from the pattern followed by the rest of the sample or population. The removal of an outlier from our sample is frequently not the best option, therefore we either employ techniques to mitigate their negative effects or use estimators that are insensitive to outliers.

Question 34

What is skewness? Explain the different types of skewness.

Accepted Answer

Skewness is a measure of asymmetry that indicates whether the data is concentrated on one side. It helps us understand the shape of the distribution of the data. Skewness is classified into three types.

Positive skewness or right skew

Outliers at the top end of the range of values cause positive skewness. Extremely high numbers will cause the graph to skew to the right, showing that there are outliers present. In this instance the higher numbers pull the mean above the median.

No skewness or zero skew

This is the case in which skewness is not present. It denotes a distribution that is symmetric around the mean. As a result, the three values, mean, median, and mode, all coincide.

Negative skewness or left skew

Outliers near the lower end of the values cause negative skewness. Extremely low numbers will cause the graph to skew to the left, indicating that there are outliers present. In this instance, the mean is smaller than the median because the lower values pull the mean below the central value.

Question 35

Define the different central moments.

Accepted Answer

In probability theory and statistics, a central moment is a moment of a probability distribution of a random variable about the random variable's mean.

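The zeroth central moment is the total probability, i.e., equal to one.
The first central moment is the expected deviation from the mean and is always equal to zero.
The second central moment is the variance.
The third central moment, standardised by the cube of the standard deviation, gives the skewness.
The fourth central moment, standardised by the fourth power of the standard deviation, gives the kurtosis.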
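The central moments can be computed directly from their definition. The sketch below uses a small hypothetical dataset and follows one common convention in which skewness and kurtosis are the third and fourth central moments divided by the matching power of the standard deviation; the names are assumptions for illustration.

```python
from statistics import mean


def central_moment(values, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mu = mean(values)
    return sum((x - mu) ** k for x in values) / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations with mean 5

m0 = central_moment(data, 0)     # always 1
m1 = central_moment(data, 1)     # always 0
variance = central_moment(data, 2)

# Standardise the third and fourth central moments by powers of the
# standard deviation (variance ** 0.5).
skewness = central_moment(data, 3) / variance ** 1.5
kurtosis = central_moment(data, 4) / variance ** 2

print(m0, m1, variance, skewness, kurtosis)
```

The positive skewness here reflects the longer tail towards the larger values (7 and 9).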

Question 36

Mention the different visualization graphs that you will use to understand numerical variables.

Accepted Answer

For univariate analysis of a numerical variable, the must-use visualizations are the histogram and the box and whisker plot (or box plot). Scatter plots are used to analyse the relationship between two numerical variables.

Histograms

A histogram is a graphic representation of the distribution of data that has been grouped into classes. It is a type of frequency chart that is made up of a number of rectangles. Each piece of data is sorted, then each value is assigned to the proper class interval. The frequency of each class interval is determined by the number of data values that fall within it. A specific class of data is represented by each rectangle in the histogram. The width of the rectangle represents the width of the class. It is commonly used to determine

Box and whisker plot (box plot)

A box plot shows the maximum and minimum values, the first and third quartiles, and the median value, which is a measure of central tendency. In addition to these quantities, it also explains the symmetry and variability of the data distribution. Outliers in the dataset are frequently visualised using this visualization.

Scatter plot

The scatterplot is a very helpful and effective tool that is frequently used in regression analysis. A pair of observed values for the dependent and independent variables are represented by each point. Before selecting a suitable model, it enables graphically determining whether a relationship between two variables exists. These scatterplots are also very helpful for residual analysis because they let you check whether the model is a good fit or not.

Question 37

How would you measure the relationship between variables and the strength of the relationship?

Accepted Answer

Covariance and the correlation coefficient reveal the relationship between two variables and the strength of that relationship.

Covariance is a measure of how two random variables in a data set change jointly. When two variables move in the same direction, this is referred to as positive covariance. A negative covariance denotes an inverse relationship between the variables, or movement in opposite directions. For instance, a student's performance on a particular examination improving with increased attendance is a positive relationship, whereas a decrease in demand caused by a rise in the price of an item is a negative relationship. When the covariance value is zero, there is no linear relationship between the variables; this does not by itself guarantee that they are independent. If the covariance value is higher than 0, the variables are positively correlated and move in the same direction. When the covariance has a negative value, the variables are negatively correlated and move in opposite directions.

Covariance Value	Effect on Variables
Cov (X, Y) > 0	Positive Correlation (X & Y variables move together)
Cov (X, Y) = 0	No Correlation (no linear relationship between X & Y)
Cov (X, Y) < 0	Negative Correlation (X & Y variables move in opposite direction)

The correlation coefficient gives similar information to the covariance. Its benefit over covariance is that it always takes a value between negative one and one. A perfect positive correlation exists between the variables under study when the correlation coefficient is 1. In other words, as one moves, the other follows suit proportionally in the same direction. A less than perfect positive correlation is present if the correlation coefficient is less than one but still larger than zero. The correlation between the two variables is stronger as the correlation coefficient approaches one. There is no observable linear relationship between the variables when the correlation coefficient is zero. That means it is difficult to predict the movement of one variable from the movement of the other. The variables are perfectly negatively, or inversely, correlated if the correlation coefficient is negative one. One variable will drop proportionally in response to an increase in the other, so the variables move in opposing directions. If the correlation coefficient is between negative one and zero, the negative correlation is not perfect. The negative correlation becomes stronger as the coefficient gets closer to negative one.
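The practical difference between the two measures shows up when the units change. The sketch below uses hypothetical price and demand figures and population-style formulas (dividing by n); the function names are assumptions. Rescaling the price multiplies the covariance but leaves the correlation coefficient unchanged.

```python
from statistics import mean


def covariance(xs, ys):
    """Population covariance: the average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Covariance rescaled by both standard deviations, so it lies in [-1, 1]."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5


# Hypothetical data: demand falls as price rises.
price = [10, 12, 14, 16, 18]
demand = [100, 90, 85, 70, 60]

# The same prices expressed in a unit 100 times smaller.
price_rescaled = [p * 100 for p in price]

cov = covariance(price, demand)
cov_rescaled = covariance(price_rescaled, demand)
corr = correlation(price, demand)
corr_rescaled = correlation(price_rescaled, demand)

print(cov, cov_rescaled)     # negative, and 100 times larger after rescaling
print(corr, corr_rescaled)   # the same value, close to -1
```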


Question 38

Explain the purpose of central limit theorem.

Accepted Answer

Think about a sample taken from a population, and the mean of that sample. If we take another sample from the same population, we may well obtain a different mean. Let's say you gathered ten distinct samples. You'll observe that each sample mean depends on which members happen to be in that sample. Hence, relying on a single sample mean is not the best course of action.

The means of the collected samples form a new dataset of their own, and these values follow a certain distribution. The distribution of a statistic across repeated samples is called a "sampling distribution"; in this instance we are dealing with the sampling distribution of the mean. The individual values differ when we look at them closely, but they are centred on one particular value.

Every sample mean in this analysis approximates the population mean, and the value they centre on gives a very good indication of it. In fact, we expect a fairly accurate approximation of the population mean if we take the average of those sample means. When we visualise the distribution of the sample means we see an approximately normal shape, and the Central Limit Theorem confirms this: for a sufficiently large sample size, the sampling distribution of the mean resembles a normal distribution regardless of the underlying population distribution, whether it be binomial, exponential, or another type with finite variance.

As a result, even when the population is not normally distributed, the central limit theorem lets us conduct tests, solve problems, and draw conclusions about the mean using the normal distribution.
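A small simulation can make this concrete. The sketch below assumes an exponential population with mean 1, which is strongly skewed, and draws repeated samples from it; the sample size of 50 and the 2000 repetitions are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential population with mean 1."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

sample_size = 50      # observations per sample (assumed)
num_samples = 2000    # number of repeated samples (assumed)

means = [sample_mean(sample_size) for _ in range(num_samples)]

centre = statistics.fmean(means)   # expected to be close to the population mean, 1.0
spread = statistics.stdev(means)   # expected to be close to 1 / sqrt(50)

print(round(centre, 3), round(spread, 3))
```

The sample means cluster around the population mean, and their spread is close to the population standard deviation divided by the square root of the sample size, as the theorem suggests. Plotting a histogram of `means` would show the roughly bell-shaped curve.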

Question 39

How do you measure the performance of a classification model?

Accepted Answer

There are various performance measures or metrics that can help to evaluate the performance of a classification model, and the right choice depends on the kind of problem we are dealing with. At times, accuracy is not a good basis for evaluation, for example when the classes are imbalanced, and we need to focus on particular aspects of the results rather than overall accuracy. The most common metrics used for the purpose are:

Confusion matrix

A confusion matrix is one of the evaluation techniques for machine learning models in which you compare the results of all the predicted and actual values. Confusion matrix helps us to derive several different metrics for evaluation purpose such as accuracy, precision, recall, and F1 score which are widely used across different classification use cases.

ROC AUC curve

The Receiver Operating Characteristic (ROC) curve is a probability curve that plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different threshold values, separating the signal from the noise. The Area Under the Curve (AUC) measures a classifier's capacity to distinguish between classes. A higher AUC means the model separates the positive and negative classes better across thresholds. When AUC is equal to 1, the classifier can correctly discriminate between all positive and negative class points. When AUC is equal to 0.5, the classifier does no better than random guessing. When AUC is equal to 0, the classifier is predicting all negatives as positives and vice versa.

Jaccard index

The Jaccard index is also known as the Jaccard similarity coefficient. If y is the set of actual labels and ŷ is the set of predicted labels, the Jaccard index is the size of their intersection divided by the size of their union.

Suppose you have a total of 50 observations, of which your model predicts 41 correctly. The Jaccard index is then 41 / (50 + 50 - 41) = 41 / 59 ≈ 0.69. Note that this is not the same as accuracy, which here would be 41 / 50 = 82%; the Jaccard index is lower because the mismatches are counted in the union. The Jaccard index ranges from 0 to 1, where a value of 1 means the predicted and actual labels match exactly.
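The calculation above can be reproduced in a few lines. This is a minimal sketch that follows the counting used in the example (matches divided by the size of both label sets minus the matches); the function name `jaccard_index` and the label values are assumptions for illustration.

```python
def jaccard_index(actual, predicted):
    """Matches / (|actual| + |predicted| - matches), as in the worked example."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / (len(actual) + len(predicted) - matches)

# 50 hypothetical labels, of which the model gets 41 right
actual = [1] * 50
predicted = [1] * 41 + [0] * 9

score = jaccard_index(actual, predicted)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

print(round(score, 2), accuracy)
```

Comparing `score` with `accuracy` shows that the two metrics differ on the same predictions, so a Jaccard index should not be read as a percentage of correct predictions.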

Log loss

Log loss, or logarithmic loss, measures the performance of a classifier whose predicted output is a probability value between 0 and 1. It is calculated from the log loss equation, which measures how far each predicted probability is from the actual label and penalises confident wrong predictions heavily. An ideal classifier has a log loss of 0, so of two classifiers the one with the lower log loss makes the better probability estimates.
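As an illustration, here is a minimal sketch of log loss for a binary problem, assuming the usual binary cross-entropy form of the equation. The function name `log_loss`, the clipping constant `eps`, and the three sets of predicted probabilities are all made up for this example.

```python
import math

def log_loss(actual, predicted_probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(actual, predicted_probs):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(actual)

labels = [1, 0, 1, 1]
confident_and_right = [0.9, 0.1, 0.8, 0.95]
hesitant = [0.6, 0.4, 0.55, 0.6]
confident_and_wrong = [0.1, 0.9, 0.2, 0.05]

for name, probs in [("confident and right", confident_and_right),
                    ("hesitant", hesitant),
                    ("confident and wrong", confident_and_wrong)]:
    print(name, round(log_loss(labels, probs), 3))
```

All three sets of predictions rank the classes differently in confidence, and the log loss rises sharply for the confident but wrong predictions.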

Question 40

What is a confusion matrix and how do you interpret it? Explain type 1 and type 2 error.

Accepted Answer

Confusion matrix is one of the evaluation methods for machine learning models that compares the outcomes of all the expected and actual values.

The confusion matrix for a binary classifier has four different cases, illustrated here with example counts:

There are five instances where the predicted value and the actual value are both true. This is referred to as a True Positive case, where True denotes that the values agree (true and true) and Positive denotes that the predicted value is positive. Example: A diabetes test is positive for a diabetic patient.
There are four instances where both the predicted value and the actual value are false. This is referred to as a True Negative case, where True denotes that the values agree (false and false) and Negative denotes that the predicted value is negative. Example: A diabetes test is negative for a non-diabetic patient.
There are three instances where the predicted value is true but the actual value is false. This is referred to as a False Positive case, where False denotes that the values differ (true and false) and Positive denotes that the predicted value is positive. Example: A diabetes test is positive for a non-diabetic patient.
There are two instances where the predicted value is false but the actual value is true. This is referred to as a False Negative case, where False denotes that the values differ (false and true) and Negative denotes that the predicted value is negative. Example: A diabetes test is negative for a diabetic patient.

In this matrix, the True Positive and True Negative cells hold the values correctly identified by the model, and the False Positive and False Negative cells hold the values wrongly identified by the model. A confusion matrix can also be used for non-binary target variables.

A Type 1 Error, also known as a False Positive, occurs when the predicted value is positive but the actual value is negative. A Type 2 Error, also known as a False Negative, occurs when the predicted value is negative but the actual value is positive. For instance, if we consider rain to be a positive event, then your device predicting rain today when it does not rain is a type 1 error, while your device predicting no rain today when it does rain is a type 2 error.
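Using the counts from the four cases above (5 true positives, 4 true negatives, 3 false positives, 2 false negatives), the common metrics can be derived directly. The variable names in this sketch are chosen for illustration only.

```python
# Counts taken from the four cases described above
tp, tn, fp, fn = 5, 4, 3, 2

total = tp + tn + fp + fn
accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted positives that are real
recall = tp / (tp + fn)               # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

type_1_error_rate = fp / (fp + tn)    # false positive rate
type_2_error_rate = fn / (fn + tp)    # false negative rate

print(accuracy, precision, recall, f1)
print(type_1_error_rate, type_2_error_rate)
```

Here the type 1 error rate is the fraction of actual negatives flagged as positive, and the type 2 error rate is the fraction of actual positives that were missed.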

Question 41

How do you measure the performance of a regression model?

Accepted Answer

The performance of a regression model is evaluated based on how close the predictions are to the actual values. Three metrics are widely used to evaluate regression tasks: Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). All three measure the distance between the predictions and the actual values. MAPE is generally used in tasks where outliers are present, such as time series data, because it does not square the errors. MSE and RMSE are used when outliers are absent or relatively few. RMSE is generally preferred over MSE because it is expressed in the same unit as the target variable, which makes it easier to interpret.

Question 42

What are the R2 score and the adjusted R2 score? Which one is preferred?

Accepted Answer

The R2 score, often known as the R-squared value or coefficient of determination, indicates how much of the overall variability is explained by the regression. For a least-squares model with an intercept, it ranges from 0 to 1 on the data used for fitting. An R2 score of 0 indicates that the regression model explains none of the variability in the data. An R2 score of 1 indicates that the model accounts for all of the variability, which is rare in practice. What counts as a good R2 score depends on the complexity of the problem and the number of variables used, but in general a higher score is better. The drawback of the R2 score is that it never decreases when a variable is added: its value stays the same or rises each time we include a new variable in the regression model, giving the impression that the more variables we include, the better the model. This isn't always the case, because the additional variable might contribute very little. To take care of this, the adjusted R2 score is preferred, as it penalises the model for each additional variable. The adjusted score rises only when a new variable improves the fit by more than would be expected by chance, so insignificant variables lower it.
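The penalty can be seen numerically. The sketch below assumes the standard adjusted R2 formula, 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors; the R2 values of 0.80 and 0.805 are invented to mimic a model that gains seven weak variables.

```python
def r2_score(actual, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R2: shrinks R2 according to the number of predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical scenario: 50 observations, R2 barely improves
# when the model grows from 3 to 10 predictors.
small_model = adjusted_r2(0.80, n_samples=50, n_predictors=3)
large_model = adjusted_r2(0.805, n_samples=50, n_predictors=10)

print(round(small_model, 3), round(large_model, 3))
```

Although the larger model has the higher raw R2, its adjusted R2 is lower, which is the behaviour that makes the adjusted score the safer one for comparing models of different sizes.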

Question 43

What is F1 score?

Accepted Answer

The F1 score, also known as the F-score or F-measure, is a single number that summarises how accurate a classifier is on the positive class. In a perfect scenario, both the precision and the recall would be high. In practice there is usually a trade-off between the two, and improving one tends to lower the other. The F1 score combines precision and recall into a single metric that we can use to compare two models. In terms of the formula, the F1 score is the harmonic mean of precision and recall, given by:

$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

The value of F1 score ranges between 0 and 1. An F1 score of 1 is regarded as ideal, whereas a score of 0 indicates that the model is a complete failure.

Question 44

Explain SST, SSE and SSR.

Accepted Answer

The sum of the square differences between the observed dependent variable and its mean is known as the sum of square total (SST). It is a measurement of the dataset's overall variability.

The sum of the squared differences between the predicted values and the dependent variable's mean is known as the sum of squares due to regression (SSR). It measures how much of the variability is explained by the regression line. If this value is the same as the SST, our regression model captures all of the observed variability.

The sum of the squared differences between the actual values and the predicted values is known as the sum of squared errors (SSE). Usually, we wish to reduce this error: the smaller the error, the greater the regression's estimating power.

The overall variability of the data set provided by SST is equal to the sum of the variability described by the regression line, or SSR, and the unexplained variability, or SSE.
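The identity SST = SSR + SSE can be verified on a small example. The sketch below fits a simple least-squares line by hand; the five data points are hypothetical, and the identity holds here because the line is an ordinary least-squares fit with an intercept.

```python
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical observations

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least-squares slope and intercept for one predictor
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * x for x in xs]

sst = sum((y - mean_y) ** 2 for y in ys)                 # total variability
ssr = sum((f - mean_y) ** 2 for f in fitted)             # explained by the line
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))      # left unexplained

r_squared = ssr / sst

print(round(sst, 4), round(ssr + sse, 4), round(r_squared, 4))
```

The ratio SSR / SST is the R2 score, which ties this decomposition back to the explained share of variability.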

Question 45

What is MAPE, MSE, RMSE?

Accepted Answer

A regression model can be assessed using the MAPE, MSE, and RMSE metrics. The Mean Absolute Percentage Error (MAPE) is a measure of forecasting accuracy: it is the average of the absolute errors expressed as a percentage of the actual values. A closely related metric, the Mean Absolute Error (MAE), also called the average absolute deviation, averages the absolute errors without scaling them. The Mean Squared Error (MSE) is the average of the squared errors, and the Root Mean Squared Error (RMSE) is its square root. Both MAE and RMSE measure the distance between two vectors, the vector of predictions and the vector of target values: the MAE corresponds to the l1 norm and the RMSE to the l2 norm. Because squaring magnifies large errors, the absolute-error metrics are usually preferred when there are several outliers, whereas the RMSE works very well and is typically preferred when outliers are exceedingly rare, as in a bell-shaped distribution.
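The metrics are simple enough to compute by hand. Below is a minimal sketch with hypothetical actual and predicted values; the function names are chosen for illustration, and `mape` assumes that no actual value is zero.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error, in the same unit as the target."""
    return math.sqrt(mse(actual, predicted))

actual = [100, 200, 300, 400]      # hypothetical target values
predicted = [110, 190, 330, 360]   # hypothetical predictions

print(mae(actual, predicted), mape(actual, predicted),
      mse(actual, predicted), round(rmse(actual, predicted), 2))
```

The RMSE comes out larger than the MAE on the same errors because the two largest misses dominate once they are squared.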

Question 46

Mention some of the common distance measurements.

Accepted Answer

Distance is a measurement of how far apart two objects are. A distance is therefore a non-negative real number: it is zero if both objects are the same and positive otherwise. The distance measures most frequently used are the Euclidean, Manhattan, Minkowski, and Hamming distances. The Manhattan distance and the Euclidean distance are both special cases of the Minkowski distance, which generalises them through a parameter p. The Minkowski distance for p = 1 is the Manhattan distance, also known as the L1 norm or absolute distance. For p = 2 it is the Euclidean distance, also known as the L2 norm. The Hamming distance counts the positions at which two sequences of equal length differ.
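The following sketch shows how a single Minkowski function covers both the Manhattan and Euclidean cases, alongside a simple Hamming distance. The points and strings are arbitrary examples, and the function names are assumptions.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

point_a = (1, 2, 3)
point_b = (4, 6, 3)

manhattan = minkowski(point_a, point_b, 1)   # |1-4| + |2-6| + |3-3|
euclidean = minkowski(point_a, point_b, 2)   # sqrt(3^2 + 4^2 + 0^2)
differing = hamming("karolin", "kathrin")

print(manhattan, euclidean, differing)
```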

Question 47

How do you check for outliers in a dataset?

Accepted Answer

One common way to identify outliers is through the inner limits, or fences, derived from the interquartile range. The interquartile range spans the middle 50% of the observations, from the first quartile to the third quartile, and the median lies within it. To judge whether the minimum or maximum values at the extremes are outliers, we define cut-off values using the formula:

Lower limit = First quartile – (1.5 x interquartile range)
Upper limit = Third quartile + (1.5 x interquartile range)

The most extreme observations that still fall inside these two limits are known as adjacent points, and they mark the ends of the whiskers in a box plot. Any observations outside the interval between the lower limit and the upper limit can be termed outliers in the dataset.
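A minimal sketch of the interquartile range rule follows, using the usual convention of subtracting 1.5 x IQR from the first quartile and adding 1.5 x IQR to the third quartile. The data are hypothetical, and since quartile conventions differ slightly between tools, the exact limits may vary; here Python's `statistics.quantiles` with the "inclusive" method is assumed.

```python
import statistics

def iqr_limits(values):
    """Lower and upper cut-offs based on 1.5 times the interquartile range."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]   # hypothetical observations

lower, upper = iqr_limits(data)
outliers = [v for v in data if v < lower or v > upper]

print(lower, upper, outliers)
```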

Question 48

Explain precision and recall.

Accepted Answer

Precision is the ratio of the correctly identified positive cases to all the cases predicted as positive. The predicted positives are the cases predicted positive irrespective of whether the actual value is positive or negative, that is, the true positives and the false positives. This ratio tells us, out of all the cases we predicted as positive, how many are actually positive.

$\text{Precision} = \frac{TP}{TP + FP}$

Recall is the ratio of the correctly identified positive cases to all the actual positive cases. The actual positives may be predicted as positive or negative, that is, they are the true positives and the false negatives. This ratio tells us, out of all the actual positive cases, how many we predicted correctly.

$\text{Recall} = \frac{TP}{TP + FN}$

Question 49

Explain the OLS assumptions for a linear regression.

Accepted Answer

The OLS assumptions for a linear regression are divided into five assumptions:

Linearity

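The regression assumes that the relationship between the predictors and the target is linear. If the true relationship involves higher-degree terms of the variables, a straight-line fit will not produce good predictions unless those terms are added as predictors or the variables are transformed.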

No endogeneity

The issue of endogeneity arises when a variable that is related to both the target and the predictors is left out of the model. Its effect is absorbed by the error term, so endogeneity is a situation in which a predictor in a linear regression model is correlated with the error term. We call such predictors endogenous variables.

Normality and homoscedasticity

This assumes that the error term is normally distributed, and the expected value of error is 0, meaning that we expect to have no error on average. Homoscedasticity assumes that the variance is constant for the error term.

No autocorrelation

This assumes that the covariance between any two error terms is zero, that is, the errors are not correlated with one another.

No multicollinearity

When two or more variables in our regression are strongly correlated, this situation is referred to as multicollinearity. The OLS assumptions assume that there are no strongly correlated variables in our analysis.

Question 50

How do you convert a normal distribution to standard normal distribution?

Accepted Answer

Any distribution can be standardised. Standardisation transforms a variable into one with a mean of zero and a standard deviation of one.

When a normal distribution is standardised, the result is known as the standard normal distribution, which is represented by the letter Z. The standardised value of an observation is called its Z-score. To compute it, we first determine the variable's mean and standard deviation; the mean is then subtracted from each observed value, and the result is divided by the standard deviation:

$Z = \frac{X - \mu}{\sigma}$
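The steps can be sketched in a few lines of Python. The heights below are hypothetical, and the population standard deviation (`statistics.pstdev`) is assumed so that the standardised values have a standard deviation of exactly one.

```python
import statistics

heights = [150, 160, 170, 180, 190]   # hypothetical observations

mu = statistics.fmean(heights)        # mean of the variable
sigma = statistics.pstdev(heights)    # population standard deviation

# Subtract the mean, then divide by the standard deviation
z_scores = [(h - mu) / sigma for h in heights]

print([round(z, 3) for z in z_scores])
print(statistics.fmean(z_scores), statistics.pstdev(z_scores))
```

After the transformation the values have a mean of zero and a standard deviation of one, while their relative ordering and spacing are preserved.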

Question 51

What do you mean by exploratory data analysis?

Accepted Answer

Exploratory data analysis is a method of data analysis in which the properties and qualities of the data are examined without making any attempt to fit the data to a specific model. This method's guiding principle is to analyse the data before using a particular model. We use numerical and graphical processes in exploratory data analysis to better comprehend the data. The emphasis on visual data representation has been a key component in the development of exploratory data analysis.

Question 52

Mention some of the methods to convert a skewed data to approximate normal distribution.

Accepted Answer

Skewed data refers to data whose distribution is asymmetric, with a long tail on one side that often contains outliers. Outliers can have a negative influence on the model's predictions. However, it is not always advisable to remove them, so they are often handled through transformations instead. The common transformations applied to the data are the log transformation, the square root transformation, and the Box-Cox transformation.

Log transformation

The logarithmic transformation is one of the most helpful and popular transformations. For right-skewed, strictly positive data, it can be a good idea to replace the dependent variable with its logarithm before doing a linear regression. Such an operation tends to stabilise the target variable's variance and bring the transformed variable's distribution closer to normal.

Square root transformation

If there are outlier values that are exceptionally large, you might consider using the square root transformation, which scales them down to much lower values in comparison. A limitation of this transformation is that the square root of a negative number is not a real number, so it applies only to non-negative data.

Box-Cox transformation

The Box-Cox transformation is yet another method that can help to transform skewed data towards normality. It has a controlling parameter, lambda, which is typically searched over the range -5 to 5; a lambda of 0 corresponds to the log transformation. The original transformation applies only to positive values, but extensions such as the Yeo-Johnson transformation handle zero and negative values as well.
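The effect of these transformations on skewness can be checked with a small experiment. The sketch below uses a hand-written Box-Cox function for strictly positive values (lambda = 0 gives the log transform) and a simple population skewness measure; the income figures are invented, and in practice lambda would usually be estimated from the data rather than fixed by hand.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed standardised values."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean(((v - mu) / sigma) ** 3 for v in values)

def box_cox(values, lam):
    """Box-Cox transform for strictly positive values."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]   # hypothetical right-skewed data

raw_skew = skewness(incomes)
sqrt_skew = skewness([math.sqrt(v) for v in incomes])
log_skew = skewness(box_cox(incomes, 0))   # lambda = 0 is the log transform

print(round(raw_skew, 2), round(sqrt_skew, 2), round(log_skew, 2))
```

On this data the square root reduces the skewness somewhat and the logarithm reduces it further, though neither makes such a small, heavy-tailed sample exactly normal.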

Question 53

What do you mean by conditional probability?

Accepted Answer

Conditional probability is the likelihood of an event given that another event is known to have occurred. It is written as P(A|B), read as "the probability of A given that B has already occurred", and is calculated as P(A|B) = P(A and B) / P(B), provided P(B) is greater than zero. Conditional probabilities are helpful when determining the likelihood of an event when a piece of information is already known, either entirely or partially. Examples include determining the probability of getting the number 5 on the second throw of a die given that we have already got the number 6 on the first throw, and the probability of drawing a red ball given that the first two balls drawn were blue and green.

Question 54

A bag contains 4 white, 5 red, and 6 blue balls. Three balls are drawn at random from the bag. The probability that all of them are red is?

Accepted Answer

In this case, we have a total of 15 balls. The sample space is the number of ways in which we can draw three balls from the bag of 15 balls. This can be done in 15C3 ways, which is equal to 455. The event of getting three red balls means drawing three red balls out of the 5 red balls present in the bag, which can be done in 5C3 ways, equal to 10. The final result is the number of ways of drawing three red balls divided by the total number of ways, that is, 10/455 or 2/91.
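The arithmetic can be confirmed with Python's standard library. The sketch also cross-checks the answer by multiplying the probabilities of drawing a red ball on each successive draw without replacement; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

total_ways = comb(15, 3)   # ways to choose any 3 balls out of 15
red_ways = comb(5, 3)      # ways to choose 3 balls out of the 5 red ones

probability = Fraction(red_ways, total_ways)

# Cross-check: draw red, then red, then red, without replacement
sequential = Fraction(5, 15) * Fraction(4, 14) * Fraction(3, 13)

print(total_ways, red_ways, probability, sequential)
```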

Question 55

What is the difference between Spearman Rank correlation and Pearson correlation?

Accepted Answer

The Pearson correlation, commonly known as the simple correlation coefficient, gauges the strength of the linear relationship between two random variables. Its values lie in the range [-1, 1]. The two extreme values of this interval represent a perfectly negative and a perfectly positive linear relationship between the variables, and the value zero indicates that there is no linear relationship. The Spearman rank correlation coefficient is a non-parametric measure of correlation that is computed on the ranks of the values rather than the values themselves. It is also used to establish the relationship that exists between two sets of data. Unlike the Pearson coefficient, which measures only linear relationships, the Spearman rank correlation also captures monotonic relationships that are non-linear.
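The difference shows up clearly on data that is monotonic but not linear. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks; the helper names are assumptions, the exponential data is made up, and the simple ranking function assumes there are no tied values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks from 1 to n; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [math.exp(x) for x in xs]   # monotonic but strongly non-linear

print(round(pearson(xs, ys), 3), spearman(xs, ys))
```

Because y always increases with x, the ranks agree perfectly and the Spearman coefficient is 1, while the Pearson coefficient is noticeably lower since the points do not lie on a straight line.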

Question 56

Explain hypothesis. What do you understand by null and alternate hypothesis?

Accepted Answer

A hypothesis is a claim about a population that can be tested with data. The process of hypothesis testing enables us either to fail to reject the null hypothesis, which serves as the starting point for our investigation, or to reject it in favour of the alternative hypothesis. A parametric test is a type of hypothesis test that assumes a specific form for the distribution of the underlying populations. In a non-parametric test, the parametric form of the underlying population's distribution does not need to be specified. The null hypothesis is the one that is tested while conducting hypothesis testing, and the alternate hypothesis is the opposing claim. If the test results show sufficient evidence against the null hypothesis, it is rejected and the alternative hypothesis is adopted. For example, if the null hypothesis states that "The mean height of men in India is at least 5 feet 6 inches", then the alternate hypothesis will state that "The mean height of men in India is less than 5 feet 6 inches".

Question 57

Explain rejection region and significance level in hypothesis theory.

Accepted Answer

The rejection region is the set of values of the test statistic that causes the null hypothesis to be rejected in a hypothesis test; it is defined on the sampling distribution of the statistic under examination. The rejection region is the complement of the acceptance region and is associated with a probability alpha, known as the test's significance level, which is the probability of a type I error. It is a parameter fixed by the user before the test and sets the probability of rejecting the null hypothesis when it is actually true.

Question 58

What is one-tailed test and two-tailed test? Explain with the help of an example.

Accepted Answer

A one-sided or one-tailed test on a population parameter is a hypothesis test in which the values for which we can reject the null hypothesis are located exclusively in one tail of the probability distribution. For instance, if "The mean height of men in India is at least 5 feet 6 inches" is the null hypothesis, then the alternative hypothesis would be "The mean height of men in India is less than 5 feet 6 inches." This is a one-sided test because the alternate hypothesis, i.e., less than 5 feet 6 inches, only considers one end of the distribution.

A two-sided or two-tailed test on a population parameter is a hypothesis test of the null hypothesis that the parameter equals a given value against the alternative hypothesis that the parameter is not equal to that value. If the null hypothesis is, for instance, "The mean height of men in India is equal to 5 feet 6 inches," then the alternative hypothesis would be, "The mean height of men in India is either less than or greater than 5 feet 6 inches." The alternate hypothesis deals with both extremes of the distribution, making this a two-tailed test.

Question 59

What is p-value? Explain its significance.

Accepted Answer

The p-value is a probability calculated under the assumption that the null hypothesis is true: it is the probability of observing a result at least as extreme as the one actually obtained. Suppose we are trying to reject the null hypothesis at a certain significance level, alpha. If we are not able to reject the null hypothesis at this significance level, a larger significance level might allow us to reject it. The p-value is the smallest value of the significance level alpha for which we can reject the null hypothesis. If the p-value is smaller than alpha, we reject the null hypothesis; otherwise we fail to reject it.
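As an illustration, the sketch below computes the p-value of a z-test for a mean, assuming the population standard deviation is known. The function name `z_test_p_value` and all the numbers (heights in inches, with 66 inches standing for 5 feet 6 inches) are hypothetical.

```python
from statistics import NormalDist

def z_test_p_value(sample_mean, null_mean, sigma, n, tail="two-sided"):
    """p-value of a z-test for a mean with known population standard deviation."""
    z = (sample_mean - null_mean) / (sigma / n ** 0.5)
    cdf = NormalDist().cdf(z)
    if tail == "less":
        return cdf
    if tail == "greater":
        return 1 - cdf
    return 2 * min(cdf, 1 - cdf)

alpha = 0.05   # significance level chosen before the test

# Hypothetical sample: 100 men, mean height 65.2 inches, sigma 3 inches
p_less = z_test_p_value(65.2, 66, sigma=3, n=100, tail="less")
p_two_sided = z_test_p_value(65.2, 66, sigma=3, n=100)

print(round(p_less, 4), round(p_two_sided, 4), p_less < alpha)
```

With these made-up numbers the p-value falls below alpha, so the null hypothesis would be rejected at the 5% level; the two-sided p-value is twice the one-sided one because both tails are counted.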

Question 60

How do you calculate the confidence interval for a population mean with known and unknown variance?

Accepted Answer

A confidence interval for a population mean is calculated as the sample mean plus or minus a critical value multiplied by the standard error of the mean. If the population variance is known, or the sample size is large, the critical value comes from the z-distribution and the interval is $\bar{x} \pm z \frac{\sigma}{\sqrt{n}}$. If the population variance is unknown (and must therefore be estimated from the sample itself) and the sample size is small (n < 30), the Student's t-distribution is more appropriate and the interval is $\bar{x} \pm t \frac{s}{\sqrt{n}}$, where s is the sample standard deviation and t has n - 1 degrees of freedom. The shape of the t-distribution depends on the sample size, and it approaches the z-distribution as the sample size increases. The t-table becomes nearly identical to the z-table after the 30th row, that is, after 30 degrees of freedom. Therefore, for large samples we may still apply the z-distribution even though the population variance is unknown.
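Both cases can be sketched as follows. The sample values and the known standard deviation of 1.0 are hypothetical. Because Python's standard library has no t-distribution, the t critical value is passed in by hand; 2.262 is the standard table value for a 95% interval with 9 degrees of freedom.

```python
from statistics import NormalDist, fmean, stdev

def z_interval(sample, sigma, confidence=0.95):
    """Confidence interval for the mean when the population sd is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    centre = fmean(sample)
    margin = z * sigma / len(sample) ** 0.5
    return centre - margin, centre + margin

def t_interval(sample, t_critical):
    """Confidence interval for the mean when the population sd is unknown.

    t_critical must be looked up for len(sample) - 1 degrees of freedom.
    """
    centre = fmean(sample)
    margin = t_critical * stdev(sample) / len(sample) ** 0.5
    return centre - margin, centre + margin

# Hypothetical heights in inches
sample = [64.1, 66.3, 65.0, 67.2, 66.8, 65.5, 64.9, 66.0, 65.7, 66.5]

z_low, z_high = z_interval(sample, sigma=1.0)            # known variance
t_low, t_high = t_interval(sample, t_critical=2.262)     # unknown variance, 9 df

print((round(z_low, 2), round(z_high, 2)), (round(t_low, 2), round(t_high, 2)))
```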

Question 61

Why is harmonic mean used in F1 score?

Accepted Answer

The F1 score is calculated as the harmonic mean of the precision and recall values. The arithmetic mean, or simple average, treats all values equally. The harmonic mean, on the other hand, gives more weight to the low values. As a result, a classifier will only get a high F1 score if both recall and precision are high.
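A quick comparison shows the effect. The precision and recall values below are hypothetical, chosen to represent a classifier that is precise but misses most of the positives.

```python
from statistics import harmonic_mean

precision, recall = 0.9, 0.1   # hypothetical, deliberately lopsided

arithmetic = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)

# The F1 formula is the harmonic mean of the two values
same_as_harmonic = harmonic_mean([precision, recall])

print(arithmetic, round(f1, 2), round(same_as_harmonic, 2))
```

The simple average suggests a middling classifier, whereas the harmonic mean stays close to the weaker of the two values and exposes the poor recall.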

Question 62

What is the purpose of AIC and BIC?

Accepted Answer

The Akaike information criterion (AIC) is a model selection criterion based on in-sample fit that estimates how well a model is likely to predict future values. The Bayesian information criterion (BIC) is another model selection criterion that assesses the trade-off between model fit and complexity, and it penalises additional parameters more heavily as the sample size grows. To compare models with one another we use the same criterion for every model, either the AIC or the BIC, rather than mixing the two. The model with the lowest AIC or BIC among the candidates is preferred.

Question 63

What is VIF score?

Accepted Answer

When two or more variables in our regression are strongly correlated, this situation is referred to as multicollinearity. The effect of multicollinearity among our variables is measured by the variance inflation factor, or VIF score. It gauges how much the variance of an estimated regression coefficient is inflated by correlation among the predictors. Inflated variance makes the coefficient estimates unstable and harder to interpret. A general rule of thumb that is frequently applied in practice is that high multicollinearity is present if the VIF score is greater than 10.

Question 64

How do you handle imbalanced classes?

Accepted Answer

Imbalanced classes are skewed classes in which a single value, known as the majority class, makes up a large share of the data set. Consider a dataset of credit card fraud, where the fraud incidents can be as low as 1% of the data. In this situation, our model would be 99% accurate if it blindly predicted every case to be legitimate, while detecting no fraud at all. In order to prevent this, we either undersample the non-fraudulent instances or oversample the fraudulent ones, so that the fraud cases make up a sizeable fraction of the training data.

Statistics Interview Questions and Answers for 2024 Data Science
