Ready to face your next Machine Learning interview? Be interview-ready with this list of Machine Learning interview questions and answers, carefully curated by industry experts. Be ready to answer questions on topics such as CRISP-DM, the difference between univariate and bivariate analysis, the chi-square test, the difference between Type 1 and Type 2 errors, and the bias-variance tradeoff. We have gathered a set of machine learning interview questions that will help you become a machine learning engineer or data engineer.
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is a methodology for data science programs. It has the following phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
Some phases are iterative in nature, and any end-to-end data science project or program typically follows this methodology.
Below is a diagrammatic view for better understanding
In univariate analysis, variables are explored one by one. Method to perform univariate analysis will depend on whether the variable type is categorical or continuous.
In the case of continuous variables, we need to understand the central tendency and spread of the variable. For example: measures of central tendency: mean, median, mode; measures of dispersion and shape: min, max, range, quartiles, IQR, variance, standard deviation, skewness, kurtosis; visualization methods: histogram, boxplot.
Univariate analysis is also used to highlight missing and outlier values.
The relationship between two variables can be determined using bivariate analysis. It examines how the two variables are associated or disassociated at a chosen significance level. Typically, bivariate analysis can be performed for continuous and continuous variables, categorical and categorical variables, and categorical and continuous variables.
Different approaches/methods need to be used to handle the above scenarios. A scatter plot can be used irrespective of whether a relationship is linear or nonlinear. To figure out how loosely or tightly two variables are related, we can compute the correlation, whose value ranges from -1 to +1. If the value is 0, there is no linear correlation between the two variables. If it is -1, there is a perfect negative correlation, and if it is +1, there is a perfect positive correlation.
When we want to test whether two categorical variables are associated, we use the chi-square test. It is based on the squared deviation between the observed and expected frequencies, divided by the expected frequency. The resulting probability (p-value) is interpreted as follows:
Probability of 0: It indicates that both categorical variables are dependent.
Probability of 1: It shows that both variables are independent.
Probability less than 0.05: It indicates that the relationship between the variables is significant at 95% confidence. The chi-square test statistic for a test of independence of two categorical variables is found by:
Chi-square = Sum[(O - E)^2 / E]
where O = observed frequency
E = expected frequency under the null hypothesis
We use this test between two categorical variables.
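As a rough illustration of the formula above, the following R sketch computes the statistic by hand for a small contingency table and compares it with the built-in chisq.test(). The counts and the names observed, expected, and chi_sq are made up for this example and do not come from the article.

```r
# Hypothetical 2 x 2 table of observed counts (illustrative values only).
observed <- matrix(c(30, 10,
                     20, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("A", "B"),
                                   outcome = c("yes", "no")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum over all cells of (O - E)^2 / E.
chi_sq <- sum((observed - expected)^2 / expected)

# p-value from the chi-square distribution with (rows - 1) * (cols - 1) degrees of freedom.
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# The built-in test without the continuity correction should agree with the manual value.
builtin <- chisq.test(observed, correct = FALSE)
print(c(manual = chi_sq, builtin = unname(builtin$statistic), p_value = p_value))
```

A p-value below 0.05 here would lead us to reject independence of the two variables.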
When one variable is categorical and the other is continuous, and there are "many samples", the t-test is not the usual choice. If the sample size is n >= 30, we can use the z-test. When the means of multiple groups are to be compared, ANOVA can be chosen.
When we don't have many samples and the population variance is unknown, we use the t-test. In a t-test, the expectation is that the sample size is small, typically n < 30, where n is the number of observations (the sample size).
The t-test and z-test statistics can be defined as follows. The difference between the two is subtle: the z-test uses the population standard deviation and is mostly used for n >= 30, while the t-test uses the sample standard deviation and is mostly used for n < 30.
t = (xbar - mu) / (sd / sqrt(n))
where xbar = sample mean of x
mu = population mean
sd = standard deviation of the sample
n = number of observations (sample size)
z = (xbar - mu) / (sigma / sqrt(n))
where xbar = sample mean of x
mu = population mean
sigma = standard deviation of the population
n = number of observations (sample size)
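The two statistics can be computed directly from these definitions. The sketch below is only illustrative: the sample x, the hypothesized mean mu0, and the "known" population standard deviation sigma are assumed values chosen for the example.

```r
# Illustrative sample (n = 10) and an assumed hypothesized population mean.
x <- c(8, 6, 5, 8, 6, 10, 6, 3, 5, 7)
mu0 <- 5
n <- length(x)

# t statistic: uses the sample standard deviation.
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
p_t <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# z statistic: uses a population standard deviation, assumed known here.
sigma <- 2
z_stat <- (mean(x) - mu0) / (sigma / sqrt(n))
p_z <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)

# The manual t statistic should match R's one-sample t.test().
builtin <- t.test(x, mu = mu0)
print(c(t = t_stat, t_builtin = unname(builtin$statistic), z = z_stat))
```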
ANOVA is an analysis of variance. For example, let’s say we are talking about 3 groups.
Class 1  Class 2  Class 3 

8  9  3 
6  2  4 
5  6  3 
8  2  5 
6  7  4 
10  5  4 
6  2  6 
3  8  4 
5  4  5 
7  9  3 
Figure ANOVA
In "Figure ANOVA" above, ANOVA is appropriate because there are more than 2 sample groups, i.e. 3 groups of samples. There can be many rows in each class; we have considered only 10 each to keep the example simple. The per-group summary is shown below.
Class Group  Count  Sum  Average  Variance 

Class 1  10  64  6.4  3.82 
Class 2  10  54  5.4  8.04 
Class 3  10  41  4.1  0.99 
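The summary table above can be reproduced, and the one-way ANOVA itself run, with a few lines of R. This is a sketch using the values from "Figure ANOVA"; the data frame and variable names are chosen here for illustration.

```r
# The three classes from "Figure ANOVA", stacked into long format.
scores <- data.frame(
  value = c(8, 6, 5, 8, 6, 10, 6, 3, 5, 7,
            9, 2, 6, 2, 7, 5, 2, 8, 4, 9,
            3, 4, 3, 5, 4, 4, 6, 4, 5, 3),
  group = factor(rep(c("Class 1", "Class 2", "Class 3"), each = 10))
)

# Per-group sum, mean, and sample variance (matches the summary table).
group_sum <- tapply(scores$value, scores$group, sum)
group_mean <- tapply(scores$value, scores$group, mean)
group_var <- tapply(scores$value, scores$group, var)
print(round(rbind(sum = group_sum, mean = group_mean, var = group_var), 2))

# One-way ANOVA: does the mean differ between the three classes?
fit <- aov(value ~ group, data = scores)
anova_table <- summary(fit)[[1]]
f_stat <- anova_table[["F value"]][1]
p_val <- anova_table[["Pr(>F)"]][1]
print(c(F = f_stat, p = p_val))
```

The F statistic is the ratio of the between-group mean square to the within-group mean square; a small p-value suggests that at least one group mean differs from the others.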
Missing data in the training data set can reduce the power/fit of a model or can lead to a biased model because we have not analyzed the behavior and relationship with other variables correctly. It can lead to incorrect prediction or classification. Below is a simple example to illustrate this.
Name  Weight  Gender  Play Golf or Not 

AA  55  M  Yes 
BB  62  F  Yes 
CC  58  F  No 
DD  54  (missing)  No
EE  54  M  No 
FF  66  F  Yes 
GG  56  (missing)  Yes
HH  56  M  Yes 
Figure 1
Gender  # Count  # Play Golf  % Play Golf 

F  3  2  66.67% 
M  3  2  66.67% 
Missing/Blank  2  1  50% 
Figure 2
Please note the missing values in Figure 1: we have not treated them for the analysis in Figure 2. The inference from this data set is that the chances of playing golf are similar for females and males.
On the other hand, if you look at Figure 4, which shows the data after treatment of the missing values (based on gender), we can see that females have a higher chance of playing golf than males.
Name  Weight  Gender  Play Golf or Not 

AA  55  M  Yes 
BB  62  F  Yes 
CC  58  F  No 
DD  54  M  No 
EE  54  M  No 
FF  66  F  Yes 
GG  56  M  Yes 
HH  56  M  Yes 
Figure 3
Gender  # Count  # Play Golf  % Play Golf 

F  3  2  66.67% 
M  5  3  60% 
Figure 4
Below are the different types of missing values that can occur during the data collection process.
When a particular variable is missing in an observation or row, we delete the entire row. This is called listwise deletion.
When each analysis is performed with all the cases in which the variables of interest are present, only the missing variable instances are left out and not the entire row. This is called pairwise deletion. A correlation matrix computed pair by pair works this way.
Generally, pairwise deletion is preferred over listwise deletion, as listwise deletion removes the entire row for a single missing variable.
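The difference is easy to see with a correlation matrix. The following R sketch uses a small made-up data frame (the values and column names are assumptions for illustration) and base R's cor() with its use argument.

```r
# Made-up data with one missing value in b and one in c.
dat <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, NA, 6, 8, 10, 13),
  c = c(5, 3, NA, 2, 1, 0)
)

# Listwise deletion: any row containing an NA is dropped entirely.
listwise <- na.omit(dat)
cor_listwise <- cor(dat, use = "complete.obs")

# Pairwise deletion: each pair of columns uses every row where both are present.
cor_pairwise <- cor(dat, use = "pairwise.complete.obs")

# Number of rows actually used for each pair of columns under pairwise deletion.
n_pairwise <- crossprod(1 - is.na(dat))
print(nrow(listwise))
print(n_pairwise)
```

Listwise deletion keeps only the rows that are complete in every column, whereas pairwise deletion keeps more rows for pairs such as (a, b) that are only missing one value between them.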
kNN imputation is one of the methods to treat missing values, alongside direct deletion and imputation using a mean/median/mode value. In kNN imputation, the missing values of an attribute are imputed using a given number (k) of observations that are most similar to the observation whose value is missing. The similarity of two observations is determined using a distance function. Pros and cons are described below.
Pros  Cons 



Outliers can have a significant impact on the results of data analysis and statistical modeling. The impacts are as follows:
Here is an example with a sample dataset.
Measure  Without Outlier  With Outlier

Dataset  1,1,2,2,2,2,3,3,3,4,4  1,1,2,2,2,2,3,3,3,4,4,200
Mean  2.45  18.92
Median  2.00  2.50
Mode  2.00  2.00
Standard deviation  1.04  57.04
As shown above, including a single outlier changes the mean and standard deviation drastically, while the median and mode barely move.
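The figures in this example can be checked with a short R sketch. The helper name summarise_values is introduced here for illustration; sd() computes the sample standard deviation.

```r
# The sample dataset with and without the outlier value 200.
without_outlier <- c(1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4)
with_outlier <- c(without_outlier, 200)

# Mean, median, and sample standard deviation of a numeric vector.
summarise_values <- function(v) c(mean = mean(v), median = median(v), sd = sd(v))

stats <- rbind(without = summarise_values(without_outlier),
               with = summarise_values(with_outlier))
print(round(stats, 2))
```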
There are various methods.
Others could be as follows: Data points, three or more standard deviations away from the mean are considered as outlier.
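As a sketch of the three-standard-deviation rule, the R code below flags points in the earlier sample dataset. The 1.5 x IQR boxplot rule is included only as a commonly used alternative and is an assumption of this example, not something stated in the text above.

```r
# Sample dataset from the outlier example, including the value 200.
x <- c(1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 200)

# Rule 1: flag points three or more standard deviations away from the mean.
z_scores <- (x - mean(x)) / sd(x)
outliers_sd <- x[abs(z_scores) >= 3]

# Rule 2 (assumed alternative): points beyond 1.5 * IQR from the hinges,
# as used by R's boxplot.
outliers_iqr <- boxplot.stats(x)$out

print(outliers_sd)
print(outliers_iqr)
```

Note that an extreme value inflates the mean and standard deviation it is measured against, so in small samples the standard-deviation rule can be less sensitive than the IQR rule.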
There could be many assumptions. Five of them are described below:
A stationary time series has the following characteristics:
This type of time series is typically easy to predict, as not much variation is expected in its pattern and trend.
Autocorrelation and partial autocorrelation are measures of association between current and past values of a time series. Both indicate which past values are most useful in predicting future values.
Autocorrelation is the correlation of a time series with lags of itself. This is a significant metric because:
When comparing the current time step with prior time steps, there can be direct and indirect correlations. The indirect correlation between an observation and a value several steps back is carried through the intervening time steps. The partial autocorrelation function (PACF) removes the effect of the correlation due to these shorter lags, leaving only the direct correlation at each lag.
Both ACF and PACF are useful while trying to understand which model approach could be a relevant and better fit for a prediction solution.
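In R, both quantities are available through the base functions acf() and pacf(). The sketch below uses the built-in AirPassengers series purely as a stand-in dataset; the choice of series and of 24 lags is an assumption for illustration.

```r
# Built-in monthly series, used here only as an example.
series <- AirPassengers

# Autocorrelation at lags 0..24 and partial autocorrelation at lags 1..24.
acf_vals <- acf(series, lag.max = 24, plot = FALSE)
pacf_vals <- pacf(series, lag.max = 24, plot = FALSE)

# Lag 0 autocorrelation is always 1; at lag 1 there are no shorter lags
# to remove, so the ACF and PACF coincide there.
print(acf_vals$acf[1:3])
print(pacf_vals$acf[1:3])
```

Calling acf(series) or pacf(series) without plot = FALSE draws the correlogram, which is what is usually inspected when choosing a model.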
Linear regression can be used to model the time series against a linear index (e.g. 1, 2, ..., n). The resulting model's residuals represent the time series with the trend removed.
If some trend is still visible in the residuals (as appears to be the case in 'Figure1' with myData below), you may wish to add a few predictors to the lm() call (such as forecast::seasonaldummy, forecast::fourier, or a lag of the series itself) until the trend is filtered out.
Code snippet:
trModel <- lm(myData ~ c(1:length(myData))) # regress the series on its time index
plot(resid(trModel), type="l") # resid(trModel) contains the detrended series
We can use the Augmented Dickey-Fuller test (ADF test) to check whether a series is stationary. A p-value of less than 0.05 in adf.test() indicates that it is stationary.
Illustrative code snippet:
library(tseries)
adf.test(myData) # p-value < 0.05 indicates the TS is stationary
kpss.test(myData) # null hypothesis is stationarity, so p-value < 0.05 suggests the TS is not stationary
The Wilcoxon signed-rank test is a statistical test used to compare two related, matched samples. If a population cannot be assumed to be normally distributed, this test may be useful, under the assumption that the data are paired and come from the same population, and that each pair is chosen randomly. It compares the sample median against a hypothetical median. Note that the formula call below compares two independent groups (months 5 and 8), so R reports the closely related Wilcoxon rank sum test.
The boxplot below, drawn in R with the airquality sample data, helps interpret the analysis using this test.
boxplot(Ozone ~ Month, data = airquality)
wilcox.test(Ozone ~ Month, data = airquality, subset = Month %in% c(5, 8))
Wilcoxon rank sum test with continuity correction
data: Ozone by Month
W = 127.5, p-value = 0.0001208
alternative hypothesis: true location shift is not equal to 0
The interpretation is this:
If the p-value < 0.05, reject the null hypothesis and accept the alternative mentioned in your R code's output.
The Kolmogorov-Smirnov test is used to check whether 2 samples follow the same distribution.
Two-sample Kolmogorov-Smirnov test
data: x and y
D = 0.52, p-value = 1.581e-06
alternative hypothesis: two-sided
Two-sample Kolmogorov-Smirnov test
data: x and y
D = 0.1, p-value = 0.9667
alternative hypothesis: two-sided
If the p-value < 0.05 (significance level), we reject the null hypothesis that the samples are drawn from the same distribution. In other words, p < 0.05 implies that x and y come from different distributions.
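Output of this kind comes from R's ks.test(). The code that produced the output above is not shown in the text, so the sketch below uses its own simulated samples (the seed, sample sizes, and distributions are assumptions) and will print different numbers.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Three simulated samples: two from the same normal distribution,
# one from a clearly shifted normal distribution.
x <- rnorm(200)
y_shifted <- rnorm(200, mean = 3)
y_same <- rnorm(200)

# Two-sample Kolmogorov-Smirnov tests.
res_diff <- ks.test(x, y_shifted)  # different distributions
res_same <- ks.test(x, y_same)     # same distribution

print(res_diff)
print(res_same)
```

The first test should give a very small p-value, because the two samples barely overlap; the second compares two draws from the same distribution.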
A jitter plot is used to visualize correlation. It reveals overlapping points that a typical scatter plot does not show.
Consider the mpg dataset with city mileage (cty) and highway mileage (hwy). The original data has 234 data points, but a typical scatter plot seems to display fewer points.
This is because many overlapping points appear as a single dot. The fact that both cty and hwy are integers in the source dataset makes it all the easier for this detail to stay hidden.
library(ggplot2)
data(mpg, package="ggplot2")
theme_set(theme_bw()) # preset the bw theme
g <- ggplot(mpg, aes(cty, hwy))
g + geom_point() +
geom_smooth(method="lm", se=FALSE) + # add a linear fit without the confidence band
labs(subtitle="mpg: city vs highway mileage",
y="hwy",
x="cty",
title="Scatterplot with overlapping points",
caption="Source: mpg")
Now we can handle this with a jitter plot.
We can make a jitter plot with geom_jitter(). As the name suggests, the overlapping points are randomly jittered around their original positions, within a threshold controlled by the width argument.
library(ggplot2)
data(mpg, package="ggplot2")
theme_set(theme_bw()) # preset the bw theme
g <- ggplot(mpg, aes(cty, hwy))
g + geom_jitter(width = .5, size=1) + # jitter points by up to 0.5 units horizontally
labs(subtitle="mpg: city vs highway mileage",
y="hwy",
x="cty",
title="Jittered Points")
There are three types of error in any machine learning approach: bias error, variance error, and irreducible error. Generally, the focus is on striking a balance between bias and variance and reducing those errors in the model so that accuracy can be improved.
Low bias - indicates fewer assumptions about the form of the target function. Such a flexible model can fit the training data closely, but when we test it on new data, it may not give the expected results and accuracy can be compromised.
High variance - indicates large changes to the estimate of the target function with changes to the training data.
It is always tricky to balance these two, as increasing the bias will decrease the variance and increasing the variance will decrease the bias. Hence the approaches that can be followed are as follows:
This can be described in the below table.
kNN  k-means clustering

This is supervised machine learning  This is unsupervised machine learning 
This is used for classification and regression problems.  As the name suggests, it is a clustering algorithm. 
This is based on feature similarity.  This divides objects or set of data points into clusters. 
No such mechanism here.  Typically k=3 or based on elbow diagram, k value can be determined 
For example, let's consider a dataset of football players, their positions, their measurements, etc. We want to assign a position to players in a new dataset that the model, learned from the earlier training data, has not seen. We may use the kNN algorithm here, since the new players have measurements but no known positions. Now let's say we have another scenario, where the football players are to be grouped into some specific groups based on the similarity between them. In this case, k-means could be used. So, both of these are context-specific to the problem we are trying to solve.
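The contrast can be sketched in R. No football dataset is provided in the text, so the built-in iris data stands in for it here (species playing the role of position); the split size, k = 5, and the use of the class package's knn() are assumptions made for this illustration.

```r
library(class)  # provides knn(); a recommended package shipped with most R installs

set.seed(1)  # arbitrary seed for a repeatable split
train_idx <- sample(nrow(iris), 100)
train_x <- iris[train_idx, 1:4]
test_x <- iris[-train_idx, 1:4]
train_y <- iris$Species[train_idx]

# Supervised: kNN predicts the known label of the nearest training points.
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
knn_accuracy <- mean(pred == iris$Species[-train_idx])

# Unsupervised: k-means groups the rows without ever seeing the labels.
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

print(knn_accuracy)
print(table(cluster = clusters$cluster, species = iris$Species))
```

kNN needs the labels (train_y) to make a prediction, while kmeans() only receives the measurements and returns cluster numbers that carry no meaning beyond group membership.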
In a regression problem, we expect the fitted formula to explain the observed values as well as possible; for a linear regression, the assumption is that most data points lie close to the fitted line.
R-square is also known as "goodness of fit". The higher the value of R-square, the better. R-square measures the extent to which the input variables explain the variation of the target variable. If R-square is 0.75, then 75% of the variation in the target variable is explained by the input variables. So the higher the R-square value, the better the variation in the target is explained, and hence the better the model performance.
The problem arises when we add more input variables: the value of R-square keeps increasing. If the additional variables have no influence on the variation of the target variable, the higher R-square value is misleading. This is where adjusted R-square is used. Adjusted R-square is a modified version of R-square that penalizes the addition of input variables that do not improve the existing model and cannot explain the variation in the target effectively.
So if we are adding more input variables, we need to ensure that they influence the target variable; otherwise the gap between R-square and adjusted R-square will increase. If there is only one input variable, the two values will be close. If there are multiple input variables, it is suggested to use the adjusted R-square value for the goodness of fit.
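The effect is easy to reproduce on simulated data. In the sketch below, the variable names, the seed, and the data-generating formula are assumptions for illustration; the adjusted value is also recomputed from the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of input variables.

```r
set.seed(123)  # arbitrary seed for repeatability
n <- 50
x1 <- rnorm(n)
noise <- rnorm(n)              # unrelated to y by construction
y <- 2 + 3 * x1 + rnorm(n)

# Model with the useful predictor only, then with the irrelevant one added.
fit1 <- summary(lm(y ~ x1))
fit2 <- summary(lm(y ~ x1 + noise))

# Adjusted R-square for the two-predictor model, computed by hand (p = 2).
adj_manual <- 1 - (1 - fit2$r.squared) * (n - 1) / (n - 2 - 1)

print(c(r2 = fit1$r.squared, adj_r2 = fit1$adj.r.squared))
print(c(r2 = fit2$r.squared, adj_r2 = fit2$adj.r.squared, adj_manual = adj_manual))
```

R-square cannot decrease when a variable is added, whereas adjusted R-square only rises if the new variable improves the fit by more than the penalty for the extra parameter.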
Tolerance is defined as 1/VIF, where VIF stands for Variance Inflation Factor. VIF, as the name suggests, indicates how much the variance of a coefficient estimate is inflated. It is a measure that detects multicollinearity between variables. Based on the VIF values, we can decide which variables to remove or keep without compromising the adjusted R-square value. Hence 1/VIF, or tolerance, can be used to gauge which variables should be considered in the model for better performance.
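VIF can be computed in base R by regressing each predictor on the remaining ones and taking 1 / (1 - R^2). The sketch below does this for a few columns of the built-in mtcars data; the choice of dataset and columns is an assumption for illustration.

```r
# A handful of numeric predictors from a built-in dataset.
predictors <- mtcars[, c("disp", "hp", "wt", "drat")]

# VIF of each predictor: regress it on all the others and use 1 / (1 - R^2).
vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

# Tolerance is the reciprocal of VIF.
tolerance <- 1 / vif

print(round(rbind(vif = vif, tolerance = tolerance), 3))
```

A predictor that is well explained by the others has an R^2 close to 1, hence a large VIF and a tolerance close to 0.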
A Type I error is committed when the null hypothesis is true and we reject it; it is also known as a 'False Positive'. A Type II error is committed when the null hypothesis is false and we fail to reject it; it is also known as a 'False Negative'.
In the context of the confusion matrix, we can say Type I error occurs when we classify a value as positive (1) when it is actually negative (0). Type II error occurs when we classify a value as negative (0) when it is actually positive(1).
Logistic Regression models can be evaluated as follows:
Machine learning comes in several types: supervised, unsupervised, and others such as semi-supervised and reinforcement learning.
Which algorithm to select depends primarily on the type of input data and on what we are trying to accomplish with it.
Other types of machine learning are also used in different scenarios.
Generative, graph-based, and heuristic approaches are part of semi-supervised learning, while reinforcement learning can be divided into active and passive categories.
At a high level, this is how different machine learning algorithms, methods, and approaches can be used in different scenarios.
Mathematically the error emerging from any model can be broken down into 3 major components.
Error(X) = Square(Bias) + Variance + Irreducible Error
It is important to address the bias error and the variance error, which are within our control. We cannot do much about the irreducible error.
Low bias - indicates fewer assumptions about the form of the target function. Such a flexible model can fit the training data closely, but when we test it on new data, it may not give the expected results and accuracy can be compromised. High bias indicates more assumptions in a similar context.
High variance - indicates large changes to the estimate of the target function with changes to the training data. Low variance indicates smaller changes to the estimate of the target function in a similar context.
When we are trying to build a model with greater accuracy, for better performance of the model, it is critical to strike a balance between bias and variance so that errors can be minimized and the gap between actual and predicted outcomes can be reduced.
Hence balance between Bias and Variance needs to be maintained.
OLS stands for Ordinary Least Squares. OLS finds the line or estimate that minimizes the error, measured as the sum of squared errors. An error is the difference between an observed value and its corresponding predicted value. OLS is typically used in a linear regression model.
MLE stands for Maximum Likelihood Estimation. MLE is an approach for estimating the parameters of a statistical model. Here the random error is assumed to follow a distribution, e.g. the normal distribution.
MLE selects the parameter values that maximize the likelihood, or equivalently the log-likelihood, of the observed data. OLS selects the parameter values that minimize the squared error of the model.
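For a linear model with normally distributed errors, the two approaches lead to the same coefficient estimates. The R sketch below illustrates this on simulated data; the data-generating values, the seed, the starting values, and the use of optim() with the BFGS method are assumptions made for this example.

```r
set.seed(7)  # arbitrary seed for repeatability
x <- seq(-2, 2, length.out = 50)
y <- 1 + 2 * x + rnorm(50, sd = 0.5)

# OLS: lm() minimizes the sum of squared errors.
ols <- coef(lm(y ~ x))

# MLE: minimize the negative log-likelihood under normal errors.
# par = (intercept, slope, log of the error standard deviation).
neg_loglik <- function(par) {
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = exp(par[3]), log = TRUE))
}
start <- c(mean(y), 0, log(sd(y)))
mle <- optim(start, neg_loglik, method = "BFGS")

print(rbind(ols = unname(ols), mle = mle$par[1:2]))
```

The error standard deviation is optimized on the log scale so that it stays positive without needing a constrained optimizer.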
There are various key metrics used for evaluation of a logistic regression model. Key metrics are as follows:
Actual vs. Predicted  Predicted Good  Predicted Bad

Actual Good  True Positive  False Negative
Actual Bad  False Positive  True Negative
Accordingly, accuracy, specificity, sensitivity parameters can be derived.
The area under the ROC curve (AUC), also referred to as the index of accuracy (A) or the concordance index, is a widely used summary metric for the ROC curve. The higher the area under the curve, the better the predictive power of the model.
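Accuracy, sensitivity, and specificity follow directly from the four cells of the confusion matrix. The sketch below uses ten made-up actual and predicted labels (assumed values, with "Good" treated as the positive class, as in the table above).

```r
# Made-up actual and predicted classes for ten cases; "Good" is the positive class.
lv <- c("Good", "Bad")
actual <- factor(c("Good", "Good", "Good", "Good", "Bad",
                   "Bad", "Bad", "Bad", "Bad", "Good"), levels = lv)
predicted <- factor(c("Good", "Good", "Bad", "Good", "Bad",
                      "Bad", "Good", "Bad", "Bad", "Good"), levels = lv)

# Confusion matrix: rows are actual classes, columns are predicted classes.
cm <- table(Actual = actual, Predicted = predicted)

tp <- cm["Good", "Good"]  # true positives
fn <- cm["Good", "Bad"]   # false negatives
fp <- cm["Bad", "Good"]   # false positives
tn <- cm["Bad", "Bad"]    # true negatives

accuracy <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)  # true positive rate
specificity <- tn / (tn + fp)  # true negative rate

print(cm)
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```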
In a nutshell, while handling missing values, we will have to understand data first and based on that, various mechanisms can be performed to treat them.
There is no specific rule for a particular scenario. It is datadriven and context specific.
For time series datasets, k-fold cross-validation can be troublesome because there might be some pattern in year 4 or 5 which is not present in year 3. Resampling the dataset will separate these trends, and we might end up validating on past years, which is incorrect. Instead, we can use a forward-chaining strategy with 5 folds, in which each fold trains on all years up to a given point and tests on the year that follows.
This assumes that 6 years of historical data are available.
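The forward-chaining folds for 6 years of data can be generated as follows. This is a minimal sketch; the structure of the folds list (train and test elements) is a choice made here for illustration.

```r
years <- 1:6

# Fold i trains on years 1..i and tests on year i + 1.
folds <- lapply(1:5, function(i) {
  list(train = years[1:i], test = years[i + 1])
})

# Print the folds in a readable form.
for (i in seq_along(folds)) {
  cat(sprintf("fold %d: train [%s], test [%d]\n",
              i, paste(folds[[i]]$train, collapse = " "), folds[[i]]$test))
}
```

Every test year lies strictly after all of its training years, so the model is never validated on the past.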
With one-hot encoding, the dimensionality (i.e. the number of features) of a dataset increases because a new variable is created for each level present in a categorical variable. For example, let's say we have a variable 'color' with 3 levels, namely Red, Blue, and Green. One-hot encoding the 'color' variable will generate three new variables, Color.Red, Color.Blue, and Color.Green, each containing 0 or 1. In label encoding, the levels of a categorical variable are encoded as integers (0, 1, 2, and so on) within the same variable, so no new variable is created. Label encoding is mostly used for binary variables, where the codes are simply 0 and 1.
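Both encodings can be produced in base R. In this sketch the example values are made up, and the generated column names (colorBlue, colorGreen, colorRed) are what model.matrix() produces, rather than the Color.Red style names used in the text.

```r
# A small categorical variable with three levels.
dat <- data.frame(color = factor(c("Red", "Blue", "Green", "Blue")))

# One-hot encoding: one 0/1 column per level ("- 1" drops the intercept column).
one_hot <- model.matrix(~ color - 1, data = dat)

# Label encoding: a single integer code per level, starting at 0.
# Levels are ordered alphabetically: Blue = 0, Green = 1, Red = 2.
label <- as.integer(dat$color) - 1L

print(one_hot)
print(label)
```

One-hot encoding adds a column per level, while label encoding keeps a single column but imposes an arbitrary ordering on the levels.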
This is a scenario where the model overfits: we get perfect accuracy on the training data, or in other words, the training error is zero or almost zero.
When we divide the dataset into training and test sets and build our model on the training set, the objective is to validate that model on a test set which the model has not seen. If, based on the features it learned from the training set, the model also performs well on a new dataset with similar features, that shows the model generalizes with low error.
In this context, random forest, which can be used as a classification algorithm, has several hyperparameters that must be chosen carefully when building the model. The number of trees is one of those parameters, and in this case we need to reduce it so that the model behaves appropriately and does not overfit. The number of trees can be tuned using a k-fold cross-validation approach, where k can be 5, 10, or any number of folds that we wish to use.
No, classical regression techniques can not be used here.
Since the number of variables is greater than the number of observations, this is a high-dimensional dataset. Ordinary least squares cannot be used, as there is no unique least squares estimate and the variance of the estimates becomes infinite.
We will have to use regression techniques such as lasso or ridge, which penalize the coefficients and thereby reduce the variance. Subset regression and/or stepwise regression with a forward step approach can also be explored.
Both Random Forest (RF) and Gradient Boosting (GBM) are tree-based supervised machine learning algorithms, and both are ensemble methods.
RF builds on fully grown decision trees, a fairly complex form of tree-based model that is inclined to overfit. GBM instead is a boosting-based approach, which is built from weak classifiers.
The accuracy of RF is improved mainly by reducing variance. GBM has more hyperparameters to tune for accuracy and can be tuned to trade off bias against variance.
We can follow the below steps for variable selection. There could be other ways to accomplish this as well.
It is suggested that in the presence of few variables with medium / large sized effect, lasso regression can be used. In the presence of many variables with small/medium sized effect, ridge regression can be preferred.
Conceptually, lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression (L2) only does parameter shrinkage and ends up including all the coefficients in the model. In the presence of correlated variables, ridge regression might be the preferred choice. Additionally, ridge regression works best in situations where the least squares estimates have high variance.
Therefore, it depends on our business goal and model objective as to what is the expectation.
Accordingly, decisions can be taken.
To check for multicollinearity, we can create a correlation matrix to identify and remove variables having a correlation above 75% (the threshold is subjective). In addition, we can calculate the VIF (variance inflation factor) to check for the presence of multicollinearity.
VIF value <= 4 suggests no multicollinearity whereas a value of >= 10 implies serious multicollinearity.
Additionally, we can use tolerance as an indicator of multicollinearity.
However, removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression. Additionally, we can add some random noise in a correlated variable so that the variables become different from each other. But, adding noise might affect the prediction accuracy, hence this approach should be carefully used with some balancing effect.
Consider the universities dataset below, which contains data for 25 undergraduate programs at business schools in US universities in 1995. The dataset excludes image variables (student satisfaction, employer satisfaction, dean's opinions, etc.). Given this
SAT  Average SAT score of new freshmen
Top10  % of new freshmen in the top 10% of their high school class
Accept  % of applicants accepted
SFRatio  Student-to-faculty ratio
Expenses  Estimated annual expenses
GradRate  Graduation rate (%)
Univ  SAT  Top10  Accept  SFRatio  Expenses  GradRate 

Brown  1310  89  22  13  22,704  94 
CalTech  1415  100  25  6  63,575  81 
CMU  1260  62  59  9  25,026  72 
Columbia  1310  76  24  12  31,510  88 
Cornell  1280  83  33  13  21,864  90
Dartmouth  1340  89  23  10  32,162  95 
Duke  1315  90  30  12  31,585  95 
Georgetown  1255  74  24  12  20,126  92 
Harvard  1400  91  14  11  39,525  97 
JohnsHopkins  1305  75  44  7  58,691  87
MIT  1380  94  30  10  34,870  91 
Northwestern  1260  85  39  11  28,052  89 
NotreDame  1255  81  42  13  15,122  94 
PennState  1081  38  54  18  10,185  80 
Princeton  1375  91  14  8  30,220  95
Purdue  1005  28  90  19  9,066  69 
Stanford  1360  90  20  12  36,450  93 
TexasA&M  1075  49  67  25  8,704  67 
UCBerkeley  1240  95  40  17  15,140  78 
UChicago  1290  75  50  13  38,380  87 
UMichigan  1180  65  68  16  15,470  85 
UPenn  1285  80  36  11  27,553  90 
UVA  1225  77  44  14  13,349  92 
UWisconsin  1085  40  69  15  11,857  71 
Yale  1375  95  19  11  43,514  96 
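The distance between two universities can be derived as follows.
The simple Euclidean distance is the square root of the sum of squared differences between the two universities across all variables.
To get a standardized distance, we first have to normalize each variable, because otherwise the variable with the largest scale (here Expenses) dominates the result.
Hence the standardized Euclidean distance between CalTech and Cornell is as follows: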
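Both distances can be computed from the table above with a few lines of R. In this sketch the table is re-entered as text (with thousands separators removed and the university names spelled out), and standardization is done with scale(), i.e. subtracting each column's mean and dividing by its sample standard deviation; that particular choice of normalization is an assumption of this example.

```r
# The universities table, entered as whitespace-separated text.
univ <- read.table(header = TRUE, row.names = 1, text = "Univ SAT Top10 Accept SFRatio Expenses GradRate
Brown 1310 89 22 13 22704 94
CalTech 1415 100 25 6 63575 81
CMU 1260 62 59 9 25026 72
Columbia 1310 76 24 12 31510 88
Cornell 1280 83 33 13 21864 90
Dartmouth 1340 89 23 10 32162 95
Duke 1315 90 30 12 31585 95
Georgetown 1255 74 24 12 20126 92
Harvard 1400 91 14 11 39525 97
JohnsHopkins 1305 75 44 7 58691 87
MIT 1380 94 30 10 34870 91
Northwestern 1260 85 39 11 28052 89
NotreDame 1255 81 42 13 15122 94
PennState 1081 38 54 18 10185 80
Princeton 1375 91 14 8 30220 95
Purdue 1005 28 90 19 9066 69
Stanford 1360 90 20 12 36450 93
TexasA&M 1075 49 67 25 8704 67
UCBerkeley 1240 95 40 17 15140 78
UChicago 1290 75 50 13 38380 87
UMichigan 1180 65 68 16 15470 85
UPenn 1285 80 36 11 27553 90
UVA 1225 77 44 14 13349 92
UWisconsin 1085 40 69 15 11857 71
Yale 1375 95 19 11 43514 96")

m <- as.matrix(univ)

# Raw Euclidean distance: dominated by Expenses because of its scale.
raw_dist <- sqrt(sum((m["CalTech", ] - m["Cornell", ])^2))

# Standardized distance: z-score every column first, then take the distance.
z <- scale(m)
std_dist <- sqrt(sum((z["CalTech", ] - z["Cornell", ])^2))

print(c(raw = raw_dist, standardized = std_dist))
```

The same result is available from dist(m) and dist(scale(m)), which return the distances between all pairs of universities at once.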
We have below data with 10 transactions. What is the performance measure “Support” for “if white then blue”?
Transaction#  Faceplate  Colors  Purchased  

1  red  white  green  
2  white  orange  
3  white  blue  
4  red  white  orange  
5  red  blue  
6  white  blue  
7  white  orange  
8  red  white  blue  green 
9  red  white  blue  
10  yellow 
{white} → {blue}
Support s = 4/10 = 0.4
Hence Support is 40%.
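The support of a rule is defined as the percentage (or number) of transactions in which both the antecedent (If) and the consequent (Then) appear in the data. Here, white and blue appear together in 4 of the 10 transactions (3, 6, 8, and 9).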
We have below data with 10 transactions. What is the performance measure “Confidence” for “if white then blue”?
Transactions#  Faceplate  Colors  Purchased  

1  red  white  green  
2  white  orange  
3  white  blue  
4  red  white  orange  
5  red  blue  
6  white  blue  
7  white  orange  
8  red  white  blue  green 
9  red  white  blue  
10  yellow 
{white} → {blue}
Confidence = 4 / 8 = 0.5
Confidence is defined as the percentage of antecedent (If) transactions that also have the consequent (Then) itemset, which is the same as P(Consequent | Antecedent) = P(C & A) / P(A). Here, white appears in 8 transactions, and 4 of them also contain blue.
We have below data with 10 transactions. What is the “Lift Ratio” for “if white then blue”?
Transactions#  Faceplate  Colors  Purchased  

1  red  white  green  
2  white  orange  
3  white  blue  
4  red  white  orange  
5  red  blue  
6  white  blue  
7  white  orange  
8  red  white  blue  green 
9  red  white  blue  
10  yellow 
{white} → {blue}
Lift = 0.4 / (0.5 * 0.8) = 0.4 / 0.4 = 1
Lift = confidence / (benchmark confidence)
The benchmark assumes independence between antecedent and consequent:
P(Consequent & Antecedent) = P(C) * P(A)
Benchmark confidence
= P(C | A) = P(C & A) / P(A) = P(C) * P(A) / P(A) = P(C)
Lift = Support(C & A) / [Support(C) * Support(A)]
Lift > 1 indicates a rule that is useful in finding consequent item sets (i.e. more useful than selecting transactions randomly)
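Support, confidence, and lift for the rule {white} → {blue} can be checked with a short R sketch over the 10 transactions. The helper function has() and the variable names are introduced here for illustration.

```r
# The 10 faceplate transactions from the table.
transactions <- list(
  c("red", "white", "green"),
  c("white", "orange"),
  c("white", "blue"),
  c("red", "white", "orange"),
  c("red", "blue"),
  c("white", "blue"),
  c("white", "orange"),
  c("red", "white", "blue", "green"),
  c("red", "white", "blue"),
  c("yellow")
)

# TRUE for each transaction that contains the given item.
has <- function(item) sapply(transactions, function(tr) item %in% tr)

support_a <- mean(has("white"))                    # P(A), antecedent
support_c <- mean(has("blue"))                     # P(C), consequent
support_rule <- mean(has("white") & has("blue"))   # P(C & A)

confidence <- support_rule / support_a             # P(C | A)
lift <- confidence / support_c                     # confidence / benchmark confidence

print(c(support = support_rule, confidence = confidence, lift = lift))
```

A lift of 1 means that buying white tells us nothing extra about buying blue, which is consistent with the benchmark assumption of independence.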
CRISPDM stands for Cross Industry Standard Process for Data Mining. It is a methodology for data science programs. It has the following phases:
Some phases are iterative in nature and any data science project or program which is end to end typically follows this methodology.
Below is a diagrammatic view for better understanding
In univariate analysis, variables are explored one by one. Method to perform univariate analysis will depend on whether the variable type is categorical or continuous.
In the case of continuous variables, we need to understand the central tendency and spread of the variable. For example central tendency – mean, median, mode, max, min, etc.; a measure of dispersion – range, quartile, IQR, variance, standard deviation, skewness, kurtosis etc; visualization methods – histogram, boxplot etc.
Univariate analysis is also used to highlight missing and outlier values.
The relationship between two variables can be determined using bivariate analysis. How the two variables are associated and/or disassociated are looked into considering the significance level of comparison. Typically bivariate analysis can be performed for:
Different approaches/methods need to be used to handle the above scenarios. Scatter plot can be used irrespective of whether a relationship is linear or nonlinear. In order to figure out how loosely or tightly both variables are correlated, correlation can be performed where the correlation values indicate from 1 to 1. If the value indicates 0, then there is no correlation between the two variables. If it is 1, then there is a perfect ve correlation and if it is a +1 then it is a perfect +ve correlation.
When we want to find out the statistical significance between two variables, then the chisquare test is used to understand the deviation between observed and expected frequency and divided by the expected frequency.
Probability of 0: It indicates that both categorical variables are dependent
Probability of 1: It shows that both variables are independent.
Probability less than 0.05: It indicates that the relationship between the variables is significant at 95% confidence. The chisquare test statistic for a test of independence of two categorical variables is found by:
Square of X = Sum[Square of (O  E) / E]
Where O – Observed frequency
E – Expected frequency under the null hypothesis
We use this between two Categorical variables.
When variables are categorical and continuous, and there are “many samples”, then we should not use the ttest. If sample size n>=30, then we can go for ztest. When there are too many samples and the mean/average of multiple groups are to be compared, then ANOVA can be chosen.
When we don’t have many samples and variance is unknown, then we will use the ttest. In a ttest, the expectation is that the sample size is smaller. Typical n<30, where n is the number of observations or sample size.
The ttest and ztest can be defined as follows. There is a very subtle difference between the two. ztest is used for n>=30 and ttest is used for n<30 scenarios mostly.
ttest = (xbar  mu) / (sd / sqrt(n))
where xbar = sample average or sample mean of x
mu = population average or population mean
sd = standard deviation of a sample
n = number of observations, which is sample size
ztest = (xbar  mu) / (sigma / sqrt(n))
where xbar = sample average or sample mean of x
mu = population average or population mean
sigma = standard deviation of a population
n = number of observations, which is sample size
ANOVA is an analysis of variance. For example, let’s say we are talking about 3 groups.
Class 1  Class 2  Class 3 

8  9  3 
6  2  4 
5  6  3 
8  2  5 
6  7  4 
10  5  4 
6  2  6 
3  8  4 
5  4  5 
7  9  3 
Figure ANOVA
In the “Figure ANOVA” above, we can consider ANOVA for analysis as there are more than 2 sample groups. i.e. 3 groups of samples. There can be many rows in each class. We have considered only 10 each for simple understanding.
Class Group | Count | Sum | Average | Variance
--- | --- | --- | --- | ---
Class 1 | 10 | 64 | 6.4 | 3.82
Class 2 | 10 | 54 | 5.4 | 8.04
Class 3 | 10 | 41 | 4.1 | 0.99
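To show how the summary above feeds into the test itself, here is a minimal sketch of the one-way ANOVA F statistic computed on the three classes from “Figure ANOVA”. The function name `one_way_anova_f` is an assumption for illustration; the data values are the ones listed in the table.

```python
import statistics

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of sample groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    # Between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of the values around their own group mean
    ss_within = sum(sum((v - statistics.mean(g)) ** 2 for v in g) for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

class_1 = [8, 6, 5, 8, 6, 10, 6, 3, 5, 7]
class_2 = [9, 2, 6, 2, 7, 5, 2, 8, 4, 9]
class_3 = [3, 4, 3, 5, 4, 4, 6, 4, 5, 3]
f_stat, df_between, df_within = one_way_anova_f([class_1, class_2, class_3])
print(round(f_stat, 3), df_between, df_within)
```

The resulting F value is compared with the critical value of the F distribution for (2, 27) degrees of freedom, which is commonly tabulated as about 3.35 at the 5% level.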
Missing data in the training data set can reduce the power/fit of a model or can lead to a biased model because we have not analyzed the behavior and relationship with other variables correctly. It can lead to incorrect prediction or classification. Below is a simple example to illustrate this.
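Name | Weight | Gender | Play Golf or Not
--- | --- | --- | ---
AA | 55 | M | Yes
BB | 62 | F | Yes
CC | 58 | F | No
DD | 54 |  | No
EE | 54 | M | No
FF | 66 | F | Yes
GG | 56 |  | Yes
HH | 56 | M | Yes
Figure 1
Gender | # Count | # Play Golf | % Play Golf
--- | --- | --- | ---
F | 3 | 2 | 66.67%
M | 3 | 2 | 66.67%
Missing/Blank | 2 | 1 | 50%
Figure 2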
Please note the missing values in Figure 1: they have not been treated for the analysis in Figure 2. The inference from this data set is that females and males are equally likely to play golf.
On the other hand, Figure 4 shows the data after treatment of missing values (based on gender), and there we can see that females have a higher chance of playing golf than males.
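Name | Weight | Gender | Play Golf or Not
--- | --- | --- | ---
AA | 55 | M | Yes
BB | 62 | F | Yes
CC | 58 | F | No
DD | 54 | M | No
EE | 54 | M | No
FF | 66 | F | Yes
GG | 56 | M | Yes
HH | 56 | M | Yes
Figure 3
Gender | # Count | # Play Golf | % Play Golf
--- | --- | --- | ---
F | 3 | 2 | 66.67%
M | 5 | 3 | 60%
Figure 4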
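The two summary tables can be reproduced with a short sketch. The rows below are the (gender, plays golf) pairs from Figure 1 and Figure 3; the helper name `play_rate_by_gender` and the use of `None` for a blank gender are assumptions for illustration.

```python
from collections import defaultdict

def play_rate_by_gender(rows):
    """Return {gender: (count, players, percent)} for (gender, plays) pairs."""
    counts = defaultdict(lambda: [0, 0])
    for gender, plays in rows:
        key = gender if gender else "Missing"
        counts[key][0] += 1
        counts[key][1] += 1 if plays == "Yes" else 0
    return {g: (n, k, round(100 * k / n, 2)) for g, (n, k) in counts.items()}

# Figure 1: gender is blank (None) for DD and GG
untreated = [("M", "Yes"), ("F", "Yes"), ("F", "No"), (None, "No"),
             ("M", "No"), ("F", "Yes"), (None, "Yes"), ("M", "Yes")]
# Figure 3: the two blank genders have been filled in as M
treated = [("M", "Yes"), ("F", "Yes"), ("F", "No"), ("M", "No"),
           ("M", "No"), ("F", "Yes"), ("M", "Yes"), ("M", "Yes")]

before = play_rate_by_gender(untreated)
after = play_rate_by_gender(treated)
print(before)
print(after)
```

Before treatment both genders show the same rate; after treatment the rate for males drops to 60%, which is the change in inference described above.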
Below are the different types of missing values that can occur during the data collection process.
When a particular variable is missing in an observation (row), we delete the entire row. This is called listwise deletion.
In pairwise deletion, each analysis is performed with all cases in which the variables of interest are present, so only the missing values are excluded and not the entire row. This works like a correlation matrix, where each pairwise correlation uses every row that has both variables.
Generally, pairwise deletion is preferred over listwise deletion, because listwise deletion discards an entire row even when only one variable is missing.
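A minimal sketch of the difference, using a small hypothetical dataset with three variables x, y and z (the rows and helper names are assumptions for illustration):

```python
# Hypothetical rows; None marks a missing value
rows = [
    {"x": 1.0, "y": 2.0, "z": 3.0},
    {"x": 2.0, "y": None, "z": 1.0},
    {"x": 3.0, "y": 4.0, "z": None},
    {"x": 4.0, "y": 5.0, "z": 6.0},
    {"x": None, "y": 1.0, "z": 2.0},
]

def listwise_delete(rows):
    """Keep only rows in which every variable is present."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise_rows(rows, a, b):
    """Keep rows in which both variables of the current analysis are present."""
    return [r for r in rows if r[a] is not None and r[b] is not None]

complete = listwise_delete(rows)
xy = pairwise_rows(rows, "x", "y")
xz = pairwise_rows(rows, "x", "z")
print(len(complete), len(xy), len(xz))
```

Listwise deletion leaves 2 usable rows here, while each pairwise analysis still has 3, which is why pairwise deletion usually retains more information.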
kNN imputation is one of the methods to treat missing values, alongside direct deletion and imputation using a mean/median/mode value. In kNN imputation, the missing values of an attribute are imputed using the given number of attributes that are most similar to the attribute whose values are missing. The similarity of two attributes is determined using a distance function. Pros and cons are described below.
Pros  Cons 



Outliers can have a significant impact on the results of data analysis and statistical modeling. These are as follows:
Here is an example with a sample dataset.
Without Outlier | With Outlier
--- | ---
Dataset: 1,1,2,2,2,2,3,3,3,4,4 | Dataset: 1,1,2,2,2,2,3,3,3,4,4,200
Mean = 2.45 | Mean = 18.91
Median = 2.00 | Median = 2.50
Mode = 2.00 | Mode = 2.00
Standard deviation = 1.035 | Standard deviation = 57.03
As seen above, including a single outlier makes a huge difference to the mean and standard deviation, while the median and mode barely change.
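The figures in the table can be checked with Python's standard `statistics` module. This is only a sketch of the comparison; the helper name `summarize` is an assumption, and the computed values agree with the table up to rounding in the last digit.

```python
import statistics

without_outlier = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4]
with_outlier = without_outlier + [200]

def summarize(data):
    """Return the four summary statistics used in the table (sample standard deviation)."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "stdev": statistics.stdev(data),
    }

clean = summarize(without_outlier)
noisy = summarize(with_outlier)
print(clean)
print(noisy)
```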
There are various methods.
Others could be as follows: data points three or more standard deviations away from the mean are considered outliers.
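The three-standard-deviation rule can be sketched as follows. The dataset is a small illustrative one with a single extreme value of 200; the function name and the choice of the sample standard deviation are assumptions.

```python
import statistics

def three_sigma_outliers(data, threshold=3.0):
    """Return the points lying at least `threshold` standard deviations from the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation
    return [x for x in data if abs(x - mean) >= threshold * sd]

data = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 200]
outliers = three_sigma_outliers(data)
print(outliers)
```

Note that an extreme value also inflates the mean and standard deviation used to detect it, so in small samples this rule can miss outliers that a median-based rule would catch.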
There could be many assumptions. Five of them are described below:
A stationary time series has the following characteristics:
This type of time series is typically easy to predict, as not much variation is expected in its pattern and trend.
Autocorrelation and partial autocorrelation are measures of association between current and past time series values. Both indicate which past time series values are most useful in predicting future values.
Autocorrelation is the correlation of a time series with lags of itself. This is a significant metric because:
When comparing the current time step with prior time steps, there can be direct and indirect correlations. The indirect correlations are a linear function of the correlations of the observation with the intervening time steps. PACF, or partial autocorrelation, removes the effect of the correlation due to shorter lags.
Both ACF and PACF are useful while trying to understand which model approach could be a relevant and better fit for a prediction solution.
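As a minimal sketch of the autocorrelation part, the function below computes the usual sample autocorrelation at a given lag: the lagged covariance around the overall mean divided by the total variation of the series. The series values and the function name are assumptions for illustration; the partial autocorrelation is not covered here.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return numer / denom

# Hypothetical short series with a steady upward trend
series = [1, 2, 3, 4, 5]
acf_values = [autocorrelation(series, k) for k in range(3)]
print(acf_values)
```

By construction the value at lag 0 is 1, and the values at higher lags show how quickly the dependence on past values fades.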
Linear regression can be used to model the Time Series data with linear indices (Ex: 1, 2,...n). The resulting model’s residuals are a representation of the time series devoid of the trend.
If some trend is still visible in the residuals (as seems to be the case in ‘Figure1’ with myData as an example), then you might wish to add a few predictors to the lm() call (such as forecast::seasonaldummy, forecast::fourier, or a lag of the series itself) until the trend is filtered out.
Code snippet:
trModel <- lm(myData ~ c(1:length(myData))) # regress the series on a linear time index
plot(resid(trModel), type="l") # resid(trModel) contains the detrended series
We can use the Augmented Dickey-Fuller test (ADF test) to test whether a series is stationary. A p-value of less than 0.05 in adf.test() indicates that it is stationary.
Illustrative code snippet:
library(tseries)
adf.test(myData) # p-value < 0.05 indicates the TS is stationary
kpss.test(myData) # KPSS null hypothesis is stationarity, so here p-value > 0.05 is consistent with a stationary TS
It is a statistical test used to compare two related and matched samples. If a population cannot be assumed to be normally distributed, then this test may be useful, with the assumption that the data are paired and come from the same population. Each data pair is chosen randomly. It compares the sample median with a hypothetical median.
The boxplot below in R with the “airquality” sample data demonstrates the interpretation of the analysis using this test.
boxplot(Ozone ~ Month, data = airquality)
wilcox.test(Ozone ~ Month, data = airquality, subset = Month %in% c(5, 8))
Wilcoxon rank sum test with continuity correction
data: Ozone by Month
W = 127.5, p-value = 0.0001208
alternative hypothesis: true location shift is not equal to 0
The interpretation is this:
If the p-value < 0.05, reject the null hypothesis and accept the alternative mentioned in your R code’s output.
The Kolmogorov-Smirnov test is used to check whether 2 samples follow the same distribution.
Two-sample Kolmogorov-Smirnov test
data: x and y
D = 0.52, p-value = 1.581e-06
alternative hypothesis: two-sided
Two-sample Kolmogorov-Smirnov test
data: x and y
D = 0.1, p-value = 0.9667
alternative hypothesis: two-sided
If the p-value < 0.05 (significance level), we reject the null hypothesis that they are drawn from the same distribution. In other words, p < 0.05 implies x and y are from different distributions.
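The D value reported by the test is the largest vertical gap between the two empirical cumulative distribution functions. A minimal sketch of that statistic is shown below; the sample values are hypothetical, and the p-value calculation is deliberately left out.

```python
from bisect import bisect_right

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov D: maximum gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)

    def ecdf(sorted_sample, value):
        # Fraction of the sample that is <= value
        return bisect_right(sorted_sample, value) / len(sorted_sample)

    # The maximum gap can only occur at one of the observed values
    return max(abs(ecdf(xs, v) - ecdf(ys, v)) for v in xs + ys)

x = [1, 2, 3, 4]
y = [3, 4, 5, 6]
d = ks_statistic(x, y)
print(d)
```

A D close to 0 means the two samples have nearly identical empirical distributions, while a D close to 1 means they barely overlap.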
A jitter plot is used to visualize correlation. It reveals points that a scatter plot typically does not show because they overlap.
We consider the mpg dataset with city mileage (cty) and highway mileage (hwy). The original data has 234 data points, but a typical scatter plot seems to display fewer points.
This is because there are many overlapping points appearing as a single dot. The fact that both cty and hwy are integers in the source dataset makes it all the easier for this detail to stay hidden.
library(ggplot2)
data(mpg, package="ggplot2")
theme_set(theme_bw()) # preset the bw theme
g <- ggplot(mpg, aes(cty, hwy))
g + geom_point() +
  geom_smooth(method="lm", se=FALSE) + # fitted line without the confidence band
  labs(subtitle="mpg: city vs highway mileage",
       y="hwy",
       x="cty",
       title="Scatterplot with overlapping points",
       caption="Source: mpg")
Now we can handle this with a jitter plot.
We can make a jitter plot with geom_jitter(). As the name suggests, the overlapping points are randomly jittered around their original position based on a threshold controlled by the width argument.
library(ggplot2)
data(mpg, package="ggplot2")
theme_set(theme_bw()) # preset the bw theme
g <- ggplot(mpg, aes(cty, hwy))
g + geom_jitter(width = .5, size=1) + # width controls the amount of horizontal jitter
  labs(subtitle="mpg: city vs highway mileage",
       y="hwy",
       x="cty",
       title="Jittered Points")
There are three types of error in any machine learning approach: bias error, variance error, and irreducible error. Generally, the focus is on striking a balance between bias and variance and reducing those errors in the model so that accuracy can be improved.
Low bias: indicates fewer assumptions about the form of the target variable or function. In this case, when we test on new data, it does not give expected results and accuracy can be compromised.
High variance: indicates large changes to the estimate of the target variable or target function with changes to the training data.
It is always tricky to balance these two, as increasing the bias will decrease the variance and increasing the variance will decrease the bias. Hence the approaches that can be followed are as follows:
The differences between kNN and k-means can be described in the table below.
kNN | k-means clustering
--- | ---
This is supervised machine learning. | This is unsupervised machine learning.
This is used for classification and regression problems. | As the name suggests, it is a clustering algorithm.
This is based on feature similarity. | This divides objects or a set of data points into clusters.
No such mechanism here. | Typically k=3, or the value of k can be determined from an elbow diagram.
For example, let’s consider a dataset of football players, their positions, their measurements, etc. We want to assign a position to players in a new dataset that is unseen by the model, which was trained on the earlier data. We may use the kNN algorithm, since the measurements of the new players are available but their positions are not known. At the same time, let’s say we have another scenario where a dataset of these football players is to be divided into groups based on some similarity between the players. In this case, k-means could be used. So, both of these are context specific to the problem we are trying to solve.
In a regression problem, we expect that the solution or mathematical formula we define explains all possible values; for a linear regression, the assumption is that most data points should lie close to the fitted line.
R-square is also known as “goodness of fit”. The higher the value of R-square, the better. R-square measures the extent to which the input variables explain the variation of the target (predicted) variable. If R-square is 0.75, it indicates that 75% of the variation in the target variable is explained by the input variables. So the higher the R-square value, the better the variation in the target is explained, and hence the better the model performance.
The problem arises when we add more input variables: the value of R-square keeps increasing. If the additional variables have no influence on the variation of the target variable, then the higher R-square value is misleading. This is where the adjusted R-square is used. The adjusted R-square is a modified version of R-square that penalizes the addition of input variables that do not improve the existing model and cannot explain the variation in the target effectively.
So if we are adding more input variables, we need to ensure they influence the target variable, else the gap between R-square and adjusted R-square will increase. If there is only one input variable, the two values will be nearly the same. If there are multiple input variables, it is suggested to consider the adjusted R-square value for the goodness of fit.
Tolerance is defined as 1/VIF, where VIF stands for Variance Inflation Factor. VIF, as the name suggests, indicates the inflation in variance. It is a parameter that detects multicollinearity between variables. Based on VIF values, we can determine whether to remove or include variables without compromising the adjusted R-square value. Hence 1/VIF, or tolerance, can be used to gauge which parameters should be considered in the model for better performance.
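The penalty can be made concrete with the standard adjusted R-square formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of input variables. The sketch below assumes that formula; the R-square of 0.75 and the sample size of 50 are hypothetical values.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R-square: penalizes R-square for the number of input variables."""
    return 1 - (1 - r_squared) * (n_observations - 1) / (n_observations - n_predictors - 1)

# Same R-square of 0.75 on 50 observations, with 1 versus 5 input variables
adj_one = adjusted_r_squared(0.75, n_observations=50, n_predictors=1)
adj_five = adjusted_r_squared(0.75, n_observations=50, n_predictors=5)
print(round(adj_one, 4), round(adj_five, 4))
```

With the same R-square, the model with more input variables gets the lower adjusted value, so extra variables only pay off if they raise R-square enough to offset the penalty.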
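A small sketch of the relationship, assuming the usual definition VIF = 1 / (1 - R_j^2), where R_j^2 is the R-square obtained by regressing input variable j on all the other input variables. The R_j^2 values used here are hypothetical.

```python
def vif(r_squared_j):
    """Variance inflation factor for a variable whose regression on the others gives r_squared_j."""
    return 1.0 / (1.0 - r_squared_j)

def tolerance(r_squared_j):
    """Tolerance is the reciprocal of VIF, i.e. 1 - r_squared_j."""
    return 1.0 / vif(r_squared_j)

# Hypothetical R-square values of each input variable against the others
for r2 in (0.0, 0.75, 0.9):
    print(r2, round(vif(r2), 2), round(tolerance(r2), 2))
```

An input variable that is unrelated to the others has a VIF of 1, and the VIF grows without bound as the variable becomes a near-linear combination of the others.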
Type I error is committed when the null hypothesis is true and we reject it, also known as a ‘False Positive’. Type II error is committed when the null hypothesis is false and we fail to reject it, also known as a ‘False Negative’.
In the context of the confusion matrix, we can say Type I error occurs when we classify a value as positive (1) when it is actually negative (0). Type II error occurs when we classify a value as negative (0) when it is actually positive(1).
Logistic Regression models can be evaluated as follows:
Machine learning can be of several types: supervised, unsupervised, and others such as semi-supervised learning, reinforcement learning, etc.
The choice of algorithm depends primarily on the type of input data and on what we are trying to accomplish with it.
Other types of machine learning are also used in different scenarios.
Generative, graph-based and heuristic approaches are part of semi-supervised learning, while reinforcement learning can be divided into active and passive categories.
This is how, at a high level, different machine learning algorithms, methods and approaches can be used in different scenarios.
Mathematically the error emerging from any model can be broken down into 3 major components.
$$\text{Error}(x) = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$
It is important to address the bias error and the variance error, which are within our control. We can’t do much about the irreducible error.
Low bias: indicates fewer assumptions about the form of the target variable or function. In this case, when we test on new data, it does not give expected results and accuracy can be compromised. High bias indicates more assumptions in a similar context.
High variance: indicates large changes to the estimate of the target variable or target function with changes to the training data. Low variance indicates smaller changes to the estimate of the target variable or target function in a similar context.
When we are trying to build a model with greater accuracy, for better performance of the model, it is critical to strike a balance between bias and variance so that errors can be minimized and the gap between actual and predicted outcomes can be reduced.
Hence balance between Bias and Variance needs to be maintained.
OLS stands for Ordinary Least Squares. OLS finds the line or estimate that minimizes the sum of squared errors, where an error is the difference between an observed value and its corresponding predicted value. This is typically used in a linear regression model scenario.
MLE stands for Maximum Likelihood Estimate. MLE is an approach for estimating the parameters of a statistical model. Here the random error is assumed to follow a distribution, e.g. a normal distribution.
MLE selects the parameter values that maximize the likelihood or log-likelihood (the logarithm is used for numerical convenience). OLS selects the parameter values that minimize the squared error of the model.
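As a sketch of the OLS side, the closed-form solution for a single input variable is slope = Sxy / Sxx and intercept = ybar - slope * xbar. The data points below are hypothetical. When the errors are assumed to be normally distributed with constant variance, maximizing the likelihood leads to the same coefficients, which is why the two approaches coincide in that case.

```python
def ols_fit(x, y):
    """Return (intercept, slope) minimizing the sum of squared errors for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def sum_squared_errors(x, y, intercept, slope):
    """Sum of squared differences between observed and predicted values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical observations
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = ols_fit(x, y)
best_sse = sum_squared_errors(x, y, intercept, slope)
print(round(intercept, 4), round(slope, 4), round(best_sse, 4))
```

Any other line, for example one with a slightly different slope, gives a larger sum of squared errors on the same data.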
There are various key metrics used for evaluation of a logistic regression model. Key metrics are as follows:
 | Predicted: Good | Predicted: Bad
--- | --- | ---
Actual: Good | True Positive | False Negative
Actual: Bad | False Positive | True Negative
Accordingly, accuracy, specificity, sensitivity parameters can be derived.
The area under the curve (AUC), also referred to as the index of accuracy (A) or concordance index, is a good summary performance metric for the ROC curve. The higher the area under the curve, the better the predictive power of the model.
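The metrics derived from the confusion matrix, and the concordance view of AUC, can be sketched as follows. The counts and scores are hypothetical, and the AUC here is computed as the share of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity and specificity from the four confusion matrix cells."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

def auc(positive_scores, negative_scores):
    """Concordance: fraction of positive/negative pairs ranked correctly."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
area = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(metrics, round(area, 4))
```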
In a nutshell, while handling missing values, we will have to understand data first and based on that, various mechanisms can be performed to treat them.
There is no specific rule for a particular scenario. It is data-driven and context specific.
For time series datasets, k-fold can be troublesome because there might be some pattern in year 4 or 5 which is not in year 3. Resampling the dataset will separate these trends, and we might end up validating on past years, which is incorrect. Instead, we can use a forward chaining strategy with 5-fold cross-validation, in which each fold trains on all years up to a given point and tests on the year that follows.
For this, the assumption is to have 6 years of historical data available.
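With 6 years of data, forward chaining produces exactly 5 folds. The sketch below generates them; the function name `forward_chaining_splits` and the use of year numbers 1 to 6 as labels are assumptions for illustration.

```python
def forward_chaining_splits(periods):
    """Return (training_periods, test_period) pairs where training always precedes testing."""
    return [(periods[:i], periods[i]) for i in range(1, len(periods))]

years = [1, 2, 3, 4, 5, 6]
splits = forward_chaining_splits(years)
for fold, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold}: training {train}, test [{test}]")
```

Each fold tests on a year that comes strictly after every training year, so the model is never validated on the past.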
Using one-hot encoding, the dimensionality (i.e. the number of features) of a dataset increases because a new variable is created for each level present in a categorical variable. For example, let’s say we have a variable ‘color’ with 3 levels, namely Red, Blue, and Green. One-hot encoding the ‘color’ variable will generate three new variables, Color.Red, Color.Blue and Color.Green, each containing 0 and 1 values. In label encoding, the levels of a categorical variable are encoded as integer codes (0 and 1 for a binary variable), so no new variable is created. Label encoding is mostly used for binary variables.
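Both encodings can be sketched without any library. The helper names and the alphabetical ordering of levels are assumptions for illustration; the ‘color’ values follow the example above.

```python
def one_hot_encode(values, prefix):
    """Create one 0/1 column per level of a categorical variable."""
    levels = sorted(set(values))
    return {f"{prefix}.{level}": [1 if v == level else 0 for v in values] for level in levels}

def label_encode(values):
    """Replace each level with an integer code; no new columns are created."""
    codes = {level: i for i, level in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

color = ["Red", "Blue", "Green", "Red"]
one_hot = one_hot_encode(color, prefix="Color")
labels = label_encode(color)
print(one_hot)
print(labels)
```

One column becomes three under one-hot encoding, while label encoding keeps a single column but introduces an artificial ordering among the levels.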
This is a scenario where the model overfits and we get perfect accuracy; in other words, the training error is zero or almost zero.
When we divide the dataset into training and test sets and build our model on the training set, our objective is to validate that model on a test set that it has not seen. If, based on the features it has learned from the training set, the model performs well on a new dataset with similar features, that shows the model generalizes with low error.
In this context, random forest, which can be used as a classification algorithm, has various hyperparameters that need to be chosen carefully when building the model. The number of trees is one of those parameters, and we need to ensure we reduce the number of trees in this case so that the model behaves appropriately and does not overfit. The number of trees can be tuned using a k-fold cross-validation approach, where k can be 5, 10 or any number of folds we wish to use.
No, classical regression techniques cannot be used here.
Since the number of variables is greater than the number of observations, it is a high-dimensional dataset, and ordinary least squares cannot be used because it no longer has a unique estimate and the standard deviation and variance of the estimates become infinite.
We will have to use regression techniques such as Lasso, Ridge, etc., which penalize the coefficients and reduce the variance and standard deviation. Subset regression and/or stepwise regression can also be explored with a forward step approach.
Both Random Forest (RF) and Gradient Boosting (GBM) are tree-based supervised machine learning algorithms, and both are ensemble methods.
RF uses decision trees, a fairly complex form of tree-based model which is inclined to overfit. GBM is instead a boosting-based approach, which is built on weak classifiers.
The accuracy of RF can be improved mainly by reducing variance. GBM has more hyperparameters to tune for accuracy and can be tuned to trade off bias against variance.
We can follow the below steps for variable selection. There could be other ways to accomplish this as well.
It is suggested that in the presence of a few variables with medium/large sized effects, lasso regression can be used. In the presence of many variables with small/medium sized effects, ridge regression can be preferred.
Conceptually, lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression (L2) only does parameter shrinkage and ends up including all the coefficients in the model. In the presence of correlated variables, ridge regression might be the preferred choice. Additionally, ridge regression works best in situations where the least squares estimates have high variance.
Therefore, it depends on our business goal and model objective as to what is the expectation.
Accordingly, decisions can be taken.
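The difference between shrinkage and selection is easiest to see in a simplified special case. The sketch below assumes an orthonormal design (the inputs are uncorrelated and scaled so that X'X is the identity) and a loss of half the residual sum of squares; under those assumptions the ridge solution with penalty (lam/2) * sum(b^2) divides each least squares coefficient by (1 + lam), and the lasso solution with penalty lam * sum(|b|) soft-thresholds it by lam. The coefficient values are hypothetical.

```python
import math

def ridge_shrink(beta_ols, lam):
    """Ridge under an orthonormal design: every coefficient is scaled towards zero."""
    return [b / (1.0 + lam) for b in beta_ols]

def lasso_shrink(beta_ols, lam):
    """Lasso under an orthonormal design: soft-thresholding sets small coefficients to exactly zero."""
    return [math.copysign(max(abs(b) - lam, 0.0), b) for b in beta_ols]

# Hypothetical least squares coefficients: two large effects and one small effect
beta_ols = [3.0, 0.4, -1.5]
ridge = ridge_shrink(beta_ols, lam=0.5)
lasso = lasso_shrink(beta_ols, lam=0.5)
print(ridge)
print(lasso)
```

Ridge keeps all three coefficients, only smaller, while lasso removes the small effect from the model entirely. With correlated inputs these closed forms no longer hold and the solutions have to be computed numerically.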
To check multicollinearity, we can create a correlation matrix to identify and remove variables having a correlation above 75% (bearing in mind that the choice of threshold is subjective). In addition, we can calculate the VIF (variance inflation factor) to check for the presence of multicollinearity.
VIF value <= 4 suggests no multicollinearity whereas a value of >= 10 implies serious multicollinearity.
Additionally, we can use tolerance as an indicator of multicollinearity.
However, removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression. Additionally, we can add some random noise in a correlated variable so that the variables become different from each other. But, adding noise might affect the prediction accuracy, hence this approach should be carefully used with some balancing effect.
Distance between two universities can be derived as follows
Now simple Euclidean distance can be derived as per below.
In order to get a standardized distance, we have to normalize it.
Hence standardized Euclidean distance between CalTech and Cornell are as follows:
{white} → {blue}
Support s = 4/10 = 0.4
Hence Support is 40%.
Support of a rule is defined as % (or number) of transactions in which antecedent (If) and consequent (Then) appear in the data.
{white} → {blue}
Confidence = 4 / 8
The confidence parameter is defined as the % of antecedent (If) transactions that also have the consequent (Then) itemset, which is the same as P(Consequent | Antecedent) = P(C & A) / P(A).
{white} → {blue}
Lift = 0.4 / (0.5 * 0.8) = 0.4 / 0.4 = 1
Lift = confidence / (benchmark confidence)
The benchmark assumes independence between antecedent and consequent:
P(Consequent & Antecedent) = P(C) * P(A)
Benchmark confidence = P(C | A) = P(C & A) / P(A) = P(C) * P(A) / P(A) = P(C)
Lift = Support(C U A) / [Support(C) * Support(A)]
Lift > 1 indicates a rule that is useful in finding consequent item sets (i.e. more useful than selecting transactions randomly)
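The three measures can be computed directly from a list of transactions. The 10 transactions below are hypothetical, constructed only so that the counts match the worked numbers above (white in 8 transactions, blue in 5, both in 4); the extra item "red" is a placeholder.

```python
# Hypothetical transactions consistent with the counts used above
transactions = (
    [{"white", "blue"}] * 4
    + [{"white"}] * 3
    + [{"white", "red"}]
    + [{"blue"}]
    + [{"red"}]
)

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by the benchmark confidence P(consequent)."""
    return confidence(antecedent, consequent) / support(consequent)

rule_support = support({"white", "blue"})
rule_confidence = confidence({"white"}, {"blue"})
rule_lift = lift({"white"}, {"blue"})
print(rule_support, rule_confidence, rule_lift)
```

A lift of 1 for {white} → {blue} means that knowing a transaction contains white tells us nothing extra about whether it contains blue.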
Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed. It is one of the most exciting technologies that one could ever come across. Machine Learning has become one of the most popular career choices today. According to a recent report from Gartner, Artificial Intelligence will create more than 2.3 million jobs by 2020.
A LinkedIn study suggests that there are currently 1,829 job openings for Machine Learning Engineering positions. Another study conducted by Analytics India Magazine reveals that there are more than 78,000 Data Science and Machine Learning jobs across India. The demand for Machine Learning is growing at a fast pace, and many factors are contributing to it. Most companies are investing in machine learning and are looking to hire more ML experts.
Jobs in machine learning are rapidly increasing due to the growth of the machine learning industry. A report from International Data Corporation estimates that spending on Machine Learning and Artificial Intelligence will increase from $12B in 2017 to $57.6B in 2021. Jobs in machine learning are highly paid: since the work is creative and unstructured, companies pay employees really well. A report from Glassdoor states that the average salary of machine learning engineers is between INR 4.5 lakhs and INR 7 lakhs for freshers, and it might reach up to INR 16 lakhs for experienced professionals.
If you’re looking for interview questions and answers on machine learning for experienced professionals and freshers, then you are at the right place. There are a lot of opportunities in many reputed companies across the globe. Good hands-on knowledge of the concepts will put you ahead in the interview. You can find job opportunities everywhere. Our Machine Learning interview questions are exclusively designed to help candidates clear interviews. We have tried to cover almost all the main topics related to Machine Learning.
Here, we have categorized the questions based on the level of expertise you’re looking for. Preparing for your interview with these interview questions on Machine Learning will give you an edge over other interviewees and will help you crack the Machine Learning interview easily. To get in-depth knowledge of Machine Learning you can also enroll in a Machine Learning course.
All the best!