Data Science Interview Questions [2025]

Introduction

Ready to face your next machine learning interview? This list of machine learning interview questions and answers, curated by industry experts, covers topics such as CRISP-DM, the difference between univariate and bivariate analysis, the chi-square test, Type 1 versus Type 2 errors, and the bias-variance trade-off. The questions are meant to help you prepare for roles such as machine learning engineer or data engineer.

Machine Learning Interview Questions and Answers

Intermediate

1. You are asked to evaluate a regression model using R square, Adjusted R square, and Tolerance. What criteria would you apply?

In a regression problem, we expect the fitted formula to explain the observed values of the target as well as possible. For linear regression, this means most data points should lie close to the fitted line.

R square is a measure of “goodness of fit”: it is the proportion of the variation in the target variable that is explained by the input variables. If R square is 0.75, then 75% of the variation in the target variable is explained by the input variables. Other things being equal, a higher R square indicates a better fit.

The problem arises when we add more input variables. R square never decreases when a variable is added, even if that variable has no real influence on the target, so a high R square can be misleading. This is where the Adjusted R square is used. It is a modified version of R square that penalizes the addition of input variables that do not improve the model's ability to explain the variation in the target.

So when we add input variables, we need to ensure that they influence the target variable; otherwise the gap between R square and Adjusted R square will widen. With a single input variable and a reasonably large sample, the two values are very close (the Adjusted R square is slightly lower). With multiple input variables, the Adjusted R square is the preferred measure of goodness of fit.

Tolerance is defined as 1/VIF, where VIF stands for Variance Inflation Factor. As the name suggests, VIF indicates how much the variance of a coefficient estimate is inflated by multicollinearity between the input variables. Based on VIF values, we can decide which variables to remove or keep without compromising the Adjusted R square. A low Tolerance (equivalently, a high VIF) flags a variable that is largely explained by the other inputs; a common rule of thumb treats a VIF above 5 or 10 as a concern.
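
The three quantities can be computed directly from their definitions. The sketch below is illustrative only: it uses NumPy least squares on synthetic data, and the variable names (x1, x2, x3) and the helper functions are assumptions made for this example, not part of the original answer.

```python
import numpy as np

def r_squared(y, y_hat):
    """Proportion of the variation in y explained by the predictions."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Penalize R square for the number of predictors p (n = sample size)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def fit_predict(X, y):
    """Ordinary least squares with an intercept; returns fitted values."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

def vif(X, j):
    """VIF of column j: regress it on the other columns, then 1 / (1 - R^2)."""
    others = np.delete(X, j, axis=1)
    r2_j = r_squared(X[:, j], fit_predict(others, X[:, j]))
    return 1.0 / (1.0 - r2_j)

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1 -> collinear
x3 = rng.normal(size=n)                  # unrelated to the target
y = 3.0 * x1 + rng.normal(size=n)

X = np.column_stack([x1, x2, x3])
r2 = r_squared(y, fit_predict(X, y))
adj_r2 = adjusted_r_squared(r2, n, X.shape[1])
vifs = [vif(X, j) for j in range(X.shape[1])]
tolerances = [1.0 / v for v in vifs]

print(f"R2 = {r2:.3f}, adjusted R2 = {adj_r2:.3f}")
for name, v, t in zip(["x1", "x2", "x3"], vifs, tolerances):
    print(f"{name}: VIF = {v:.1f}, Tolerance = {t:.3f}")
```

In this setup x1 and x2 should show a very high VIF (low Tolerance) because each is almost fully explained by the other, while x3 should have a VIF close to 1.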

2. What is the difference between Type 1 and Type 2 Error? Explain briefly.

A Type I error is committed when the null hypothesis is true and we reject it; it is also known as a ‘False Positive’. A Type II error is committed when the null hypothesis is false and we fail to reject it; it is also known as a ‘False Negative’.

In the context of the confusion matrix, a Type I error occurs when we classify a value as positive (1) when it is actually negative (0). A Type II error occurs when we classify a value as negative (0) when it is actually positive (1).

3. How is the logistic regression model evaluated? Explain at least 3 points.

Logistic Regression models can be evaluated as follows:

The first key metric is the AUC-ROC, the area under the ROC (Receiver Operating Characteristic) curve. The ROC curve plots the true positive rate against the false positive rate as the classification threshold is varied; each threshold yields one confusion matrix of actual versus predicted values. For an ideal model, the true positive rate is 1 and the false positive rate is 0, so the closer the curve bends toward the top-left corner (AUC close to 1), the better.
The second important metric is AIC, which stands for Akaike Information Criterion. It plays a role analogous to that of the Adjusted R square in linear regression: Adjusted R square penalizes input variables that are added without improving the explanation of the target, and AIC likewise penalizes model complexity. AIC is computed from the model's log-likelihood and its number of parameters (AIC = 2k - 2 ln L), not from the Adjusted R square itself. When comparing models fitted to the same data, the one with the lower AIC is preferred.
Null deviance and residual deviance are also important. Null deviance measures the fit of a model with only an intercept, while residual deviance measures the fit after the input variables are included. Lower values indicate a better fit, and a residual deviance well below the null deviance indicates that the inputs add explanatory value.
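
As a rough illustration of how deviance and AIC relate to the log-likelihood, the sketch below computes them for a binary outcome. The labels, the fitted probabilities in p_model, and the parameter count k are all hypothetical values chosen for this example; no model is actually fitted here.

```python
import numpy as np

def log_likelihood(y, p):
    """Bernoulli log-likelihood of labels y under predicted probabilities p."""
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Hypothetical labels and fitted probabilities (assumed for illustration).
y = np.array([0, 0, 1, 0, 1, 1, 0, 1])
p_model = np.array([0.1, 0.3, 0.8, 0.2, 0.7, 0.9, 0.4, 0.6])
k = 2  # assumed number of parameters: intercept + one coefficient

# Null model: every observation gets the overall positive rate.
p_null = np.full(len(y), y.mean())

# For ungrouped binary data the saturated log-likelihood is 0,
# so deviance = -2 * log-likelihood.
null_deviance = -2.0 * log_likelihood(y, p_null)
residual_deviance = -2.0 * log_likelihood(y, p_model)
aic = 2 * k - 2.0 * log_likelihood(y, p_model)

print(f"null deviance     = {null_deviance:.3f}")
print(f"residual deviance = {residual_deviance:.3f}")
print(f"AIC               = {aic:.3f}")
```

The drop from null deviance to residual deviance is what shows that the inputs carry information; AIC adds the 2k penalty on top of the residual deviance.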

4. There are multiple algorithms available in machine learning – supervised, unsupervised and other learning. How do you determine which one to use?

Machine learning is broadly divided into supervised learning, unsupervised learning, and other types such as semi-supervised learning and reinforcement learning.

Which algorithm to choose depends primarily on the type of data available and on what we are trying to accomplish.

If the target variable is continuous, we use regression algorithms (part of supervised learning), e.g. Simple Linear Regression, Multiple Linear Regression, etc.
If the target variable is categorical, we use classification algorithms (also part of supervised learning), e.g. Logistic Regression, Random Forest, Decision Trees, kNN, Neural Network, Support Vector Machine, Naive Bayes, etc.
If no target variable is available, we use unsupervised learning, such as Clustering, Association or Recommendation algorithms.

Other types of machine learning are used in different scenarios.

Generative, graph-based and heuristic approaches are part of semi-supervised learning, while reinforcement learning can be divided into active and passive categories.

At a high level, this is how the different machine learning methods and approaches map to different scenarios.

5. What is Bias-Variance trade-off? Explain.

Mathematically, the expected prediction error of a model at a point x can be broken down into three components:

Error(x) = Bias^2 + Variance + Irreducible Error

The bias and variance terms are under our control and need to be managed. Nothing can be done about the irreducible error, which comes from noise in the data itself.

Low bias - indicates that the model makes few assumptions about the form of the target function, so it can fit the training data closely. High bias indicates strong assumptions about the form of the target function; such a model tends to underfit, and its accuracy suffers on both training data and new data.
High variance - indicates large changes to the estimate of the target function when the training data changes; such a model tends to overfit and may not give the expected results on new data. Low variance indicates small changes to the estimate of the target function when the training data changes.

Flexible models usually have low bias but high variance, and simple models the reverse, so reducing one typically increases the other. To build an accurate model, we therefore need to strike a balance between bias and variance so that the total error, and with it the gap between actual and predicted outcomes, is minimized.
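
The decomposition can be estimated by simulation. The following is a minimal sketch under stated assumptions: the true function (a sine wave), the noise level, the fixed training grid, and the polynomial degrees are all chosen for illustration. Each repetition redraws only the noise, refits a polynomial, and records its predictions at fixed test points.

```python
import numpy as np

rng = np.random.default_rng(42)

def true_f(x):
    # Assumed ground-truth function for the simulation.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 30)   # fixed design: same inputs every repeat
x_test = np.linspace(0.05, 0.95, 25)
noise_sd = 0.3                        # irreducible error has variance noise_sd**2

def bias_variance(degree, n_repeats=300):
    """Estimate average bias^2 and variance of a polynomial fit of given degree."""
    preds = np.empty((n_repeats, len(x_test)))
    for r in range(n_repeats):
        y_train = true_f(x_train) + rng.normal(0.0, noise_sd, len(x_train))
        coefs = np.polyfit(x_train, y_train, degree)
        preds[r] = np.polyval(coefs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_f(x_test)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 7)}
for d, (b, v) in results.items():
    print(f"degree {d}: bias^2 = {b:.4f}, variance = {v:.4f}")
```

In this setup the straight line (degree 1) should show the largest bias and the smallest variance, and the degree-7 polynomial the reverse, which is the trade-off described above.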

7. What are the parameters to evaluate Logistic Regression? Explain briefly.

Several key metrics are used to evaluate a logistic regression model:

AUC-ROC curve - The first key metric is the area under the ROC (Receiver Operating Characteristic) curve. The ROC curve plots the true positive rate against the false positive rate as the classification threshold is varied; each threshold yields one confusion matrix of actual versus predicted values. For an ideal model, the true positive rate is 1 and the false positive rate is 0, so the closer the curve bends toward the top-left corner, the better.
AIC - The second important metric is AIC, which stands for Akaike Information Criterion. It is analogous to the Adjusted R square in linear regression in that it penalizes variables added to a model without adding value, but it is computed from the log-likelihood and the number of parameters (AIC = 2k - 2 ln L), not from the Adjusted R square. Among models fitted to the same data, a lower AIC is better.
Null and Residual Deviance - Null deviance measures the fit of an intercept-only model, and residual deviance measures the fit once the input variables are included. Lower values indicate a better fit, and a residual deviance well below the null deviance indicates that the inputs add explanatory value.

The confusion matrix compares actual and predicted classes:

	Predicted Good	Predicted Bad
Actual Good	True Positive	False Negative
Actual Bad	False Positive	True Negative

From the confusion matrix, accuracy, sensitivity and specificity can be derived.

The area under the curve (AUC), also referred to as an index of accuracy (A) or concordance index, is a single-number summary of the ROC curve. The higher the area under the curve, the better the predictive power of the model.
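
To make these metrics concrete, here is a small sketch that derives accuracy, sensitivity and specificity from a confusion matrix and computes AUC through its concordance interpretation (the fraction of positive/negative pairs in which the positive case receives the higher score). The labels, scores and the 0.5 threshold are made-up values for illustration, with 1 standing for the “Good” class.

```python
import numpy as np

# Hypothetical true labels (1 = Good, 0 = Bad) and model scores.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05])

def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fn, fp, tn

def auc_score(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

y_pred = (scores >= 0.5).astype(int)   # assumed threshold of 0.5
tp, fn, fp, tn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # 1 - false positive rate
auc_value = auc_score(y_true, scores)

print(tp, fn, fp, tn)                       # 3 1 1 5
print(accuracy, sensitivity, specificity)
print(auc_value)
```

The confusion-matrix metrics depend on the chosen threshold, whereas AUC summarizes performance across all thresholds.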

Advanced

1. What is CRISP-DM? Explain various stages

CRISP-DM stands for Cross Industry Standard Process for Data Mining. It is a methodology for data science programs. It has the following phases:

Business understanding – (Typical tasks are: Determine business objective, Assess Situation, Determine Data mining goals, project plan)
Data understanding – (Collect initial data, Describe data, Explore Data, Verify Data Quality)
Data preparation – (Select data, Clean data, Construct data, Integrate data, Format data)
Modelling or Model development – (Select Modelling techniques, Generate test design, Build model, Assess model)
Model evaluation – (Evaluate results, Review process, Determine next steps)
Deployment – (Plan deployment, Plan monitoring & maintenance, Produce final report & Review project)

Some phases are iterative in nature, and an end-to-end data science project or program typically follows this methodology.

Figure: CRISP-DM process diagram

2. What is the difference between univariate and bivariate analysis? Explain briefly.

In univariate analysis, variables are explored one by one. The method used depends on whether the variable is categorical or continuous.

For continuous variables, we need to understand the central tendency and spread of the variable: measures of central tendency (mean, median, mode); measures of dispersion (min, max, range, quartiles, IQR, variance, standard deviation); measures of shape (skewness, kurtosis); and visualization methods (histogram, boxplot). For categorical variables, frequency tables and bar charts are typically used.

Univariate analysis is also used to highlight missing and outlier values.

Bivariate analysis determines the relationship between two variables: whether and how strongly they are associated, assessed at a chosen significance level. Typically, bivariate analysis is performed for:

two categorical variables
categorical and continuous variables
two continuous variables

Different methods are needed for each of these scenarios. For two continuous variables, a scatter plot can be used irrespective of whether the relationship is linear or nonlinear. To measure how loosely or tightly the two variables are linearly related, the correlation coefficient can be computed; it ranges from -1 to 1. A value of 0 means there is no linear correlation between the two variables, -1 means a perfect negative correlation, and +1 means a perfect positive correlation.
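
The correlation coefficient referred to here is usually the Pearson coefficient. A minimal sketch with made-up data shows the three reference cases; the last case also shows that a correlation of 0 rules out only a linear relationship.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

x = np.arange(-5, 6)                 # -5, -4, ..., 5

r_pos = pearson_r(x, 2 * x + 1)      # perfect positive linear relationship
r_neg = pearson_r(x, -3 * x)         # perfect negative linear relationship
r_zero = pearson_r(x, x ** 2)        # strong but purely nonlinear relationship

print(r_pos, r_neg, r_zero)
```

Here y = x^2 is completely determined by x, yet its Pearson correlation with x is 0 because the relationship is symmetric rather than linear, which is why a scatter plot is still worth drawing.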

3. What is the chi-square test? When do we use this?

The chi-square test is used to assess whether the association between two categorical variables is statistically significant. It is based on the squared deviation between the observed and expected frequencies, divided by the expected frequency.

The test returns a p-value, which is interpreted as follows:

p-value close to 0: it indicates that the two categorical variables are dependent.
p-value close to 1: it is consistent with the two variables being independent.
p-value less than 0.05: it indicates that the relationship between the variables is significant at the 95% confidence level.

The chi-square test statistic for a test of independence of two categorical variables is:

χ² = Σ [(O - E)² / E]

where O is the observed frequency and E is the expected frequency under the null hypothesis.

We use this test between two categorical variables.
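
As a worked instance of the formula, the sketch below computes the statistic for a 2x2 contingency table. The observed counts are invented for illustration, and no continuity correction is applied. The expected count for each cell is (row total x column total) / grand total.

```python
import numpy as np

# Hypothetical observed counts: rows = groups, columns = outcomes.
observed = np.array([[30.0, 10.0],
                     [20.0, 40.0]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
expected = row_totals @ col_totals / observed.sum()

chi2 = float(np.sum((observed - expected) ** 2 / expected))
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Critical value of the chi-square distribution for 1 degree of freedom
# at the 0.05 significance level.
CRITICAL_05_DF1 = 3.841
reject_independence = chi2 > CRITICAL_05_DF1

print(expected)             # [[20. 20.] [30. 30.]]
print(round(chi2, 3), dof)  # 16.667 1
print(reject_independence)  # True
```

Because the statistic exceeds the critical value, the null hypothesis of independence would be rejected at the 5% level for this made-up table.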

4. What type of bivariate analysis will you perform if variables are categorical and continuous?

When one variable is categorical and the other is continuous, we compare the mean of the continuous variable across the groups defined by the categorical variable. The choice of test depends on the sample size and the number of groups.

When there are many samples (typically n >= 30), the z-test is used rather than the t-test. When the means of multiple groups are to be compared, ANOVA can be chosen.

When there are few samples (typically n < 30, where n is the number of observations) and the population variance is unknown, the t-test is used.

The two test statistics differ only subtly, in the standard deviation they use:

t = (x-bar - mu) / (sd / sqrt(n))

where x-bar = sample mean of x
mu = population mean
sd = standard deviation of the sample
n = number of observations (sample size)

z = (x-bar - mu) / (sigma / sqrt(n))

where x-bar = sample mean of x
mu = population mean
sigma = standard deviation of the population
n = number of observations (sample size)
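
The two formulas translate directly into code. In this sketch the sample, the hypothesized mean mu and the population standard deviation sigma are made-up values, and only the Python standard library is used.

```python
import math
import statistics

def t_statistic(sample, mu):
    """One-sample t statistic: uses the sample standard deviation (n - 1)."""
    n = len(sample)
    return (statistics.mean(sample) - mu) / (statistics.stdev(sample) / math.sqrt(n))

def z_statistic(sample, mu, sigma):
    """One-sample z statistic: uses a known population standard deviation."""
    n = len(sample)
    return (statistics.mean(sample) - mu) / (sigma / math.sqrt(n))

sample = [12, 15, 11, 14, 13]   # hypothetical observations, mean = 13
mu = 12                         # hypothesized population mean
sigma = 2.0                     # assumed known population standard deviation

t_value = t_statistic(sample, mu)
z_value = z_statistic(sample, mu, sigma)

print(round(t_value, 4))  # 1.4142
print(round(z_value, 4))  # 1.118
```

The resulting statistic would then be compared against the t distribution with n - 1 degrees of freedom or the standard normal distribution, respectively.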

ANOVA stands for analysis of variance. For example, consider the three groups below.

Class 1	Class 2	Class 3
8	9	3
6	2	4
5	6	3
8	2	5
6	7	4
10	5	4
6	2	6
3	8	4
5	4	5
7	9	3

Figure ANOVA

In “Figure ANOVA” above, ANOVA is appropriate because there are more than two sample groups (three in this case). Each class could have many rows; only 10 are shown for simplicity. The summary statistics for each group are:

Class Group	Count	Sum	Average	Variance
Class 1	10	64	6.4	3.82
Class 2	10	54	5.4	8.04
Class 3	10	41	4.1	0.99
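
Starting from the three classes in “Figure ANOVA”, the one-way ANOVA F statistic can be computed by splitting the total variation into a between-group and a within-group part. This is a sketch of the standard calculation in NumPy; it is an illustration added here, not a step given in the original answer.

```python
import numpy as np

# Data from "Figure ANOVA".
groups = {
    "Class 1": np.array([8, 6, 5, 8, 6, 10, 6, 3, 5, 7], dtype=float),
    "Class 2": np.array([9, 2, 6, 2, 7, 5, 2, 8, 4, 9], dtype=float),
    "Class 3": np.array([3, 4, 3, 5, 4, 4, 6, 4, 5, 3], dtype=float),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
# Within-group sum of squares: spread of values around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)

for name, g in groups.items():
    print(name, len(g), g.sum(), round(g.mean(), 2), round(g.var(ddof=1), 2))
print(round(ss_between, 2), round(ss_within, 2), round(f_stat, 3))  # 26.6 115.7 3.104
```

The per-group lines reproduce the Count, Sum, Average and Variance columns of the summary table, and the F statistic would be compared against the F distribution with (2, 27) degrees of freedom.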

5. Why is missing value treatment required?

Missing data in the training data set can reduce the power or fit of a model, or can lead to a biased model, because the behavior of a variable and its relationship with other variables are not analyzed correctly. This can lead to incorrect prediction or classification. Below is a simple example to illustrate this.

Name	Weight	Gender	Play Golf or Not
AA	55	M	Yes
BB	62	F	Yes
CC	58	F	No
DD	54		No
EE	54	M	No
FF	66	F	Yes
GG	56		Yes
HH	56	M	Yes

Figure 1

Gender	# Count	# Play Golf	% Play Golf
F	3	2	66.67%
M	3	2	66.67%
Missing/Blank	2	1	50%

Figure 2

Note the missing Gender values in Figure 1: they have not been treated for the analysis in Figure 2. The inference from this data set is that females and males have the same chance of playing golf.

On the other hand, Figure 4 summarizes the data in Figure 3, in which the missing Gender values have been treated. Here we can see that females have a higher chance of playing golf than males.

Name	Weight	Gender	Play Golf or Not
AA	55	M	Yes
BB	62	F	Yes
CC	58	F	No
DD	54	M	No
EE	54	M	No
FF	66	F	Yes
GG	56	M	Yes
HH	56	M	Yes

Figure 3

Gender	# Count	# Play Golf	% Play Golf
F	3	2	66.67%
M	5	3	60%

Figure 4
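
The summaries in Figure 2 and Figure 4 can be reproduced with a few lines of code. The sketch below hard-codes the rows of Figure 1 (with None for the missing Gender values) and derives the Figure 3 rows by filling those two gaps with "M", as the treated table does; the function name is an assumption for this example.

```python
from collections import defaultdict

# Rows from Figure 1; None marks a missing Gender value.
figure_1 = [
    ("AA", 55, "M", "Yes"), ("BB", 62, "F", "Yes"), ("CC", 58, "F", "No"),
    ("DD", 54, None, "No"), ("EE", 54, "M", "No"), ("FF", 66, "F", "Yes"),
    ("GG", 56, None, "Yes"), ("HH", 56, "M", "Yes"),
]
# Rows from Figure 3: the same data with the two missing values set to "M".
figure_3 = [(n, w, g or "M", p) for n, w, g, p in figure_1]

def golf_rate_by_gender(rows):
    """Return {gender: (count, players, percent playing golf)}."""
    counts = defaultdict(lambda: [0, 0])   # gender -> [count, players]
    for _name, _weight, gender, plays in rows:
        key = gender if gender is not None else "Missing"
        counts[key][0] += 1
        counts[key][1] += int(plays == "Yes")
    return {g: (n, k, round(100 * k / n, 2)) for g, (n, k) in counts.items()}

before = golf_rate_by_gender(figure_1)   # matches Figure 2
after = golf_rate_by_gender(figure_3)    # matches Figure 4

print(before)
print(after)
```

Before treatment both genders show 66.67%; after treatment the male rate drops to 60%, which is what changes the inference.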

6. Missing values in data can cause issues and there are different strategies to handle missing values. What are the different types of missing values at the time of data collection? Explain.

Below are the different types of missing values that can occur during the data collection process.

Missing completely at random - The probability of a value being missing is the same for all observations. For example, students decide whether to declare their preference about going to a cultural festival by tossing a fair coin: if heads occurs, they declare whether or not they will go; if tails occurs, they do not declare it. Each observation then has an equal chance of having a missing value.
Missing at random - Unlike the first case, the value is missing at random but the missing ratio differs for different values of the other input variables. For example, we collect demographic information about people in a locality (age, sex, locality type such as busy / very busy / moderately busy, etc.), and females have a higher rate of missing values for some parameters than males.
Missing values that depend on unobserved predictors - Here the values are not missing at random; the missingness is related to an input variable that has not been observed. For example, suppose a mathematics examination is expected to be very difficult, so fewer students turn up for it. Out of 100 students, 30 do not appear because of the “complexity level” of the examination. These missing values are not random: they are caused by the “complexity level”, and they remain unexplained unless this parameter is recorded and taken into account as a cause.
Missing values that depend on the missing value itself - The probability of a value being missing is correlated with the value itself. For example, students with very high or very low marks in a graded exam in one subject may be more or less likely to appear in a competitive exam for the same subject.

8. What is kNN imputation and what are its pros & cons?

kNN imputation is one of the methods for treating missing values, alongside direct deletion and imputation with a mean/median/mode value. In kNN imputation, the missing values of an attribute are imputed from the k observations that are most similar to the observation whose value is missing. The similarity of two observations is determined using a distance function computed over the attributes they share. The pros and cons are described below.

Pros	Cons
It can predict both qualitative and quantitative attributes.	It is very time-consuming on large databases, because it searches through the whole dataset for the most similar instances.
Creation of a predictive model for each attribute with missing data is not required.	The choice of k is critical.
Attributes with multiple missing values can be easily treated.	A higher value of k would include instances that are significantly different from what we need.
The correlation structure of the data is taken into consideration.	A lower value of k implies missing out on significant instances.
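
A minimal sketch of the idea for numeric data is shown below. It makes several simplifying assumptions that are not part of the original answer: only complete rows are used as neighbours, distance is plain Euclidean distance over the columns observed in the incomplete row, features are not rescaled, and the imputed value is the unweighted mean of the k neighbours. The data matrix is invented for illustration.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill NaNs in each row with the mean of its k nearest complete rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    # Candidate neighbours: rows without any missing value (simplification).
    donors = X[~np.isnan(X).any(axis=1)]
    for i, row in enumerate(X):
        missing = np.isnan(row)
        if not missing.any():
            continue
        # Euclidean distance using only the columns observed in this row.
        diffs = donors[:, ~missing] - row[~missing]
        dists = np.sqrt((diffs ** 2).sum(axis=1))
        nearest = donors[np.argsort(dists)[:k]]
        filled[i, missing] = nearest[:, missing].mean(axis=0)
    return filled

# Hypothetical data: two tight groups and one row with a missing third value.
X = [[1.0, 2.0, 10.0],
     [1.1, 2.1, 11.0],
     [5.0, 6.0, 50.0],
     [5.2, 6.1, 52.0],
     [1.05, 2.05, np.nan]]

filled = knn_impute(X, k=2)
print(filled[4])   # the missing value is imputed as 10.5
```

The last row sits next to the first two rows, so its missing value is imputed as the mean of 10.0 and 11.0. The nested distance computation over all donor rows also illustrates why the method becomes slow on large datasets.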

9. What impact do outliers have on a dataset? Explain with an example.

Outliers can have a significant impact on the results of data analysis and statistical modeling:

If the outliers are non-randomly distributed, they can decrease normality.
They increase the error variance, which can lead to an incorrect estimate for the overall population.
The power of statistical tests is reduced because of the impact on the standard deviation.
The basic assumptions of ANOVA and other statistical models can be violated.

Here is an example with a sample dataset.

Statistic	Without Outlier	With Outlier
Dataset	1,1,2,2,2,2,3,3,3,4,4	1,1,2,2,2,2,3,3,3,4,4,200
Mean	2.45	18.91
Median	2.00	2.50
Mode	2.00	2.00
Standard deviation	1.035	57.03

As shown above, including a single outlier changes the mean and the standard deviation dramatically, while the median and mode are barely affected.
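
The figures in this example can be checked with Python's standard statistics module. The sketch below assumes the sample standard deviation (dividing by n - 1), which is what the values in the example correspond to.

```python
import statistics

without_outlier = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4]
with_outlier = without_outlier + [200]

def summarize(data):
    """Mean, median, mode and sample standard deviation of a list."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "stdev": statistics.stdev(data),   # sample standard deviation (n - 1)
    }

summary_without = summarize(without_outlier)
summary_with = summarize(with_outlier)

for label, s in (("without", summary_without), ("with", summary_with)):
    print(label, {name: round(value, 3) for name, value in s.items()})
```

The mean moves from about 2.45 to about 18.9 and the standard deviation from about 1.04 to about 57.0, while the median only moves from 2 to 2.5 and the mode stays at 2.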

20. What is the difference between kNN and k means clustering?

The differences are summarized in the table below.

kNN	k-means clustering
This is a supervised machine learning algorithm.	This is an unsupervised machine learning algorithm.
It is used for classification and regression problems.	As the name suggests, it is a clustering algorithm.
It predicts from feature similarity to labelled training points.	It divides a set of data points into clusters.
k is the number of nearest neighbours consulted for a prediction.	k is the number of clusters, e.g. k=3, or it can be determined from an elbow diagram.

For example, consider a dataset of football players with their positions and measurements. We want to assign positions to players in a new, unseen dataset for which measurements are available but positions are not. Since the earlier training data has known positions, we can use kNN. Now consider another scenario in which the football players are to be grouped based on some similarity between them, with no predefined labels. In this case, k-means could be used. Which of the two to use therefore depends on the problem we are trying to solve.
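
The football example can be sketched in a few lines of NumPy. Everything below is hypothetical: the measurements (height in cm, weight in kg), the position labels, and the deliberately naive k-means initialisation that simply takes the first k rows as starting centroids.

```python
from collections import Counter

import numpy as np

# Hypothetical player measurements: (height cm, weight kg).
X_train = np.array([[190, 88], [188, 85], [192, 90],
                    [172, 68], [170, 66], [175, 70]], dtype=float)
y_train = ["goalkeeper"] * 3 + ["winger"] * 3   # labels exist only for kNN

def knn_predict(x, X, y, k=3):
    """Supervised: majority label among the k nearest labelled players."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

def k_means(X, k=2, n_iter=10):
    """Unsupervised: alternate between assigning points and moving centroids."""
    centroids = X[:k].copy()   # naive initialisation (assumption)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):   # keep the old centroid if a cluster is empty
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

new_player = np.array([189.0, 87.0])
predicted_position = knn_predict(new_player, X_train, y_train, k=3)

cluster_labels, centroids = k_means(X_train, k=2)

print(predicted_position)        # goalkeeper
print(cluster_labels.tolist())   # [0, 0, 0, 1, 1, 1]
```

kNN needs the labels in y_train to make a prediction, whereas k-means never sees them and only returns group numbers, which is the supervised versus unsupervised distinction in the table.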

Description

Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed. It is one of the most exciting technologies one could come across, and it has become one of the most popular career choices today. According to a report from Gartner, Artificial Intelligence was expected to create more than 2.3 million jobs by 2020.

A LinkedIn study suggests that there are currently 1,829 job openings for Machine Learning Engineering positions. Another study, conducted by Analytics India Magazine, reveals that there are more than 78,000 Data Science and Machine Learning jobs across India. The demand for Machine Learning skills is growing quickly: most companies are investing in machine learning and are looking to hire more ML experts.

Jobs in machine learning are increasing rapidly as the machine learning industry grows. A report from International Data Corporation estimates that spending on Machine Learning and Artificial Intelligence will increase from $12B in 2017 to $57.6B in 2021. Jobs in machine learning are highly paid: because the work is creative and unstructured, companies pay employees well. According to Glassdoor, the average salary of a machine learning engineer is between INR 4.5 lakhs and INR 7 lakhs for freshers, and it may reach up to INR 16 lakhs for experienced professionals.

If you’re looking for machine learning interview questions and answers for experienced professionals and freshers, you are at the right place. There are many opportunities at reputed companies across the globe, and good hands-on knowledge of the concepts will put you ahead in the interview. Our Machine Learning interview questions are designed to help candidates clear interviews, and we have tried to cover almost all the main topics related to Machine Learning.

Here, we have categorized the questions by level of expertise. Preparing with these interview questions on Machine Learning will give you an edge over other interviewees and help you crack the interview. To gain in-depth knowledge of Machine Learning, you can also enroll in a Machine Learning course.

All the best!
