
An A/B test is a randomized experiment in which "A" and "B" refer to two variants, run in order to determine which variant is more effective. A/B testing is a widely used method for finding the best online promotional and marketing strategies for a business: it can be used to test everything from website copy to sales emails to search ads, and the advantages it provides are usually enough to offset the additional time it takes.
One big caveat for A/B testing is to beware of results based on small sample sizes. Choosing a sample size for an A/B test is trickier than most people think (or would hope), and it is only one piece of a larger puzzle related to statistical confidence, which requires both the necessary number of samples and enough time for the experiment to play out. Proper experiment design takes into account the number of samples and conversions required for the desired statistical confidence, and lets the experiment run fully rather than pulling the plug early because there appears to be a winner.
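As a rough sketch of both ingredients (the library choice, conversion rates, and visitor counts below are illustrative assumptions, not figures from the text), a required sample size and a two-proportion z-test could be computed with statsmodels:

```python
# Minimal A/B-test sketch: required sample size, then a two-proportion z-test.
from statsmodels.stats.proportion import proportions_ztest, proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# How many visitors per variant to detect a lift from a 10% to a 12% conversion
# rate at a 5% significance level and 80% power?
effect = proportion_effectsize(0.10, 0.12)
n_required = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"Visitors needed per variant: {n_required:.0f}")

# Once the experiment has run its full course, compare the two variants.
conversions = [310, 370]   # conversions observed in A and B (made-up numbers)
visitors = [3000, 3000]    # visitors exposed to A and B
stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {stat:.3f}, p-value = {p_value:.4f}")
```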
A categorical variable (sometimes called a nominal variable) is one that has two or more categories with no intrinsic ordering. For example, gender is a categorical variable with, say, two categories (male and female), and there is no intrinsic ordering to the categories. Hair colour is also a categorical variable with a number of categories (blonde, brown, brunette, red, etc.), and again there is no agreed way to order these from highest to lowest. A purely categorical variable simply lets you assign categories, but you cannot meaningfully order them. If the variable does have a clear ordering, it is an ordinal variable.
Why does it matter if a variable is categorical, ordinal or interval?
Statistical computations and analyses assume that variables have specific levels of measurement. For example, it would not make sense to compute an average hair colour: an average of a categorical variable is meaningless because there is no intrinsic ordering of its categories. Likewise, if you tried to compute the average of an ordinal variable such as educational experience (e.g. primary, secondary, bachelor's, graduate), you would obtain a questionable result, because the spacing between the levels is uneven. In short, an average requires a variable to be interval. Sometimes a variable sits "in between" ordinal and interval, for example a five-point Likert scale with values "strongly agree", "agree", "neutral", "disagree" and "strongly disagree". If we cannot be sure that the intervals between these five values are the same, we should treat it as an ordinal variable; however, in order to use statistics that assume an interval variable, it is common to assume the intervals are equally spaced.
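A minimal pandas sketch of the three measurement levels, using made-up values for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "hair_colour": ["blonde", "brown", "red", "brown"],         # categorical (nominal)
    "satisfaction": ["disagree", "neutral", "agree", "agree"],  # ordinal (Likert)
    "temperature_c": [21.5, 19.0, 23.2, 20.1],                  # interval
})

# Nominal: categories with no order -- only counts and modes make sense.
print(df["hair_colour"].value_counts())

# Ordinal: declare an explicit order so comparisons and sorting respect it.
likert = pd.CategoricalDtype(
    categories=["strongly disagree", "disagree", "neutral", "agree", "strongly agree"],
    ordered=True,
)
df["satisfaction"] = df["satisfaction"].astype(likert)
print(df["satisfaction"].min())   # ordering is defined, but an average is still dubious

# Interval: equal spacing between values, so a mean is meaningful.
print(df["temperature_c"].mean())
```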
Machine learning arises from this question: could a computer go beyond "what we know how to order it to perform" and learn on its own how to perform a specified task? Could a computer learn to do things the way a human being does? Rather than programmers crafting data-processing rules by hand, could a computer automatically learn these rules by looking at data?
“A machine-learning system is trained rather than explicitly programmed. It’s presented with many examples relevant to the task, and it finds statistical structure in these examples that eventually allows the system to come up with rules for automating the task. For instance, if you wished to automate the task of tagging your vacation pictures, you could present a machine-learning system with many examples of pictures already tagged by humans, and the system would learn statistical rules for associating specific pictures to specific tags.”
(Please refer to the book "Deep Learning with Python" by Francois Chollet.)
Gradient descent is one of the most popular algorithms for performing optimization and is widely used to optimize neural networks; every state-of-the-art deep learning library contains implementations of various algorithms built on it (see, for example, lasagne's, caffe's, and keras' documentation). Gradient descent minimizes an objective function J(θ), parameterized by a model's parameters θ ∈ R^d, by updating the parameters in the direction opposite to the gradient of the objective function ∇_θ J(θ) w.r.t. the parameters. The learning rate η determines the size of the steps we take to reach a (local) minimum. In other words, we follow the slope of the surface created by the objective function downhill until we reach a valley.
Batch gradient descent computes the gradient of the cost function w.r.t. the parameters θ for the entire training dataset:
θ = θ − η · ∇_θ J(θ)
As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that don't fit in memory. Batch gradient descent also doesn't allow us to update our model online, i.e. with new examples on-the-fly.
Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example x^(i) and label y^(i):
θ = θ − η · ∇_θ J(θ; x^(i); y^(i))
Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online.
Mini-batch gradient descent takes the best of both worlds and performs an update for every mini-batch of n training examples:
θ = θ − η · ∇_θ J(θ; x^(i:i+n); y^(i:i+n))
This way, it (a) reduces the variance of the parameter updates, which can lead to more stable convergence; and (b) can make effective use of the highly optimized matrix operations common to state-of-the-art deep learning libraries, which make computing the gradient w.r.t. a mini-batch very efficient. Common mini-batch sizes range between 50 and 256, but can vary for different applications. Mini-batch gradient descent is typically the algorithm of choice when training a neural network, and the term SGD is usually employed even when mini-batches are used.
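A bare-bones NumPy sketch of the mini-batch update rule above, fitted to a synthetic linear-regression problem (the learning rate, batch size, and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=1000)

theta = np.zeros(3)   # parameters θ to learn
eta = 0.05            # learning rate η
batch_size = 64

for epoch in range(50):
    idx = rng.permutation(len(X))                 # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        # Gradient of the mean-squared-error objective on this mini-batch
        grad = 2.0 / len(Xb) * Xb.T @ (Xb @ theta - yb)
        theta -= eta * grad                       # θ = θ − η · ∇_θ J(θ; x^(i:i+n); y^(i:i+n))

print("estimated theta:", theta)   # should be close to [2.0, -1.0, 0.5]
```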
Don't be surprised if this question pops up as one of the top interview questions for data science in your next interview.
A p-value, in the parlance of statistics, can be defined as the lowest significance level at which the null hypothesis can be rejected. For a test statistic such as the t-statistic, p ≤ 0.05 indicates that the null hypothesis can be rejected in favour of the alternative hypothesis at the 5% level of significance, while p > 0.05 indicates that the evidence against the null hypothesis is not strong enough to reject it at that level.
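A hedged illustration with SciPy, using synthetic data rather than anything from the text: a one-sample t-test and how its p-value is read against the 5% significance level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.3, scale=1.0, size=50)   # true mean is 0.3, not 0

# H0: population mean == 0   vs   H1: population mean != 0
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")

if p_value <= 0.05:
    print("Reject H0 at the 5% significance level.")
else:
    print("Fail to reject H0 at the 5% significance level.")
```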
Response:
CRISP-DM stands for "Cross Industry Standard Process for Data Mining". It is a standard methodology for end-to-end data science project or program execution. It defines six phases (Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment), each involving different types of activities or tasks carried out during the engagement.
These phases are iterative rather than strictly sequential: findings from later phases routinely send the team back to earlier ones.

Response:
The process of adding a tuning parameter to a model or algorithm to induce smoothness and prevent overfitting is called regularization. A regularization term is added to the objective function to keep the coefficients from fitting the training data perfectly, thereby reducing the risk of overfitting.
This is primarily done by adding a penalty on the weight vector, scaled by a constant multiple. The penalty is most often the L1 norm (Lasso) or the L2 norm (Ridge), although in principle any norm can be used. The model is then trained to minimize the mean of the loss or error function plus this regularization term over the training set.
L1 or Lasso regularization helps perform feature selection in sparse feature spaces, and that is a good practical reason to use L1 in some situations. However, beyond that particular reason, L1 may not perform better than L2 in practice. Even in a situation where you might benefit from L1's sparsity to do feature selection, using L2 on the remaining variables is likely to give better results than L1 by itself.
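A short scikit-learn sketch of the contrast (the synthetic dataset and alpha values are illustrative assumptions): L1 zeroes out uninformative coefficients, while L2 only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # alpha controls the regularization strength
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 tends to drive uninformative coefficients exactly to zero (feature selection);
# L2 shrinks them smoothly towards zero without eliminating them.
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))
```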
Response:
There are multiple ways to make a model more robust to outliers, both from a data-preparation perspective and from a model-building perspective.
An outlier is usually assumed to be an unwanted, unexpected, or must-be-incorrect value given current knowledge (e.g. no one can live longer than 150 years), rather than a rare but possible event. Outliers are usually defined relative to the sample distribution, so they can be removed in the pre-processing step (before any learning happens). For roughly normal data, a standard-deviation rule such as flagging points outside mean ± 2·sd can be used. For non-normal or unknown distributions, fences based on the interquartile range work better: with Q1 the "middle" value of the lower half of the rank-ordered data and Q3 the "middle" value of the upper half, values far outside the Q1 to Q3 range (for example beyond Q1 − 1.5·IQR or Q3 + 1.5·IQR) are treated as outliers.
[Figure: sample data with typical outliers encircled in red, for illustration.]

Additionally, a data transformation (e.g. a log transformation) may help if the data have a noticeable tail. When outliers are related to the sensitivity of the collecting instrument, which may not precisely record extreme values, winsorization may be useful. Winsorizing, or winsorization, is the transformation of the data by limiting extreme values in order to reduce the effect of possibly spurious outliers.
This type of transformation has the same effect as clipping a signal: extreme data values are replaced with less extreme ones. Another option for reducing the influence of outliers is to use mean absolute error rather than mean squared error as the loss.
For model building, some approaches are naturally resistant to outliers, for example tree-based models or non-parametric tests. Tree models typically split each node into two parts, which behaves much like using a median; at each split, all data points in a bucket are treated equally regardless of how extreme their values may be.
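The snippet below sketches two of the pre-processing options mentioned above, IQR-based fences and winsorization, on synthetic data with a couple of injected outliers (the 1.5·IQR multiplier and 1% clipping limits are conventional but illustrative choices):

```python
import numpy as np
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(50, 5, size=200), [150.0, 160.0]])  # two injected outliers

# Interquartile-range fences (no normality assumption needed)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
filtered = data[(data >= lower) & (data <= upper)]
print("removed:", len(data) - len(filtered), "points")

# Winsorization: clip the most extreme 1% at each tail instead of dropping them
clipped = winsorize(data, limits=(0.01, 0.01))
print("max before:", data.max(), "max after:", float(clipped.max()))
```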
This is a common yet one of the most important data science interview questions and answers for experienced professionals, don't miss this one.
Response:
There are multiple ways to deal with missing values in a dataset, depending on the nature of the missing data. Some of the key methods are: dropping rows or columns with too many missing values; imputing with a simple statistic such as the mean, median, or mode; forward- or backward-filling for time-series data; model-based imputation (e.g. KNN or iterative, MICE-style imputation); and using algorithms that can handle missing values natively. The best choice depends on why the values are missing and how much data would otherwise be lost.
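A toy sketch of a few of these strategies with pandas and scikit-learn (the small DataFrame is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 35, np.nan],
                   "income": [50_000, 62_000, np.nan, 58_000, 61_000]})

# 1) Drop rows with missing values (only sensible when few values are missing)
dropped = df.dropna()

# 2) Simple statistical imputation: fill with the column median
median_filled = df.fillna(df.median(numeric_only=True))

# 3) The same idea wrapped for ML pipelines via scikit-learn
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed)
```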
Response:
In data mining, anomaly detection refers to the identification of items or events that do not conform to an expected pattern or to the other items present in the dataset, i.e. uncommon behaviour or patterns in the data.
Broadly, anomalies fall into three categories.
A single instance of data is considered anomalous if it is too far off from the rest. A typical business use case is detecting credit card fraud based on "amount spent". This is a point anomaly.
When the abnormality is context-specific, it is tagged as a contextual anomaly. This type of anomaly is quite common in time-series forecasting datasets. For example, spending 100 USD on food every day during the holiday season is normal, but may be odd otherwise; a spike in sales during Thanksgiving or the Christmas vacation may be genuine and expected, whereas the same surge in a non-festive season could be anomalous.
When a set of data instances collectively helps in detecting anomalies, it is categorized as a collective anomaly. A typical business use case is someone unexpectedly performing a financial transaction from a remote machine, accessing a source or host he or she is not authorized to use, an anomaly that would be flagged as a potential fraud attack.
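As one hedged, minimal example of point-anomaly detection on a single "amount spent" feature, an IsolationForest from scikit-learn could be used; the data and contamination rate below are illustrative assumptions, not values from the text:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
# 500 ordinary transactions around 100, plus two suspiciously large ones
amounts = np.concatenate([rng.normal(100, 20, size=500), [900.0, 1200.0]]).reshape(-1, 1)

model = IsolationForest(contamination=0.01, random_state=7).fit(amounts)
labels = model.predict(amounts)            # -1 flags an anomaly, 1 means normal

print("flagged amounts:", amounts[labels == -1].ravel())
```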
Wikipedia defines word embedding as the collective name for a set of language modelling and feature learning techniques in natural language processing (NLP) in which words or phrases from the vocabulary are mapped to vectors of real numbers. Word embeddings are a way to transform words in text into numerical vectors so that they can be analysed by standard machine learning algorithms that require numerical vectors as input.
Vectorisation can be done in many ways: one-hot encoding, Latent Semantic Analysis (LSA), TF-IDF (term frequency-inverse document frequency), and so on. However, these representations capture a somewhat different, document-centric idea of semantic similarity.
Distributed Representation:
Distributed representations attempt to capture the meaning of a word by considering its relations with other words in its context. The idea is captured in a quote from J. R. Firth, the linguist who first proposed it: "You shall know a word by the company it keeps" (for more information, refer to the article "Document Embedding with Paragraph Vectors" by Andrew M. Dai, Christopher Olah, and Quoc V. Le, arXiv:1507.07998, 2015).
Consider the following pair of sentences:
Paris is the capital of France. Berlin is the capital of Germany.
Even assuming you have no knowledge of world geography (or English, for that matter), you would still conclude without much effort that the word pairs (Paris, Berlin) and (France, Germany) are related in some way, and that corresponding words in each pair are related to each other in the same way, that is:
Paris : France :: Berlin : Germany
Thus, the aim of distributed representations is to find a general transformation function φ that converts each word to its associated vector such that relations of the following form hold true: φ("Paris") − φ("France") ≈ φ("Berlin") − φ("Germany").
Word2vec:
The word2vec group of models was created in 2013 by a team of researchers at Google led by Tomas Mikolov. The models are essentially unsupervised, taking a large corpus of text as input and producing a vector space of words. The dimensionality of the word2vec embedding space is usually much lower than that of the one-hot embedding space, whose dimensionality equals the size of the vocabulary, and the word2vec embedding is dense rather than sparse.
The two architectures for word2vec are Continuous Bag of Words (CBOW) and skip-gram.
In the CBOW architecture, the model predicts the current word given a window of surrounding words, and the order of the context words does not influence the prediction (the bag-of-words assumption). In the skip-gram architecture, the model predicts the surrounding words given the centre word. According to the authors, CBOW is faster, but skip-gram does a better job of predicting infrequent words.
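A minimal gensim sketch (assuming the gensim 4.x API; the tiny corpus and hyperparameters are purely illustrative) showing how a skip-gram model is trained and queried:

```python
from gensim.models import Word2Vec

corpus = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["berlin", "is", "the", "capital", "of", "germany"],
    ["madrid", "is", "the", "capital", "of", "spain"],
]

# sg=0 selects CBOW, sg=1 selects skip-gram
model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, sg=1, epochs=200)

print(model.wv["paris"].shape)                 # dense 50-dimensional vector
print(model.wv.most_similar("paris", topn=3))  # nearest neighbours in embedding space
```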
For distance-based methods such as KNN (k-nearest neighbours), the performance or predictive power of the model deteriorates as the number of features used for prediction grows. High-dimensional spaces are vast, and points in them tend to be far more dispersed than points in low-dimensional spaces.
As dimensions are added, the distances between points grow and also become more and more alike, which hints that we need an exponential increase in the number of data points as dimensionality increases in order for machine learning algorithms to keep working well.
It can be shown empirically that mean pairwise distance keeps increasing with dimension. Hence the higher the dimensionality, the more data is needed to overcome the curse of dimensionality!
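A quick empirical sketch of this effect: sampling random points in the unit hypercube and watching the mean pairwise distance grow as the dimension increases (the point counts and dimensions are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 100
for d in (2, 10, 100, 1000):
    points = rng.uniform(size=(n_points, d))        # random points in the unit hypercube
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    mean_dist = dists[np.triu_indices(n_points, k=1)].mean()
    print(f"d={d:5d}  mean pairwise distance ~ {mean_dist:.2f}")
```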
The Box-Cox transform belongs to the power transform family of functions. These functions are primarily used to create monotonic data transformations, and their main value is that they help stabilize variance and bring the data closer to a normal distribution, making its spread less dependent on the mean. The function has one prerequisite: the numeric values to be transformed must be positive (similar to what a log transform expects); if they are negative, shifting them by a constant helps. Mathematically, the Box-Cox transform can be defined as:
y = (x^λ − 1) / λ for λ ≠ 0, and y = ln(x) for λ = 0,
so that the transformed output y is a function of the input x and a transformation parameter λ; when λ = 0, the transform reduces to the natural log transform discussed earlier. The optimal value of λ is usually determined using maximum likelihood (or log-likelihood) estimation.
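A small SciPy example on synthetic right-skewed data: scipy.stats.boxcox picks λ by maximum likelihood and returns the transformed values (the lognormal sample below is an illustrative assumption).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=0.8, size=1000)   # right-skewed, strictly positive values

y, lam = stats.boxcox(x)            # lambda is chosen by maximum likelihood
print(f"optimal lambda ~ {lam:.3f}")
print(f"skewness before: {stats.skew(x):.2f}, after: {stats.skew(y):.2f}")
```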
This is one of the most frequently asked data science coding interview questions and answers for freshers in recent times.
Data come in various shapes and sizes and measure different things at different times. Financial analysts are often interested in particular types of data, such as time-series data, cross-sectional data, or panel data.
A few additional points to bear in mind: the most common issues when working with cross-sectional data are multicollinearity and heteroscedasticity. Multicollinearity is when two or more independent variables are correlated with each other. Heteroscedasticity is when the variance of the error term is not constant (e.g. salaries are typically higher and more variable in bigger cities than in smaller ones, skewing results towards bigger cities).
For time-series data, serial correlation (also known as autocorrelation) is an issue. It occurs when the error terms are correlated across different time periods; e.g. if salaries grow over time as a worker gains experience, the errors for successive observations are related, which makes it hard to identify the true differences between salaries across observations.
There are various methods and techniques to deal with each of these problems.
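As a hedged sketch of two standard diagnostics for the issues named above (multicollinearity and serial correlation), the snippet below computes variance inflation factors and the Durbin-Watson statistic with statsmodels on deliberately collinear synthetic data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)     # deliberately collinear with x1
X = sm.add_constant(np.column_stack([x1, x2]))
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=200)

model = sm.OLS(y, X).fit()

# A VIF well above ~10 is a common rule-of-thumb warning sign of multicollinearity
for i, name in enumerate(["const", "x1", "x2"]):
    print(f"VIF({name}) = {variance_inflation_factor(X, i):.1f}")

# A Durbin-Watson statistic near 2 suggests little serial correlation in the residuals
print(f"Durbin-Watson = {durbin_watson(model.resid):.2f}")
```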
A type of stochastic process that has received a great deal of attention and scrutiny from time series analysts is the so-called stationary stochastic process. Broadly speaking, a stochastic process is said to be stationary if its mean and variance are constant over time and the covariance between two time periods depends only on the distance, gap, or lag between them, not on the actual time at which the covariance is computed. In the time series literature, such a process is known as a weakly stationary, covariance stationary, second-order stationary, or wide-sense stationary stochastic process.
In short, if a time series is stationary, its mean, variance, and autocovariance (at various lags) remain the same no matter at what point we measure them; that is, they are time invariant. Such a series will tend to return to its mean (mean reversion), and fluctuations around this mean (measured by its variance) will have a broadly constant amplitude. If a time series is not stationary in this sense, it is called a nonstationary time series (keeping in mind we are talking only about weak stationarity); in other words, a nonstationary time series has a time-varying mean, a time-varying variance, or both.
Why are stationary time series so important? Because if a time series is nonstationary, we can study its behaviour only for the time period under consideration. Each set of time series data will therefore be for a particular episode. As a consequence, it is not possible to generalize it to other time periods. Therefore, for the purpose of forecasting, such (nonstationary) time series may be of little practical value.
There are various ways to test for non-stationarity of time-series data; the Augmented Dickey-Fuller (ADF) test is one of the most popular tests for determining whether a series is stationary.
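A minimal statsmodels sketch of the ADF test on a synthetic random walk (non-stationary) and its first difference (stationary); the data are generated purely for illustration:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=500))   # unit-root process: non-stationary
returns = np.diff(random_walk)                  # first difference: stationary

for name, series in [("random walk", random_walk), ("first difference", returns)]:
    stat, p_value, *_ = adfuller(series)
    # Null hypothesis of the ADF test: the series has a unit root (is non-stationary)
    print(f"{name:17s} ADF stat = {stat:6.2f}, p-value = {p_value:.3f}")
```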