This guide collects expert-designed data science interview questions you can expect when interviewing for a Data Scientist role. The questions and answers are grouped into two categories: basic and advanced.
An A/B test is a randomized experiment in which "A" and "B" refer to two variants, run in order to determine which variant is more "effective." A/B testing is a widely used method for finding the best online promotional and marketing strategies for your business. It can be used to test everything from website copy to sales emails to search ads, and the advantages A/B testing provides are enough to offset the additional time it takes.
One big caveat for A/B testing is to beware of results based on a small sample size. Choosing a sample size for an A/B test is tricky, and not as straightforward as most think (or would hope). It is also only one piece of a larger puzzle, statistical confidence, which comes only with both the necessary number of samples and the time required for the experiment to play out. Proper experiment design takes into account the number of samples and conversions required for a desired statistical confidence, and lets the experiment run to completion rather than pulling the plug early because there appears to be a winner.
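The small-sample caveat can be illustrated with a short sketch. It assumes a two-sided two-proportion z-test, which is one common way to compare conversion rates but not the only one; the function name and the conversion counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# The same 20% vs 25% conversion rates, first with 40 and then with
# 1000 visitors per variant (hypothetical numbers).
z_small, p_small = two_proportion_z_test(8, 40, 10, 40)
z_large, p_large = two_proportion_z_test(200, 1000, 250, 1000)
print(f"n=40 per variant:   z={z_small:.2f}, p={p_small:.3f}")
print(f"n=1000 per variant: z={z_large:.2f}, p={p_large:.4f}")
```

With identical observed rates, only the larger experiment gives a p-value below 0.05, which is why an apparent winner on a small sample should not end the test early.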
A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having, say, two categories (male and female) and there is no intrinsic ordering to the categories. Hair colour is also a categorical variable having a number of categories (blonde, brown, brunette, red, etc.) and again, there is no agreed way to order these from highest to lowest. A purely categorical variable is one that simply allows you to assign categories, but you cannot clearly order the categories. If the variable has a clear ordering, then that variable would be an ordinal variable, as described below.
Ordinal Variable - An ordinal variable is similar to a categorical variable. The difference between the two is that the categories of an ordinal variable have a clear ordering.
Interval Variable - An interval variable is similar to an ordinal variable, except that the intervals between its values are equally spaced.
Why does it matter if a variable is categorical, ordinal or interval?
Statistical computations and analyses assume that the variables have specific levels of measurement. For example, it would not make sense to compute an average hair colour. An average of a categorical variable does not make much sense because there is no intrinsic ordering of the levels of the categories. Moreover, if you tried to compute the average of educational experience as defined in the ordinal section above, you would also obtain a nonsensical result. Because the spacing between the four levels of educational experience is very uneven, the meaning of this average would be very questionable. In short, an average requires a variable to be interval. Sometimes you have variables that are “in between” ordinal and interval, for example, a five-point Likert scale with values “strongly agree”, “agree”, “neutral”, “disagree” and “strongly disagree”. If we cannot be sure that the intervals between each of these five values are the same, then we would not be able to say that this is an interval variable, but we would say that it is an ordinal variable. However, in order to be able to use statistics that assume the variable is interval, we will assume that the intervals are equally spaced.
Machine learning arises from this question: could a computer go beyond “what we know how to order it to perform” and learn on its own how to perform a specified task? Could a computer do things or learn as a human being does? Rather than programmers crafting data-processing rules by hand, could a computer automatically learn these rules by looking at data?
“A machine-learning system is trained rather than explicitly programmed. It’s presented with many examples relevant to the task, and it finds statistical structure in these examples that eventually allows the system to come up with rules for automating the task. For instance, if you wished to automate the task of tagging your vacation pictures, you could present a machine-learning system with many examples of pictures already tagged by humans, and the system would learn statistical rules for associating specific pictures to specific tags.”
(Please refer to the Book – “Deep Learning with Python” by Francois Chollet)
Gradient descent is one of the most popular optimization algorithms and is widely used to optimize neural networks. At the same time, every state-of-the-art deep learning library contains implementations of various algorithms to optimize gradient descent (see, e.g., the documentation of lasagne, caffe and keras). Gradient descent is a way to minimize an objective function J(θ), parameterized by a model's parameters θ ∈ R^d, by updating the parameters in the opposite direction of the gradient of the objective function ∇θJ(θ) w.r.t. the parameters. The learning rate η determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
Gradient Descent variants:
Batch Gradient Descent - Batch gradient descent computes the gradient of the cost function w.r.t. the parameters θ for the entire training dataset:
θ = θ−η⋅∇θJ(θ)
As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that don't fit in memory. Batch gradient descent also doesn't allow us to update our model online, i.e. with new examples on-the-fly.
Stochastic Gradient Descent - Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example x(i) and label y(i):
θ = θ−η⋅∇θJ(θ; x(i); y(i))
Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online.
Mini-Batch Gradient Descent - Mini-batch gradient descent takes the best of both worlds and performs an update for every mini-batch of n training examples:
θ = θ−η⋅∇θJ(θ; x(i:i+n); y(i:i+n))
This way, it a) reduces the variance of the parameter updates, which can lead to more stable convergence; and b) can make efficient use of the highly optimized matrix operations common to state-of-the-art deep learning libraries, which make computing the gradient w.r.t. a mini-batch very efficient. Common mini-batch sizes range between 50 and 256, but can vary for different applications. Mini-batch gradient descent is typically the algorithm of choice when training a neural network, and the term SGD is usually also employed when mini-batches are used.
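The three variants differ only in how many examples feed each update, which a minimal sketch can show. It assumes a one-feature linear model y ≈ w*x + b with a mean squared error objective; the function names, learning rate, epoch count and toy data are illustrative choices, not taken from any particular library.

```python
import random


def gradient(theta, batch):
    """Gradient of the mean squared error J(θ) for the model y ≈ w*x + b."""
    w, b = theta
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b


def gradient_descent(data, batch_size, lr=0.1, epochs=500, seed=0):
    """batch_size=len(data): batch GD; 1: SGD; anything between: mini-batch."""
    rng = random.Random(seed)
    data = list(data)
    theta = (0.0, 0.0)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            g = gradient(theta, data[start:start + batch_size])
            # θ = θ − η⋅∇θJ(θ): step against the gradient.
            theta = tuple(t - lr * gi for t, gi in zip(theta, g))
    return theta


# Toy, noise-free data generated from y = 3x + 1 (assumed for illustration).
data = [(x / 10, 3 * x / 10 + 1) for x in range(20)]
theta_batch = gradient_descent(data, batch_size=len(data))
theta_sgd = gradient_descent(data, batch_size=1)
theta_mini = gradient_descent(data, batch_size=5)
print(theta_batch, theta_sgd, theta_mini)
```

On this toy problem all three settings should recover values close to w = 3 and b = 1; on noisy real data the single-example updates would fluctuate more than the mini-batch ones.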
In statistics, the p-value can be defined as the lowest level of significance at which the null hypothesis can be rejected. For a test statistic such as the t-statistic, p ≤ 0.05 indicates that the null hypothesis can be rejected in favour of the alternative hypothesis at the 5% level of significance, whereas p > 0.05 indicates that the evidence is not strong enough to reject the null hypothesis at that level (which is not the same as showing that the null hypothesis is true).
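To make the idea of a p-value concrete without relying on the t-distribution, the sketch below swaps in a permutation test: it estimates how often a difference in means at least as large as the observed one would appear if the group labels were assigned at random (the null hypothesis). The function name and sample values are hypothetical.

```python
import random
from statistics import fmean


def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(fmean(a) - fmean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:len(a)]) - fmean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5, 5.3, 5.7]
group_b = [4.1, 4.0, 4.5, 3.9, 4.2, 4.4, 4.3, 3.8]
p_different = permutation_p_value(group_a, group_b)
p_same = permutation_p_value(group_a, group_a)
print(p_different, p_same)
```

Here the two clearly separated groups give a p-value far below 0.05, while comparing a group with itself gives a p-value of 1, so the null hypothesis cannot be rejected.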
Linear regression is one of the oldest, simplest and most widely used supervised machine learning algorithms for predictive analysis. It predicts a target (dependent) variable by fitting the best linear relationship between it and one or more independent variables.
The best-fitting line is found by making sure that the sum of the squared distances between the fitted shape (here, a line) and the actual observations is as small as possible. The fit of the shape is “best” in the sense that no other position would produce less error given the choice of shape.
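For a single independent variable, the least-squares line has a closed-form solution, sketched below. The function names and the five data points are made up for illustration.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares for y ≈ intercept + slope * x."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the line and the data."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_line(xs, ys)
best = sum_squared_errors(xs, ys, intercept, slope)
# Nudging either coefficient away from the fit can only increase the error.
worse = sum_squared_errors(xs, ys, intercept + 0.1, slope - 0.05)
print(intercept, slope, best, worse)
```

Comparing `best` with `worse` shows the sense in which the fitted line is "best": any other position of the line gives a larger sum of squared errors.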
Types of Linear Regression:
Wikipedia defines word embedding as the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Word embeddings are a way to transform words in text to numerical vectors so that they can be analysed by standard machine learning algorithms that require vectors as numerical input.
Vectorisation can be done in many ways, such as one-hot encoding, Latent Semantic Analysis (LSA) and TF-IDF (Term Frequency-Inverse Document Frequency). However, these representations capture a slightly different, document-centric idea of semantic similarity.
Distributed Representation :
Distributed representations attempt to capture the meaning of a word by considering its relations with other words in its context. The idea is captured in this quote from J. R. Firth, a linguist who first proposed this idea: “You shall know a word by the company it keeps” (for more information refer to the article: Document Embedding with Paragraph Vectors, by Andrew M. Dai, Christopher Olah, and Quoc V. Le, arXiv:1507.07998, 2015).
Consider the following pair of sentences:
Paris is the capital of France. Berlin is the capital of Germany.
Even assuming you have no knowledge of world geography (or English for that matter), you would still conclude without too much effort that the word pairs (Paris, Berlin) and (France, Germany) were related in some way, and that corresponding words in each pair were related in the same way to each other, that is:
Paris : France :: Berlin : Germany
Thus, the aim of distributed representations is to find a general transformation function φ to convert each word to its associated vector such that relations of the following form hold true:
φ("Paris") − φ("France") ≈ φ("Berlin") − φ("Germany")
The word2vec group of models was created in 2013 by a team of researchers at Google led by Tomas Mikolov. The models are basically unsupervised, taking as input a large corpus of text and producing a vector space of words. The dimensionality of the word2vec embedding space is usually lower than the dimensionality of the one-hot embedding space, which is the size of the vocabulary. The embedding space is also more dense compared to the sparse embedding of the one-hot embedding space.
The two architectures for word2vec are Continuous Bag of Words (CBOW) and skip-gram.
In the CBOW architecture, the model predicts the current word given a window of surrounding words. In addition, the order of the context words does not influence the prediction (that is, the bag of words assumption). In the case of skip-gram architecture, the model predicts the surrounding words given the centre word. According to the authors, CBOW is faster but skip-gram does a better job at predicting infrequent words.
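The difference between the two architectures shows up in the training examples each one sees. The sketch below only builds those examples from a token list (it does not train a model); the function names and the window size of 1 are assumptions made for illustration.

```python
def cbow_pairs(tokens, window=1):
    """(context words, target word): CBOW predicts the target from its context."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


def skipgram_pairs(tokens, window=1):
    """(centre word, context word): skip-gram predicts each neighbour from the centre."""
    return [
        (target, ctx)
        for context, target in cbow_pairs(tokens, window)
        for ctx in context
    ]


tokens = "paris is the capital of france".split()
for context, target in cbow_pairs(tokens)[:3]:
    print("CBOW:     ", context, "->", target)
for centre, ctx in skipgram_pairs(tokens)[:3]:
    print("skip-gram:", centre, "->", ctx)
```

CBOW produces one example per position with the whole context as input, whereas skip-gram produces one example per (centre, neighbour) pair.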
For distance-based methods such as K-Nearest Neighbours (KNN), the performance or predictive power of the model deteriorates as the number of features required for prediction increases. High-dimensional spaces are vast: points in a high-dimensional space tend to be more dispersed from each other than points in a low-dimensional space.
As the number of dimensions increases, the average distance between points keeps growing (for points drawn uniformly from a unit hypercube, roughly in proportion to the square root of the number of dimensions), which hints that the number of data points must increase exponentially with the number of dimensions for machine learning algorithms to work correctly. Hence the higher the dimensions, the more data is needed to overcome the curse of dimensionality!
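A quick simulation can make the dispersion effect visible. The sketch assumes points drawn uniformly from the unit hypercube, for which the expected squared distance between two points is exactly d/6, so the mean distance should approach √(d/6) as d grows; the function name, point count and seed are arbitrary.

```python
import random
from math import dist, sqrt


def mean_pairwise_distance(dim, n_points=200, seed=0):
    """Average Euclidean distance between random points in the unit hypercube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total, count = 0.0, 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += dist(points[i], points[j])
            count += 1
    return total / count


means = {dim: mean_pairwise_distance(dim) for dim in (1, 2, 10, 100)}
for dim, value in means.items():
    # Compare the simulated mean with the large-d approximation sqrt(d / 6).
    print(dim, round(value, 3), round(sqrt(dim / 6), 3))
```

The same 200 points that crowd a line segment are spread far apart in 100 dimensions, which is why nearest-neighbour style methods need far more data as features are added.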
The Box-Cox transform belongs to the power transform family of functions. These functions are primarily used to create monotonic data transformations, but their main significance is that they help in stabilizing variance, making the data adhere more closely to a normal distribution and making it independent of the mean based on its distribution. The function has one prerequisite: the numeric values to be transformed must be positive (as the log transform also requires). If they are negative, shifting them by a constant value helps. Mathematically, the Box-Cox transform function can be defined as:
y = f(x, λ) = (x^λ − 1) / λ for λ ≠ 0, and y = ln(x) for λ = 0
The transformed output y is thus a function of the input x and the transformation parameter λ, and when λ = 0 the transform is the natural log transform. The optimal value of λ is usually determined using maximum likelihood or log-likelihood estimation.
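The definition translates directly into a few lines of code. This is a minimal sketch for a single value with λ supplied by the caller; it does not estimate the optimal λ, and the function name is chosen here for illustration.

```python
from math import log


def box_cox(x, lam):
    """Box-Cox transform of a single positive value x for parameter λ (lam)."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive input; shift the data first")
    if lam == 0:
        return log(x)
    return (x ** lam - 1) / lam


values = [0.5, 1.0, 2.0, 8.0]
print([round(box_cox(v, 0.5), 3) for v in values])
# As λ approaches 0 the power form approaches the natural log.
print(box_cox(2.0, 1e-6), box_cox(2.0, 0))
```

The second print illustrates why the λ = 0 case is defined as the log transform: it is the limit of the power form as λ shrinks towards zero.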
Data come in various shapes and sizes, and measure different things at different times. Financial analysts are often interested in particular types of data, such as time-series data, cross-sectional data or panel data.
A few additional points to bear in mind in this regard: the most common issues when working with cross-sectional data are multicollinearity and heteroscedasticity. Multicollinearity is where two or more independent variables are correlated with each other. Heteroscedasticity is where the variance of the error term is not constant (e.g. salaries are typically higher in bigger vs. smaller cities, skewing results towards bigger cities).
For time-series data, serial correlation (also known as autocorrelation) is an issue. This happens when the error terms are correlated across different time periods, e.g. if salaries grow over time as a worker gains experience, this makes it harder to identify important differences between salaries across different observations.
Various methods and techniques are available to deal with each of these problems.
A type of stochastic process that has received a great deal of attention and scrutiny by time series analysts is the so-called stationary stochastic process. Broadly speaking, a stochastic process is said to be stationary if its mean and variance are constant over time and the value of the covariance between the two time periods depends only on the distance or gap or lag between the two time periods and not the actual time at which the covariance is computed. In the time series literature, such a stochastic process is known as a weakly stationary, or covariance stationary, or second-order stationary, or wide sense, stochastic process.
In short, if a time series is stationary, its mean, variance, and autocovariance (at various lags) remain the same no matter at what point we measure them; that is, they are time invariant. Such a time series will tend to return to its mean (called mean reversion) and fluctuations around this mean (measured by its variance) will have a broadly constant amplitude. If a time series is not stationary in the sense just defined, it is called a nonstationary time series (keep in mind we are talking only about weak stationarity). In other words, a nonstationary time series will have a time-varying mean or a time-varying variance or both.
Why are stationary time series so important? Because if a time series is nonstationary, we can study its behaviour only for the time period under consideration. Each set of time series data will therefore be for a particular episode. As a consequence, it is not possible to generalize it to other time periods. Therefore, for the purpose of forecasting, such (nonstationary) time series may be of little practical value.
There are various ways to test time series data for non-stationarity; the Augmented Dickey-Fuller (ADF) test is one of the most popular.
Power analysis has two main goals in the process of designing an experiment: (a) to determine how large a sample is required for making statistical judgments that are accurate and reliable, and (b) to determine how likely your statistical test is to detect an effect of a given size in a particular situation.
In other words, power analysis is a crucial aspect of experimental design. It helps us to determine the sample size required to detect an effect of a given size with a given degree of confidence. Conversely, it allows us to determine the probability of detecting an effect of a given size with a given level of confidence, under sample size constraints. If the probability is unacceptably low, we would be advised to alter or abandon the experiment.
The following four quantities are the most important as far as power analysis is concerned: sample size, effect size, significance level, and power.
Given any three, we can determine the fourth.
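The "given any three, find the fourth" relationship can be sketched for one simple case. The code assumes a two-sided, two-sample comparison of means with a standardized effect size and uses the normal approximation (an exact t-based calculation gives slightly larger sample sizes); the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_norm = NormalDist()


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample z-test (normal approximation)."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    z_power = _norm.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def achieved_power(effect_size, n_per_group, alpha=0.05):
    """Power of the same test, ignoring the negligible opposite-tail term."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    return _norm.cdf(effect_size * sqrt(n_per_group / 2) - z_alpha)


# Solve for sample size given effect size, significance level and power ...
n = sample_size_per_group(effect_size=0.5, alpha=0.05, power=0.80)
# ... or solve for power given effect size, significance level and sample size.
print(n, round(achieved_power(0.5, n), 3), round(achieved_power(0.5, 20), 3))
```

If the power computed for the sample size you can afford is unacceptably low, that is the signal to alter or abandon the design.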
The LR model is based on certain assumptions, some of which refer to the distribution of the random variable (the error term e) and some of which refer to the relationship between e and the explanatory variables. We will group them into two categories: (i) stochastic assumptions and (ii) other assumptions.
(Please refer to the Book – “The theory of econometrics – 2nd Edition by A. Koutsoyiannis”)
As Francois Chollet puts it in his book “Deep Learning with Python”: “Deep learning is a specific subfield of machine learning: a new take on learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations. The deep in deep learning does not necessarily refer to any kind of deeper understanding achieved by the approach; rather, it stands for the idea of successive layers of representations. How many layers contribute to a model of the data is called the depth of the model. Other appropriate names for the field could have been layered representations learning and hierarchical representations learning. Modern deep learning often involves tens or even hundreds of successive layers of representations, and they’re all learned automatically from exposure to training data. Meanwhile, other approaches to machine learning tend to focus on learning only one or two layers of representations of the data; hence, they’re sometimes called shallow learning.”
Reinforcement Learning (RL) is a special branch of Machine Learning that has received a lot of attention in recent times, after Google DeepMind successfully applied it to learning to play Atari games (and, later, learning to play Go at the highest level). Typically, RL refers to a framework where an agent receives information about its environment and learns to choose actions that will maximize some reward. For instance, a neural network that “looks” at a videogame screen and outputs game actions in order to maximize its score can be trained via reinforcement learning.
Currently, reinforcement learning is one of the most researched areas, yet it has still to achieve significant success beyond games. In time, however, we expect to see reinforcement learning take over an increasingly large range of real-world applications: self-driving cars, robotics, resource management, education, and so on. It’s an idea whose time has come, or will come soon.
One approach would be to calculate the point-biserial correlation, which estimates the degree of association between a binary variable and a continuous variable. The Point-Biserial Correlation Coefficient is a correlation measure of the strength of association/coherence between a continuous-level variable (ratio or interval data) and a binary variable. Binary variables are variables of nominal scale with only two values. They are also called dichotomous variables, or dummy variables in regression analysis.
Mathematically, the Point-Biserial Correlation Coefficient is calculated just as Pearson’s Bivariate Correlation Coefficient would be, with the dichotomous variable coded as either 0 or 1 (which is why it is also called the binary variable). Since we use the same mathematical concept, we need to fulfil the same assumptions, which are normal distribution of the continuous variable and homoscedasticity.
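The equivalence with Pearson's coefficient can be checked numerically. The sketch computes the point-biserial coefficient from the group means, r_pb = (M1 − M0) / s · √(p·q) with s the population standard deviation of the continuous variable, and compares it with a plain Pearson correlation on the 0/1 coding; the data and function names are invented for illustration.

```python
from math import sqrt
from statistics import fmean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def point_biserial(binary, continuous):
    """r_pb = (M1 - M0) / s * sqrt(p * q), with s the population std. dev."""
    group1 = [c for b, c in zip(binary, continuous) if b == 1]
    group0 = [c for b, c in zip(binary, continuous) if b == 0]
    p = len(group1) / len(binary)
    spread = pstdev(continuous)
    return (fmean(group1) - fmean(group0)) / spread * sqrt(p * (1 - p))


binary = [0, 0, 0, 1, 1, 1, 1, 0]
continuous = [2.1, 3.4, 2.9, 5.6, 6.1, 4.8, 5.9, 3.0]
r_pb = point_biserial(binary, continuous)
r_pearson = pearson(binary, continuous)
print(r_pb, r_pearson)
```

Both routes give the same number, which is why standard Pearson routines can be used once the binary variable is coded as 0 and 1.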
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. The curve plots two parameters: the True Positive Rate (TPR) and the False Positive Rate (FPR).
An ROC curve plots TPR against FPR at different classification/probability thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives. In order to compute the points in an ROC curve, one could evaluate a logistic regression model many times with different classification thresholds, but this would be inefficient. Fortunately, there's an efficient, sorting-based algorithm that can provide this information for us, called AUC.
AUC stands for "Area under the ROC Curve"; that is, AUC measures the two-dimensional area underneath the entire ROC curve ranging from (0,0) to (1,1). AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.
AUC ranges in value from 0 to 1. A model whose predictions are 100% incorrect has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
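Both views of AUC, the area under the curve and the ranking probability, can be computed in a few lines. This is a minimal sketch with made-up labels and scores; it sorts the examples once rather than re-evaluating the model per threshold, and the function names are chosen here for illustration.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold through each score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), reverse=True)
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only once all examples tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def auc_ranking(labels, scores):
    """P(random positive scores higher than random negative); ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(labels, scores)
print(points, auc_trapezoid(points), auc_ranking(labels, scores))
```

On this toy input three of the four positive/negative pairs are ranked correctly, and the area under the curve comes out to the same value.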
If we have two different probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:
D_KL(P‖Q) = E_{x∼P}[log(P(x)/Q(x))] = E_{x∼P}[log P(x) − log Q(x)]
In the case of discrete variables, it is the extra amount of information (measured in bits if we use the base-2 logarithm, but in machine learning we usually use nats and the natural logarithm) needed to send a message containing symbols drawn from probability distribution P, when we use a code that was designed to minimize the length of messages drawn from probability distribution Q. The KL divergence has many useful properties, most notably being non-negative. The KL divergence is 0 if and only if P and Q are the same distribution in the case of discrete variables, or equal “almost everywhere” in the case of continuous variables. Because the KL divergence is non-negative and measures the difference between two distributions, it is often conceptualized as measuring some sort of distance between these distributions.
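For discrete distributions the expectation becomes a sum, Σ P(x) log(P(x)/Q(x)), which the sketch below computes. The function name, the optional `base` argument and the two example distributions are assumptions made for illustration.

```python
from math import log


def kl_divergence(p, q, base=None):
    """D_KL(P || Q) for discrete distributions given as aligned probability lists."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if q_x == 0:
            return float("inf")  # P puts mass where Q has none
        total += p_x * log(p_x / q_x)
    # Natural log gives nats; dividing by log(base) converts, e.g. base=2 for bits.
    return total if base is None else total / log(base)


p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p), kl_divergence(p, p))
print(kl_divergence(p, q, base=2))
```

The output illustrates the properties described above (non-negative, zero for identical distributions) and also shows that swapping P and Q changes the value.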
One use for KL-divergence in the context of discovering correlations is to calculate the Mutual Information (MI) of two variables, which can reveal patterns between the two variables and give an idea of the correlation structure.
Another use for Kullback-Leibler divergence is in the domain of variational inference, where an optimization problem is constructed to minimize the KL-divergence between the intractable target distribution P and a sought element Q from a class of tractable distributions.
Many approximating algorithms (which can also be used to fit probabilistic models to data) can be interpreted using KL divergence. Among those are Mean Field, (Loopy) Belief Propagation (generalizing forward-backward and Viterbi for HMMs), Expectation Propagation, Junction graph/tree, tree-reweighted Belief Propagation.
(Please refer to: Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference, Foundations and Trends in Machine Learning, Now Publishers Inc., 2008, Vol. 1(1-2), pp. 1-305)
One of the key steps in building a machine learning model is to estimate its performance on data that the model hasn't seen before. Suppose we fit our model on a training dataset and use the same data to estimate how well it performs on new data; such an estimate would tend to be overly optimistic.
A typical model may suffer from underfitting (high bias) if it is too simple, or it may overfit the training data (high variance) if it is too complex for the underlying training data. To find an acceptable bias-variance trade-off, we need to evaluate our model carefully. In this section, you will learn about the common cross-validation techniques, holdout cross-validation and k-fold cross-validation, which can help us obtain reliable estimates of the model's generalization performance, that is, how well the model performs on unseen data.
The Holdout Method:
A classic and popular approach for estimating the generalization performance of machine learning models is holdout cross-validation. Using the holdout method, we split our initial dataset into a separate training and test dataset—the former is used for model training, and the latter is used to estimate its generalization performance. However, in typical machine learning applications, we are also interested in tuning and comparing different parameter settings to further improve the performance for making predictions on unseen data.
A disadvantage of the holdout method is that the performance estimate may be very sensitive to how we partition the training set into the training and validation subsets; the estimate will vary for different samples of the data.
The K-fold cross validation Method:
In k-fold cross-validation, we randomly split the training dataset into k folds without replacement, where k − 1 folds are used for the model training, and one fold is used for performance evaluation. This procedure is repeated k times so that we obtain k models and performance estimates. We then calculate the average performance of the models based on the different, independent folds to obtain a performance estimate that is less sensitive to the sub-partitioning of the training data compared to the holdout method. Typically, we use k-fold cross-validation for model tuning, that is, finding the optimal hyperparameter values that yield a satisfying generalization performance.
Since k-fold cross-validation is a resampling technique without replacement, the advantage of this approach is that each sample point will be used for training and validation (as part of a test fold) exactly once, which yields a lower-variance estimate of the model performance than the holdout method.
A good standard value for k in k-fold cross-validation is 10, as empirical evidence shows. For instance, experiments by Ron Kohavi on various real-world datasets suggest that 10-fold cross-validation offers the best trade-off between bias and variance (A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection, Kohavi, Ron, International Joint Conference on Artificial Intelligence (IJCAI), 14 (12): 1137-43, 1995).
A special case of k-fold cross-validation is the Leave-one-out cross-validation (LOOCV) method. In LOOCV, we set the number of folds equal to the number of training samples (k = n) so that only one training sample is used for testing during each iteration, which is a recommended approach for working with very small datasets.
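The splitting logic behind k-fold cross-validation (and LOOCV as the k = n case) can be sketched independently of any model. The function names are hypothetical, and `score_fn` stands in for whatever routine trains on the training indices and returns a score on the validation indices.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(n_samples, k, score_fn, seed=0):
    """Average score_fn(train, validation) over the k folds."""
    scores = [score_fn(tr, va) for tr, va in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(n_samples=10, k=5))
for train, validation in splits:
    print("train:", sorted(train), "validation:", sorted(validation))

# Leave-one-out is the special case k = n: one validation sample per fold.
loocv = list(k_fold_indices(n_samples=4, k=4))
```

Every sample lands in a validation fold exactly once, which is the property the text relies on when comparing k-fold with the holdout method.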
The following methods can be considered for finding the optimal value of K in K-means clustering.
Approximate Expected Overall R-square: Approximate Expected Overall R-Square is calculated based on the hypothesis that all the explanatory variables used for Clustering are independent. Hence if there is a lot of difference between Observed Overall R-square and Approximate Expected Overall R-square, we can suspect high correlation among the independent variables.
Cubic Clustering Criterion:
The optimal number of clusters is found at a point where CCC and Pseudo-F reach maximum and Overall R-Square tapers off.
Elbow Method: The Elbow method is a method of interpretation and validation of consistency within cluster analysis, designed to help find the appropriate number of clusters in a data set. One simple heuristic is to compute the total within sum of squares (WSS) for different values of k and look for an “elbow” in the curve. Define the cluster’s centroid as the point that is the mean value of all the points in the cluster. The within sum of squares for a single cluster is the sum of the squared distances of each point in the cluster from the cluster’s centroid. The total within sum of squares is the sum of the within sum of squares of all the clusters. The total WSS will decrease as the number of clusters increases, because each cluster will be smaller and tighter. The hope is that the rate at which the WSS decreases will slow down for k beyond the optimal number of clusters. In other words, the graph of WSS versus k should flatten out beyond the optimal k, so the optimal k will be at the “elbow” of the graph. Unfortunately, this elbow can be difficult to see.
CH Index (Calinski-Harabasz): The Calinski-Harabasz index of a clustering is the ratio of the between-cluster variance (which is essentially the variance of all the cluster centroids from the dataset’s grand centroid) to the total within-cluster variance (basically, the average WSS of the clusters). For a given dataset, the total sum of squares (TSS) is the sum of the squared distances of all the data points from the dataset’s centroid. The TSS is independent of the clustering. If WSS(k) is the total WSS of a clustering with k clusters, then the between sum of squares BSS(k) of the clustering is given by BSS(k) = TSS - WSS(k). WSS(k) measures how close the points in a cluster are to each other. BSS(k) measures how far apart the clusters are from each other. A good clustering has a small WSS(k) and a large BSS(k). The within-cluster variance W is given by WSS(k)/(n-k), where n is the number of points in the dataset. The between-cluster variance B is given by BSS(k)/(k-1). The within-cluster variance will decrease as k increases; the rate of decrease should slow down past the optimal k. The between-cluster variance will increase as k increases, but the rate of increase should slow down past the optimal k. So in theory, the ratio of B to W should be maximized at the optimal k.
All these metrics can be evaluated to decide on the final value for K.
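The WSS, BSS and Calinski-Harabasz quantities described above can be computed directly from a set of points and their cluster labels. The sketch assumes the labels come from some clustering run that is not shown (it needs at least two clusters and fewer clusters than points); the function names and the four toy points are made up.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sum_of_squares(points):
    """Sum of squared distances from each point to the points' centroid."""
    c = centroid(points)
    return sum(dist(p, c) ** 2 for p in points)


def calinski_harabasz(points, labels):
    """CH = [BSS / (k - 1)] / [WSS / (n - k)], with BSS = TSS - WSS."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    n, k = len(points), len(clusters)
    wss = sum(sum_of_squares(members) for members in clusters.values())
    bss = sum_of_squares(points) - wss
    return wss, bss, (bss / (k - 1)) / (wss / (n - k))


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
# A sensible grouping (left pair vs right pair) and a poor one, both with k = 2.
wss, bss, ch = calinski_harabasz(points, labels=[0, 0, 1, 1])
wss_bad, bss_bad, ch_bad = calinski_harabasz(points, labels=[0, 1, 0, 1])
print(wss, bss, ch)
print(wss_bad, bss_bad, ch_bad)
```

For the elbow method one would record `wss` for each candidate k and look for the bend; for the CH index one would pick the k with the largest ratio.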
Data Scientist is not an easy role to get into. A degree in mathematics or engineering alone is not enough; a data scientist also needs to develop the skills the industry demands. If you aspire to become a Data Scientist but are finding it difficult to crack the interview, these Data Science interview questions will be helpful for you.
These top Data Science interview questions and answers will help you prepare for a Data Science interview. If you are already working on Data Science projects and want to learn the Python and R programming languages to extend your skill set, you can still practise these interview questions and answers. Preparing with these Data Science interview questions will increase your visibility to potential employers.