
An A/B test is a randomized experiment in which "A" and "B" refer to two variants, run in order to determine which variant is more effective. A/B testing is a widely used method for finding the best online promotional and marketing strategies for a business: it can be used to test everything from website copy to sales emails to search ads, and the advantages it provides are usually enough to offset the additional time it takes.
One big caveat for A/B testing is to beware of results based on small sample sizes. Choosing a sample size for an A/B test is trickier than most people think (or would hope), and it is only one piece of a larger puzzle related to statistical confidence, which requires both the necessary number of samples and enough time for the experiment to play out. Proper experiment design takes into account the number of samples and conversions required for the desired statistical confidence, and lets the experiment run fully rather than pulling the plug early because there appears to be a winner.
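As a rough sketch of both ingredients (the library choice, conversion rates, and visitor counts below are illustrative assumptions, not figures from the text), a required sample size and a two-proportion z-test could be computed with statsmodels:

```python
# Minimal A/B-test sketch: required sample size, then a two-proportion z-test.
from statsmodels.stats.proportion import proportions_ztest, proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# How many visitors per variant to detect a lift from a 10% to a 12% conversion
# rate at a 5% significance level and 80% power?
effect = proportion_effectsize(0.10, 0.12)
n_required = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"Visitors needed per variant: {n_required:.0f}")

# Once the experiment has run its full course, compare the two variants.
conversions = [310, 370]   # conversions observed in A and B (made-up numbers)
visitors = [3000, 3000]    # visitors exposed to A and B
stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {stat:.3f}, p-value = {p_value:.4f}")
```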
A categorical variable (sometimes called a nominal variable) is one that has two or more categories with no intrinsic ordering. For example, gender is a categorical variable with, say, two categories (male and female), and there is no intrinsic ordering to the categories. Hair colour is also a categorical variable with a number of categories (blonde, brown, brunette, red, etc.), and again there is no agreed way to order these from highest to lowest. A purely categorical variable simply lets you assign categories, but you cannot meaningfully order them. If the variable does have a clear ordering, it is an ordinal variable.
Why does it matter if a variable is categorical, ordinal or interval?
Statistical computations and analyses assume that variables have specific levels of measurement. For example, it would not make sense to compute an average hair colour: an average of a categorical variable is meaningless because there is no intrinsic ordering of its categories. Likewise, if you tried to compute the average of an ordinal variable such as educational experience (e.g. primary, secondary, bachelor's, graduate), you would obtain a questionable result, because the spacing between the levels is uneven. In short, an average requires a variable to be interval. Sometimes a variable sits "in between" ordinal and interval, for example a five-point Likert scale with values "strongly agree", "agree", "neutral", "disagree" and "strongly disagree". If we cannot be sure that the intervals between these five values are the same, we should treat it as an ordinal variable; however, in order to use statistics that assume an interval variable, it is common to assume the intervals are equally spaced.
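A minimal pandas sketch of the three measurement levels, using made-up values for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "hair_colour": ["blonde", "brown", "red", "brown"],         # categorical (nominal)
    "satisfaction": ["disagree", "neutral", "agree", "agree"],  # ordinal (Likert)
    "temperature_c": [21.5, 19.0, 23.2, 20.1],                  # interval
})

# Nominal: categories with no order -- only counts and modes make sense.
print(df["hair_colour"].value_counts())

# Ordinal: declare an explicit order so comparisons and sorting respect it.
likert = pd.CategoricalDtype(
    categories=["strongly disagree", "disagree", "neutral", "agree", "strongly agree"],
    ordered=True,
)
df["satisfaction"] = df["satisfaction"].astype(likert)
print(df["satisfaction"].min())   # ordering is defined, but an average is still dubious

# Interval: equal spacing between values, so a mean is meaningful.
print(df["temperature_c"].mean())
```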
Machine learning arises from this question: could a computer go beyond "what we know how to order it to perform" and learn on its own how to perform a specified task? Could a computer learn to do things the way a human being does? Rather than programmers crafting data-processing rules by hand, could a computer automatically learn these rules by looking at data?
“A machine-learning system is trained rather than explicitly programmed. It’s presented with many examples relevant to the task, and it finds statistical structure in these examples that eventually allows the system to come up with rules for automating the task. For instance, if you wished to automate the task of tagging your vacation pictures, you could present a machine-learning system with many examples of pictures already tagged by humans, and the system would learn statistical rules for associating specific pictures to specific tags.”
(Please refer to the book "Deep Learning with Python" by Francois Chollet.)
Gradient descent is one of the most popular algorithms for performing optimization and is widely used to optimize neural networks; every state-of-the-art deep learning library contains implementations of various algorithms built on it (see, for example, lasagne's, caffe's, and keras' documentation). Gradient descent minimizes an objective function J(θ), parameterized by a model's parameters θ ∈ R^d, by updating the parameters in the direction opposite to the gradient of the objective function ∇_θ J(θ) w.r.t. the parameters. The learning rate η determines the size of the steps we take to reach a (local) minimum. In other words, we follow the slope of the surface created by the objective function downhill until we reach a valley.
Batch gradient descent computes the gradient of the cost function w.r.t. the parameters θ for the entire training dataset:
θ = θ − η · ∇_θ J(θ)
As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that don't fit in memory. Batch gradient descent also doesn't allow us to update our model online, i.e. with new examples on-the-fly.
Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example x^(i) and label y^(i):
θ = θ − η · ∇_θ J(θ; x^(i); y^(i))
Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online.
Mini-batch gradient descent takes the best of both worlds and performs an update for every mini-batch of n training examples:
θ = θ − η · ∇_θ J(θ; x^(i:i+n); y^(i:i+n))
This way, it (a) reduces the variance of the parameter updates, which can lead to more stable convergence; and (b) can make effective use of the highly optimized matrix operations common to state-of-the-art deep learning libraries, which make computing the gradient w.r.t. a mini-batch very efficient. Common mini-batch sizes range between 50 and 256, but can vary for different applications. Mini-batch gradient descent is typically the algorithm of choice when training a neural network, and the term SGD is usually employed even when mini-batches are used.
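A bare-bones NumPy sketch of the mini-batch update rule above, fitted to a synthetic linear-regression problem (the learning rate, batch size, and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=1000)

theta = np.zeros(3)   # parameters θ to learn
eta = 0.05            # learning rate η
batch_size = 64

for epoch in range(50):
    idx = rng.permutation(len(X))                 # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        # Gradient of the mean-squared-error objective on this mini-batch
        grad = 2.0 / len(Xb) * Xb.T @ (Xb @ theta - yb)
        theta -= eta * grad                       # θ = θ − η · ∇_θ J(θ; x^(i:i+n); y^(i:i+n))

print("estimated theta:", theta)   # should be close to [2.0, -1.0, 0.5]
```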
Don't be surprised if this question pops up as one of the top interview questions for data science in your next interview.
A p-value, in the parlance of statistics, can be defined as the lowest significance level at which the null hypothesis can be rejected. For a test statistic such as the t-statistic, p ≤ 0.05 indicates that the null hypothesis can be rejected in favour of the alternative hypothesis at the 5% level of significance, while p > 0.05 indicates that the evidence against the null hypothesis is not strong enough to reject it at that level.
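A hedged illustration with SciPy, using synthetic data rather than anything from the text: a one-sample t-test and how its p-value is read against the 5% significance level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=0.3, scale=1.0, size=50)   # true mean is 0.3, not 0

# H0: population mean == 0   vs   H1: population mean != 0
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")

if p_value <= 0.05:
    print("Reject H0 at the 5% significance level.")
else:
    print("Fail to reject H0 at the 5% significance level.")
```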
Response:
CRISP-DM stands for "Cross Industry Standard Process for Data Mining". It is a standard methodology for end-to-end data science project or program execution. It defines six phases (Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment), each involving different types of activities or tasks carried out during the engagement.
These phases are iterative rather than strictly sequential: findings from later phases routinely send the team back to earlier ones.

Response:
The process of adding a tuning parameter to a model or algorithm to induce smoothness and prevent overfitting is called regularization. A regularization term is added to the objective function to keep the coefficients from fitting the training data perfectly, thereby reducing the risk of overfitting.
This is primarily done by adding a penalty on the weight vector, scaled by a constant multiple. The penalty is most often the L1 norm (Lasso) or the L2 norm (Ridge), although in principle any norm can be used. The model is then trained to minimize the mean of the loss or error function plus this regularization term over the training set.
L1 or Lasso regularization helps perform feature selection in sparse feature spaces, and that is a good practical reason to use L1 in some situations. However, beyond that particular reason, L1 may not perform better than L2 in practice. Even in a situation where you might benefit from L1's sparsity to do feature selection, using L2 on the remaining variables is likely to give better results than L1 by itself.
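A short scikit-learn sketch of the contrast (the synthetic dataset and alpha values are illustrative assumptions): L1 zeroes out uninformative coefficients, while L2 only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # alpha controls the regularization strength
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 tends to drive uninformative coefficients exactly to zero (feature selection);
# L2 shrinks them smoothly towards zero without eliminating them.
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))
```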
Response:
There are multiple ways to make a model more robust to outliers, both from a data-preparation perspective and from a model-building perspective.
An outlier is usually assumed to be an unwanted, unexpected, or must-be-incorrect value given current knowledge (e.g. no one can live longer than 150 years), rather than a rare but possible event. Outliers are usually defined relative to the sample distribution, so they can be removed in the pre-processing step (before any learning happens). For roughly normal data, a standard-deviation rule such as flagging points outside mean ± 2·sd can be used. For non-normal or unknown distributions, fences based on the interquartile range work better: with Q1 the "middle" value of the lower half of the rank-ordered data and Q3 the "middle" value of the upper half, values far outside the Q1 to Q3 range (for example beyond Q1 − 1.5·IQR or Q3 + 1.5·IQR) are treated as outliers.
[Figure: sample data with typical outliers encircled in red, for illustration.]

Additionally, a data transformation (e.g. a log transformation) may help if the data have a noticeable tail. When outliers are related to the sensitivity of the collecting instrument, which may not precisely record extreme values, winsorization may be useful. Winsorizing, or winsorization, is the transformation of the data by limiting extreme values in order to reduce the effect of possibly spurious outliers.
This type of transformation has the same effect as clipping a signal: extreme data values are replaced with less extreme ones. Another option for reducing the influence of outliers is to use mean absolute error rather than mean squared error as the loss.
For model building, some approaches are naturally resistant to outliers, for example tree-based models or non-parametric tests. Tree models typically split each node into two parts, which behaves much like using a median; at each split, all data points in a bucket are treated equally regardless of how extreme their values may be.
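The snippet below sketches two of the pre-processing options mentioned above, IQR-based fences and winsorization, on synthetic data with a couple of injected outliers (the 1.5·IQR multiplier and 1% clipping limits are conventional but illustrative choices):

```python
import numpy as np
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(50, 5, size=200), [150.0, 160.0]])  # two injected outliers

# Interquartile-range fences (no normality assumption needed)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
filtered = data[(data >= lower) & (data <= upper)]
print("removed:", len(data) - len(filtered), "points")

# Winsorization: clip the most extreme 1% at each tail instead of dropping them
clipped = winsorize(data, limits=(0.01, 0.01))
print("max before:", data.max(), "max after:", float(clipped.max()))
```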
This is a common yet one of the most important data science interview questions and answers for experienced professionals, don't miss this one.
Response:
There are multiple ways to deal with missing values in a dataset, depending on the nature of the missing data. Some of the key methods are: dropping rows or columns with too many missing values; imputing with a simple statistic such as the mean, median, or mode; forward- or backward-filling for time-series data; model-based imputation (e.g. KNN or iterative, MICE-style imputation); and using algorithms that can handle missing values natively. The best choice depends on why the values are missing and how much data would otherwise be lost.
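A toy sketch of a few of these strategies with pandas and scikit-learn (the small DataFrame is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 35, np.nan],
                   "income": [50_000, 62_000, np.nan, 58_000, 61_000]})

# 1) Drop rows with missing values (only sensible when few values are missing)
dropped = df.dropna()

# 2) Simple statistical imputation: fill with the column median
median_filled = df.fillna(df.median(numeric_only=True))

# 3) The same idea wrapped for ML pipelines via scikit-learn
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed)
```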
Response:
In data mining, anomaly detection refers to the identification of items or events that do not conform to an expected pattern or to the other items present in the dataset, i.e. uncommon behaviour or patterns in the data.
Broadly, anomalies fall into three categories.
A single instance of data is considered anomalous if it is too far off from the rest. A typical business use case is detecting credit card fraud based on "amount spent". This is a point anomaly.
When the abnormality is context-specific, it is tagged as a contextual anomaly. This type of anomaly is quite common in time-series forecasting datasets. For example, spending 100 USD on food every day during the holiday season is normal, but may be odd otherwise; a spike in sales during Thanksgiving or the Christmas vacation may be genuine and expected, whereas the same surge in a non-festive season could be anomalous.
When a set of data instances collectively helps in detecting anomalies, it is categorized as a collective anomaly. A typical business use case is someone unexpectedly performing a financial transaction from a remote machine, accessing a source or host he or she is not authorized to use, an anomaly that would be flagged as a potential fraud attack.
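As one hedged, minimal example of point-anomaly detection on a single "amount spent" feature, an IsolationForest from scikit-learn could be used; the data and contamination rate below are illustrative assumptions, not values from the text:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
# 500 ordinary transactions around 100, plus two suspiciously large ones
amounts = np.concatenate([rng.normal(100, 20, size=500), [900.0, 1200.0]]).reshape(-1, 1)

model = IsolationForest(contamination=0.01, random_state=7).fit(amounts)
labels = model.predict(amounts)            # -1 flags an anomaly, 1 means normal

print("flagged amounts:", amounts[labels == -1].ravel())
```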
Wikipedia defines word embedding as the collective name for a set of language modelling and feature learning techniques in natural language processing (NLP) in which words or phrases from the vocabulary are mapped to vectors of real numbers. Word embeddings are a way to transform words in text into numerical vectors so that they can be analysed by standard machine learning algorithms that require numerical vectors as input.
Vectorisation can be done in many ways: one-hot encoding, Latent Semantic Analysis (LSA), TF-IDF (term frequency-inverse document frequency), and so on. However, these representations capture a somewhat different, document-centric idea of semantic similarity.
Distributed Representation:
Distributed representations attempt to capture the meaning of a word by considering its relations with other words in its context. The idea is captured in a quote from J. R. Firth, the linguist who first proposed it: "You shall know a word by the company it keeps" (for more information, refer to the article "Document Embedding with Paragraph Vectors" by Andrew M. Dai, Christopher Olah, and Quoc V. Le, arXiv:1507.07998, 2015).
Consider the following pair of sentences:
Paris is the capital of France. Berlin is the capital of Germany.
Even assuming you have no knowledge of world geography (or English, for that matter), you would still conclude without much effort that the word pairs (Paris, Berlin) and (France, Germany) are related in some way, and that corresponding words in each pair are related to each other in the same way, that is:
Paris : France :: Berlin : Germany
Thus, the aim of distributed representations is to find a general transformation function φ that converts each word to its associated vector such that relations of the following form hold true: φ("Paris") − φ("France") ≈ φ("Berlin") − φ("Germany").
Word2vec:
The word2vec group of models was created in 2013 by a team of researchers at Google led by Tomas Mikolov. The models are essentially unsupervised, taking a large corpus of text as input and producing a vector space of words. The dimensionality of the word2vec embedding space is usually much lower than that of the one-hot embedding space, whose dimensionality equals the size of the vocabulary, and the word2vec embedding is dense rather than sparse.
The two architectures for word2vec are Continuous Bag of Words (CBOW) and skip-gram.
In the CBOW architecture, the model predicts the current word given a window of surrounding words, and the order of the context words does not influence the prediction (the bag-of-words assumption). In the skip-gram architecture, the model predicts the surrounding words given the centre word. According to the authors, CBOW is faster, but skip-gram does a better job of predicting infrequent words.
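A minimal gensim sketch (assuming the gensim 4.x API; the tiny corpus and hyperparameters are purely illustrative) showing how a skip-gram model is trained and queried:

```python
from gensim.models import Word2Vec

corpus = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["berlin", "is", "the", "capital", "of", "germany"],
    ["madrid", "is", "the", "capital", "of", "spain"],
]

# sg=0 selects CBOW, sg=1 selects skip-gram
model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, sg=1, epochs=200)

print(model.wv["paris"].shape)                 # dense 50-dimensional vector
print(model.wv.most_similar("paris", topn=3))  # nearest neighbours in embedding space
```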
For distance-based methods such as KNN (k-nearest neighbours), the performance or predictive power of the model deteriorates as the number of features used for prediction grows. High-dimensional spaces are vast, and points in them tend to be far more dispersed than points in low-dimensional spaces.
As dimensions are added, the distances between points grow and also become more and more alike, which hints that we need an exponential increase in the number of data points as dimensionality increases in order for machine learning algorithms to keep working well.
It can be shown empirically that mean pairwise distance keeps increasing with dimension. Hence the higher the dimensionality, the more data is needed to overcome the curse of dimensionality!
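A quick empirical sketch of this effect: sampling random points in the unit hypercube and watching the mean pairwise distance grow as the dimension increases (the point counts and dimensions are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 100
for d in (2, 10, 100, 1000):
    points = rng.uniform(size=(n_points, d))        # random points in the unit hypercube
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    mean_dist = dists[np.triu_indices(n_points, k=1)].mean()
    print(f"d={d:5d}  mean pairwise distance ~ {mean_dist:.2f}")
```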
The Box-Cox transform belongs to the power transform family of functions. These functions are primarily used to create monotonic data transformations, and their main value is that they help stabilize variance and bring the data closer to a normal distribution, making its spread less dependent on the mean. The function has one prerequisite: the numeric values to be transformed must be positive (similar to what a log transform expects); if they are negative, shifting them by a constant helps. Mathematically, the Box-Cox transform can be defined as:
y = (x^λ − 1) / λ for λ ≠ 0, and y = ln(x) for λ = 0,
so that the transformed output y is a function of the input x and a transformation parameter λ; when λ = 0, the transform reduces to the natural log transform discussed earlier. The optimal value of λ is usually determined using maximum likelihood (or log-likelihood) estimation.
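A small SciPy example on synthetic right-skewed data: scipy.stats.boxcox picks λ by maximum likelihood and returns the transformed values (the lognormal sample below is an illustrative assumption).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=0.8, size=1000)   # right-skewed, strictly positive values

y, lam = stats.boxcox(x)            # lambda is chosen by maximum likelihood
print(f"optimal lambda ~ {lam:.3f}")
print(f"skewness before: {stats.skew(x):.2f}, after: {stats.skew(y):.2f}")
```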
This is one of the most frequently asked data science coding interview questions and answers for freshers in recent times.
Data come in various shapes and sizes and measure different things at different times. Financial analysts are often interested in particular types of data, such as time-series data, cross-sectional data, or panel data.
A few additional points to bear in mind: the most common issues when working with cross-sectional data are multicollinearity and heteroscedasticity. Multicollinearity is when two or more independent variables are correlated with each other. Heteroscedasticity is when the variance of the error term is not constant (e.g. salaries are typically higher and more variable in bigger cities than in smaller ones, skewing results towards bigger cities).
For time-series data, serial correlation (also known as autocorrelation) is an issue. It occurs when the error terms are correlated across different time periods; e.g. if salaries grow over time as a worker gains experience, the errors for successive observations are related, which makes it hard to identify the true differences between salaries across observations.
There are various methods and techniques to deal with each of these problems.
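As a hedged sketch of two standard diagnostics for the issues named above (multicollinearity and serial correlation), the snippet below computes variance inflation factors and the Durbin-Watson statistic with statsmodels on deliberately collinear synthetic data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)     # deliberately collinear with x1
X = sm.add_constant(np.column_stack([x1, x2]))
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=200)

model = sm.OLS(y, X).fit()

# A VIF well above ~10 is a common rule-of-thumb warning sign of multicollinearity
for i, name in enumerate(["const", "x1", "x2"]):
    print(f"VIF({name}) = {variance_inflation_factor(X, i):.1f}")

# A Durbin-Watson statistic near 2 suggests little serial correlation in the residuals
print(f"Durbin-Watson = {durbin_watson(model.resid):.2f}")
```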
A type of stochastic process that has received a great deal of attention and scrutiny from time series analysts is the so-called stationary stochastic process. Broadly speaking, a stochastic process is said to be stationary if its mean and variance are constant over time and the covariance between two time periods depends only on the distance, gap, or lag between them, not on the actual time at which the covariance is computed. In the time series literature, such a process is known as a weakly stationary, covariance stationary, second-order stationary, or wide-sense stationary stochastic process.
In short, if a time series is stationary, its mean, variance, and autocovariance (at various lags) remain the same no matter at what point we measure them; that is, they are time invariant. Such a series will tend to return to its mean (mean reversion), and fluctuations around this mean (measured by its variance) will have a broadly constant amplitude. If a time series is not stationary in this sense, it is called a nonstationary time series (keeping in mind we are talking only about weak stationarity); in other words, a nonstationary time series has a time-varying mean, a time-varying variance, or both.
Why are stationary time series so important? Because if a time series is nonstationary, we can study its behaviour only for the time period under consideration. Each set of time series data will therefore be for a particular episode. As a consequence, it is not possible to generalize it to other time periods. Therefore, for the purpose of forecasting, such (nonstationary) time series may be of little practical value.
There are various ways to test for non-stationarity of time-series data; the Augmented Dickey-Fuller (ADF) test is one of the most popular tests for determining whether a series is stationary.
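A minimal statsmodels sketch of the ADF test on a synthetic random walk (non-stationary) and its first difference (stationary); the data are generated purely for illustration:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=500))   # unit-root process: non-stationary
returns = np.diff(random_walk)                  # first difference: stationary

for name, series in [("random walk", random_walk), ("first difference", returns)]:
    stat, p_value, *_ = adfuller(series)
    # Null hypothesis of the ADF test: the series has a unit root (is non-stationary)
    print(f"{name:17s} ADF stat = {stat:6.2f}, p-value = {p_value:.3f}")
```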