Top 50 Data Analyst Interview Questions and Answers for 2024


Beginner

1.
What is data ? Explain with a real world example.

Here the interviewer wants to assess your basic knowledge of data and how well you understand its practical aspects.

Hence we need to answer by covering the different kinds of data and grounding them in a real-world scenario to show practical knowledge.

Data are collected observations or measurements represented as text, numbers or multimedia. Data can be field notes, photographs, documents, audio recordings, videos and transcripts.

Data is different depending upon your area of work or research. If your objective is to find out graduation rates of college students with faculty mentors, your data might be the number of graduates each year and amount of time taken to complete the graduation. Hence data will be different based on what you study.


2.
What are categories of data ? Explain with examples.

Here they don’t expect you to just give the theoretical definition of the categories of data; they also check whether you’re aware of data’s application in the real world.

We need to demonstrate the same.

Data can be broadly categorized as qualitative and quantitative.

Quantitative Data: This data can be expressed as a number, counted or compared on numerical scale. Examples include number of attendees at an event, count of words in a book, temperatures observed, land measurements gathered and gradient scales from surveys.
Qualitative Data: This data is non-numerical or categorical in nature and describes the attributes or properties that an object possesses such as social class, marital status, method of treatment etc. Examples include maps, transcripts, pictures and textual descriptions.

3.
Is data the same as statistics ? What are the benefits of analyzing the data?

Data is not the same as statistics. Statistics are the result of data analysis and interpretations, so we can’t use the two words interchangeably.

Analyzing and interpreting the data can help you:

Identify patterns and trends
Offer solutions
Understand scientific phenomena

4.
What are must-have data analyst skills?

Here the intent is to check your awareness of a data analyst's skillset. It is better to answer in separate categories so that this awareness is conveyed clearly.

Must-have data analyst skills include both soft skills and hard skills needed to perform data analysis efficiently.

Soft Skills:

Communication
Critical Thinking
Story Telling
Decision making
Fast coding
Collaboration

Hard Skills:

Linear algebra and Calculus
SQL and NoSQL
Matlab, R and Python
Microsoft Excel
Data Visualization

5.
What are the various steps involved in any analytics project?

Here interviewer wants to evaluate your understanding of the entire data analysis process or all the steps of any analytics project. Hence explain all the steps accordingly.

Understanding the problem – Identify the right question to solve, understand expectations
Data collection – This step is about gathering accurate data from all sources. Examples include opinion polls, sales records and surveys etc.
Data cleaning – This includes removal or fixing of incomplete, corrupted, incorrect, wrongly formatted or duplicate data
Data exploration and analysis (EDA) - Objective of this step is to analyze and investigate the data and summarize its main characteristics
Interpret the results – This involves employing data visualization techniques ultimately to discover trends, patterns or to cross check assumptions

6.
Which technical tools have you used for analysis and presentation purposes?

Being a data analyst, you are expected to have knowledge of the below tools for analysis and presentation purposes.

Python
Tableau
Microsoft Excel
MySQL
Microsoft PowerPoint or equivalent
Microsoft SQL Server
IBM SPSS

7.
What is the objective and significance of EDA (Exploratory Data Analysis)?

Exploratory Data Analysis or EDA is one of the important steps in the data processing life cycle and it is nothing but a data exploration technique to understand the various aspects of the data. It is basically used to filter the data from redundancies.

Exploratory Data Analysis helps to understand the data better
It helps you to gain confidence in your data to a point where you are ready to apply a machine learning algorithm
It allows you to refine your selection of feature variables that will be used later for building the model
You can discover hidden patterns, trends and insights from the data

8.
What are the best practices for data cleaning?

Here interviewer wants to know whether you know standard practices followed for data cleaning and also some of the preventive mechanisms. Some of the best practices used in data cleaning are:

Preparation of data cleaning plan by understanding where the common errors take place and we need to keep communications open
We need to identify and remove duplicates before we work with the data. This will ensure data analysis process is effective
As a data analyst it’s our responsibility to focus on the accuracy of the data. Also maintain the value types of data, provide mandatory constraints and set cross-field validation
We need to standardize the data at the entry point so that it is less chaotic; this ensures that all the information is standardized, leading to fewer errors on entry

9.
What is data validation?

Here we need to explain our understanding of data validation, including the steps or processes involved.

We need to start with the formal definition and then talk about the processes involved.

Data validation is the process that involves validating or determining the accuracy of the data at hand and also the quality of the sources.

There are many processes in data validation step, and the main ones include data screening and data verification.

Data Screening: This process is all about making use of a variety of models to ensure that the data under consideration is accurate and there are no redundancies present.
Data Verification: If there is a redundancy found at the screening process, then it is evaluated based on multiple steps and later a call is taken to ensure the presence of the data item.

10.
How can you handle missing values in a dataset?

This question is asked to assess our knowledge of corrective mechanisms to address missing values in dataset.

Missing values in a dataset is one of the big problems in real life scenarios. This situation will arise when no information is provided for one or more items or for a whole unit.

Some of the ways to handle missing values in a dataset are:

Listwise Deletion: In this method, an entire record is excluded from analysis if any single value is missing
Average Imputation: Use the average value of the responses from the other participants to fill in the missing value
Regression Substitution: One can use multiple-regression analysis to estimate a missing value
Multiple Imputation: It creates plausible values based on the correlations for the missing data and then averages the simulated datasets by incorporating random errors in your prediction
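The first two approaches above can be sketched in pandas as follows. This is a minimal illustration, assuming a small made-up survey table; the column names and values are hypothetical and not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with gaps (values invented for illustration)
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35],
    "income": [50000, 62000, np.nan, 58000],
})

# Listwise deletion: drop every row that has at least one missing value
listwise = df.dropna()

# Average imputation: fill each gap with the mean of the observed values in that column
mean_imputed = df.fillna(df.mean())
```

Listwise deletion keeps only the complete rows, so it can shrink a small dataset quickly; average imputation keeps every row but reduces the variance of the imputed column.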

11.
What is data profiling?

This question is asked to see our overall understanding of data profiling, not just a brief bookish definition. Hence we need to proceed accordingly.

Data profiling is a methodology which involves analyzing entire set of entities present across data to a greater depth. The main objective of this process is to provide highly accurate information based on the data and its attributes such as the datatype, frequency or occurrence, and more.

This process involves following key steps:

Collect the data from one or more sources
Perform initial data cleansing work to consolidate and unify data sets from different sources
Eventually run profiling tasks to collect statistics on the data and at last identify characteristics and issues

12.
What are some of the popular tools used in Big Data?

This question is asked to gauge our awareness of the tools, frameworks and technologies used for big data and relevant processes. It would be ideal to briefly describe what each tool is used for along with listing the tools.

Hadoop – used to efficiently store and process large datasets
Spark - unified analytics engine for large-scale data processing
Hive - allows users to read, write, and manage petabytes of data using SQL-like query languages
Flume – intended for high volume data ingestion to Hadoop of event-based data
Mahout – designed to create scalable machine learning algorithms
Flink - unified stream-processing and batch-processing framework
Tableau - interactive data visualization software
Microsoft PowerBI - interactive data visualization software developed by Microsoft
QlikView - business analytics platform

13.
What is time series analysis ? What are its components?

This question is asked not just to check our understanding of Time Series Analysis but also its various components.

Time series analysis is a statistical method that deals with an ordered sequence of values of a variable at equally spaced time intervals. It is a widely used technique while working with trend analysis and time series data in particular.

Components of TSA (Time Series Analysis) include:

Long-term movement or trend
Short-term movements - seasonal variations and cyclic variations
Random or irregular movements

14.
What are outliers and what are different types of outliers?

This question is intended to assess our knowledge towards the topic of outliers including the different types.

An outlier is a value in a dataset which is considered to be far away from the mean of the characteristic feature of the dataset. I.e. a value that is much larger or smaller in a set of data.

For example – in following set of numbers 2 and 98 are outliers

2, 38, 40, 41, 44, 46, 98

There are two types of outliers:

Univariate – scenario that consists of an extreme value on one variable
Multivariate - combination of unusual scores on at least two variables

15.
How are outliers detected?

This question is intended to assess our knowledge of outlier detection techniques, so we should talk about at least two of the most widely used methods.

Multiple methodologies can be used for detecting outliers, but two most commonly used methods are as follows:

Standard Deviation Method: Here a value is considered an outlier if it lies more than three standard deviations below or above the mean
Box Plot Method: Here a value is considered an outlier if it lies more than 1.5 times the IQR (Interquartile Range) below the first quartile or above the third quartile
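Both rules can be sketched with NumPy on the sample set from the previous answer. This is an illustrative sketch; the variable names are assumptions, and quartiles are computed with NumPy's default linear interpolation.

```python
import numpy as np

values = np.array([2, 38, 40, 41, 44, 46, 98])  # the sample set used in the previous answer

# Box plot (IQR) method: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Standard deviation method: flag points more than 3 standard deviations from the mean
z_scores = (values - values.mean()) / values.std()
sd_outliers = values[np.abs(z_scores) > 3]
```

On this sample the IQR rule flags 2 and 98, while the three-standard-deviation rule flags nothing: with only seven points, no single value can sit that far from the mean. For small datasets the box plot method is therefore often the more practical choice.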

16.
What is the K-means algorithm?

Along with the definition and how it works ensure to talk about the meaning behind ‘K’ used in K-means algorithm.

K-means algorithm clusters data into different sets based on how close the data points are to each other. The number of clusters is indicated by ‘K’ in the K-means algorithm.

It tries to maintain a good amount of separation between each of the clusters. However, since it works in an unsupervised nature, the clusters will not have any sorts of labels to work with.
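The idea can be sketched in a few lines of NumPy. This is a simplified illustration rather than a production implementation: the function name `k_means`, the fixed iteration count and the sample points are all assumptions made for the example.

```python
import numpy as np

def k_means(points, k, n_iter=10, seed=0):
    """Minimal K-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return centroids, labels

# Two well-separated groups of 2-D points (invented for illustration)
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids, labels = k_means(points, k=2)
```

The returned labels are just cluster indices (0 or 1 here); they carry no meaning beyond group membership, which reflects the unsupervised nature of the algorithm.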

17.
What is hypothesis testing and what are different types of it, explain with examples?

So main focus behind this question would be not just to see our understanding of types of hypothesis testing, but mainly to see our understanding towards how they’re used in real world scenario.

Hypothesis testing is the procedure used by statisticians and scientists to accept or reject statistical hypotheses.

Null Hypothesis: It states that there’s no relation between predictor and outcome variables in the population. It is denoted by H0. For example, there’s no association between a patient's BMI and diabetes.

Alternative Hypothesis: It states that there’s some relation between the predictor and outcome variables in the population. It is denoted by H1. For example, there is an association between a patient's BMI and diabetes.

18.
What are the common problems that data analysts encounter during analysis?

The objective behind this question is to see how well we understand the data analysis process end to end, including the problems that data analysts commonly face on a daily basis.

Some of the common problems that data analysts encounter:

Handling duplicate and missing values
Collecting the right, meaningful data at the right time
Making data secure and dealing with compliance issues
Handling data purging and storage problems

19.
What are the uses of the Pivot table?

This question is asked to take a sneak peek into our depth of knowledge when it comes to Microsoft Excel sheets. We need to highlight the uses, advantages and capabilities of pivot tables here.

Uses of the Pivot table include:

Pivot tables are one of the powerful and key features of Excel sheet
They are leveraged to see comparisons, patterns, and trends in your data
They are employed to be able to view and summarize entirety of large datasets in a simple manner
They help with quick creation of reports through simple drag-and-drop operations

20.
What are the ways to filter the data?

Here we need to list down all possible ways to filter the data in Excel.

As per management requirement
Period wise filter - week wise, month and year wise
Current period comparison with any period (past/future) - this quarter’s performance Vs Last quarter’s performance
Filter for error summary - for example there’s abnormal increase in raw material wastage, then we can filter for error summary and highlight it in the report for management
Filter for performance - for example while analyzing company’s growth

21.
What is data security ? Why is it important?

Awareness and knowledge about data security and measures taken to ensure that are equally important for data analysts. This question is intended to check for that aspect.

Data Security safeguards digital data from unwanted access, corruption, or theft. Data security is critical to public and private sector organizations because there’s legal and moral obligation to protect users and a reputational risk of a data breach.

Protecting the data from internal or external corruption and illegal access helps to protect an organization from reputational harm, financial loss, consumer trust degradation, and brand erosion.

22.
What is a primary key and foreign key in SQL ? Explain their relation with Child and Parent tables.

Through this answer we need to convey along with formal definitions of primary key and foreign key, our practical knowledge about them when we speak of SQL.

A PRIMARY KEY is a column or a group of columns in a table that uniquely identifies the rows of data in that table.
A FOREIGN KEY is a column or group of columns in one table, that refers to the PRIMARY KEY in another table. It maintains referential integrity in the database.
Table with the FOREIGN KEY is called the child table, and a table with a PRIMARY KEY is called a reference or parent table.
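The parent/child relationship can be sketched with Python's built-in sqlite3 module. The table names, column names and rows below are hypothetical, chosen only to illustrate referential integrity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is enabled

# Parent table: customer_id is the PRIMARY KEY
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
# Child table: customer_id is a FOREIGN KEY referencing the parent
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount REAL
    )
""")

conn.execute("INSERT INTO customers VALUES (1, 'Asha')")
conn.execute("INSERT INTO orders VALUES (100, 1, 250.0)")  # valid: customer 1 exists

try:
    conn.execute("INSERT INTO orders VALUES (101, 99, 80.0)")  # customer 99 does not exist
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # referential integrity blocked the orphan row
```

The second insert fails because the child row points at a parent key that does not exist, which is exactly the guarantee a FOREIGN KEY provides.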

23.
What is the difference between data joining and data blending?

Here interviewer would be happy to listen if we explain the differences through examples.

Data blending allows a combination of data from different data sources to be linked. Whereas, Data Joining works only with data from one and the same source.

For example: If the data is from an Excel sheet and a SQL database, then Data Blending is the only option to combine the two types of data. However if the data is from two excel sheets, you can use either data blending or data joining.

Data blending is also the only choice available when ‘joining’ the tables is impractical. This occurs when the dataset is very large, when joins might create duplicate data, or when using data sources such as Salesforce and Cubes which do not support joins.

24.
What are Eigenvectors and Eigenvalues?

Eigenvectors: Eigenvectors are basically used to understand linear transformations. These are calculated for a correlation or a covariance matrix.

For definition purposes, you can say that Eigenvectors are the directions along which a specific linear transformation acts either by flipping, compressing or stretching.

Eigenvalues: Eigenvalues can be referred to as the strength of the transformation or the factor by which the compression occurs in the direction of eigenvectors.

Let $A$ be an $n \times n$ matrix.

An eigenvector of $A$ is a nonzero vector $v$ in $\mathbb{R}^n$ such that $Av = \lambda v$ for some scalar $\lambda$.
An eigenvalue of $A$ is a scalar $\lambda$ such that the equation $Av = \lambda v$ has a nontrivial solution.

If $Av = \lambda v$ for $v \neq 0$, we say that $\lambda$ is the eigenvalue for $v$, and that $v$ is an eigenvector for $\lambda$.
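The defining relation can be checked numerically with NumPy. The matrix below is an arbitrary symmetric example chosen for illustration.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # example matrix chosen for illustration

# Columns of `eigenvectors` are the eigenvectors, paired with `eigenvalues` by position
eigenvalues, eigenvectors = np.linalg.eig(A)

# Check A v = lambda v for every eigenpair
checks = [np.allclose(A @ v, lam * v) for lam, v in zip(eigenvalues, eigenvectors.T)]
```

Each eigenvector is only stretched by its eigenvalue when multiplied by A, which is the sense in which eigenvalues measure the strength of the transformation along those directions.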

25.
What is hierarchical clustering?

Here, along with the definition and an understanding of clustering, let’s explain why it is done, i.e. its objective.

Hierarchical clustering or hierarchical cluster analysis, is an algorithm that groups similar objects into common groups called clusters.
The goal is to create a set of clusters, where each cluster is different from the other and, individually, they contain similar entities.

Advanced

1.
What are the criteria to say whether a developed data model is good or not?

A good model should be intuitive, insightful and self-explanatory
It should be derived from the correct data points and sources
The model developed should be easy for clients to consume for actionable and profitable results
A good model should easily adapt to changes according to business requirements
If the data gets updated, the model should be able to scale according to the new data
A good model provides predictable performance
A good data model will display minimal redundancy with regard to repeated entry types, data redundancy, and many-to-many relationships

2.
What is the difference between WHERE clause and HAVING clause?

WHERE clause	HAVING clause
It works on row data	It works on aggregated data
In this clause, the filter occurs before any groupings are made	This is used to filter values from a group
SELECT column1, column2,.. FROM table_name WHERE condition;	SELECT column_name(s) FROM table_name WHERE condition GROUP BY column_name(s) HAVING condition ORDER BY column_name(s)
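The difference can be seen by running both kinds of query against a small in-memory SQLite table. The `sales` table and its rows are hypothetical, made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 100), ("North", 300), ("South", 50), ("South", 40), ("East", 500)],
)

# WHERE filters individual rows before any grouping happens
rows_over_60 = conn.execute(
    "SELECT region, amount FROM sales WHERE amount > 60 ORDER BY amount"
).fetchall()

# HAVING filters whole groups after aggregation
regions_over_200 = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region HAVING SUM(amount) > 200 ORDER BY region"
).fetchall()
```

The first query returns three individual rows; the second returns one row per region whose total exceeds 200, so the South region drops out even though it has rows in the table.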

3.
What is sampling, explain with a real world example and What are different types of sampling techniques used by data analysts?

Sampling is a statistical method to select a subset of data from an entire dataset (population) to estimate the characteristics of the whole population.

For example, if you are researching the opinions of students in your university, you could survey a sample of 100 students. In statistics, sampling allows you to test a hypothesis about the characteristics of a population.

Different types of sampling techniques:

Simple random sampling
Systematic sampling
Cluster sampling
Stratified sampling
Judgmental or purposive sampling
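Three of these techniques can be sketched in pandas as follows. The student population here is invented for illustration, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical student population: 60 undergraduates and 40 postgraduates
population = pd.DataFrame({
    "student_id": range(100),
    "level": ["UG"] * 60 + ["PG"] * 40,
})

# Simple random sampling: every student has the same chance of selection
simple = population.sample(n=10, random_state=42)

# Systematic sampling: every 10th student after a fixed start
systematic = population.iloc[::10]

# Stratified sampling: draw 10% from each level so the proportions are preserved
stratified = population.groupby("level").sample(frac=0.1, random_state=42)
```

The stratified sample keeps the 60/40 split of the population, which a simple random sample of the same size does not guarantee.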

4.
Why is Naive Bayes called ‘naive’?

It is called naive because it assumes that all features are equally important and independent of each other given the class. This assumption rarely holds in a real world scenario.

5.
What are the disadvantages of data analytics?

This tricky question is asked to check whether you see other side of things and whether you’re aware of any demerits of analytics.

Ensure that you talk about disadvantages in such a way that you’re aware of them and can take care of them. Answer shouldn’t sound like you’re throwing lot of complaints and don’t make unjustified claims.

Compared to the many advantages that data analytics offers, there are very few disadvantages or demerits.

There’s a possibility of data analytics leading to a breach in customer privacy, exposing information such as transactions, subscriptions and purchases
We should note that some of the tools used for data analytics are a bit complex and might require prior training to enable their usage
At times selecting the right analytics tool can get tricky, as it takes a lot of skill and expertise to choose well

6.
Explain the limitation of context filters in tableau

Whenever we set a context filter, Tableau generates a temp table that needs to refresh each and every time, whenever the view is triggered. So, if the context filter is changed in the database, it needs to recompute the temp table, so the performance will be decreased.

7.
What is statistical analysis and which statistical methods have you used in data analysis?

Statistical analysis is a scientific tool that helps collect and analyze large amounts of data to identify common patterns and trends to convert them into meaningful information. In simple words, statistical analysis is a data analysis tool that helps draw meaningful conclusions from raw and unstructured data.

Statistical methods used in data analysis:

Mean
Standard Deviation
Regression
Variance
Sample size
Descriptive and inferential statistics

8.
How to find duplicates in a column in Microsoft Excel?

Use Conditional Formatting to highlight duplicate values. Alternatively, use the COUNTIF function as shown below. For example, values are stored in cells D4:D7. The absolute references keep the range fixed when the formula is copied down the column.

=COUNTIF($D$4:$D$7,D4)

Apply filter on the column wherein you applied the COUNTIF function and select values greater than 1.

9.
Write the Python code to load the data using Pandas, fetch basic information about the data.

#Load the required libraries 
import pandas as pd 
import numpy as np 
import seaborn as sns 
#Load the data 
df = pd.read_csv('titanic.csv') 
#View the data 
df.head()

The df.info() function will give us the basic information about the dataset.

#Basic information 
df.info() 
#Describe the data 
df.describe()

df.info() shows the number of non-null values, the data types and the memory usage, while df.describe() returns descriptive statistics for the numeric columns.

10.
Find the number of unique values in the column in the above mentioned data.

You can find the unique values in a particular column using the unique() function in pandas.

#unique values  
df['Pclass'].unique() 
df['Survived'].unique() 
df['Sex'].unique()

array([3, 1, 2], dtype=int64) 
array([0, 1], dtype=int64) 
array(['male', 'female'], dtype=object)

The unique() function returns the unique values present in the column. To get the number of unique values, use nunique() instead, for example df['Pclass'].nunique().

11.
What is CDA and what are the different steps involved?

When it comes to data analysis more often than not we talk about EDA. This question is thrown to see our in-depth knowledge in data analysis, as CDA is lesser known than EDA.

Confirmatory Data Analysis i.e. CDA, is the process that involves evaluation of your evidence using some of the traditional statistical tools such as significance, inference, and confidence.

Confirmatory Data Analysis involves various steps including: testing hypotheses, producing estimates with a specified level of precision, RA (regression analysis), and variance analysis.

Different steps involved in CDA process include:

Defining each individual construct.
Overall measurement model theory development
Designing a study with the intent to produce the empirical results.
Assessing the measurement model validity.

12.
What is an N-gram?

This question is intended to test your knowledge on computational linguistics and probability. Along with a formal definition, it would be advisable to explain it with the help of an example to showcase your knowledge about it.

An N-gram is a contiguous sequence of N items in a given text or speech. An N-gram language model is a probabilistic model that predicts the next item in a sequence from the preceding N-1 items.
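As a small illustration, N-grams can be extracted from a tokenized sentence with a sliding window. The helper name `ngrams` and the sample sentence are assumptions made for this sketch.

```python
def ngrams(tokens, n):
    """Return all contiguous n-item sequences from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
bigrams = ngrams(tokens, 2)   # N = 2
trigrams = ngrams(tokens, 3)  # N = 3
```

A bigram model built on such pairs would estimate how likely "brown" is to follow "quick" by counting how often the pair occurs in a corpus.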

13.
How should you tackle multi-source problems?

This question is asked to get your idea about multi-source data analysis.

We should start with explanation of multi-source data. Then go on about how would you tackle multi-source problems.
Multi-source data by characteristics is dynamic, heterogeneous, complex, distributed and very large.
When it comes to multi-source problems, each source might contain bad or dirty data and the data in the sources might be represented differently, contradict or overlap.
To tackle multi-source problems, you need to identify similar data records, and combine them into one record that will contain all the useful attributes minus the redundancy.

14.
What are the generally observed missing patterns?

Missing patterns include:

Missing at random
Unobserved input variable missing
Missing due to some particular missing value

15.
Is it possible to highlight Cells Containing Negative Values in an Excel Sheet? If yes, how?

This question assesses your practical knowledge on Excel sheet. Hence we need to explain with appropriate steps required to meet the given objective.

Yes, it is possible to highlight cells with negative values in Microsoft Excel. Steps to do that are as follows:

In the Excel menu, go to the Home option and click on Conditional Formatting.
Within the Highlight Cells Rules option, click on Less Than.
In the dialog box that opens, select a value below which you want to highlight cells.
You can choose the highlight color in the dropdown menu.
Hit OK.

16.
What is a collision in a hash table and how can it be avoided?

Objective of the interviewer here would be to assess your knowledge on data structures by having a discussion about hash tables. Here explanation with diagrammatic representation would be advisable.

In a hash table, a collision occurs when two keys are hashed to the same index. Since every slot in a hash table is supposed to store a single element, collisions are a problem.

Chaining is a technique used for resolving collisions in hash tables.

The hash table is an array of linked lists as per chaining approach i.e., each index has its own linked list. All key-value pairs mapping to the same index will be stored in the linked list of that index.
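A minimal sketch of chaining in Python is shown below. The class name `ChainedHashTable` is hypothetical, and a Python list stands in for the linked list at each index to keep the example short.

```python
class ChainedHashTable:
    """Hash table sketch that resolves collisions by chaining."""

    def __init__(self, size=4):
        # One bucket per index; a Python list stands in for the linked list
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))  # colliding keys share the bucket

    def get(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)

table = ChainedHashTable(size=4)
table.put(1, "one")
table.put(5, "five")  # 5 % 4 == 1 % 4, so both keys land in bucket 1
```

Both keys remain retrievable even though they hash to the same index, because lookups walk the bucket and compare keys.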

17.
When you are creating a statistical model, what is overfitting and how do you prevent it?

Here we need to talk about statistical model overfitting by making use of graphical representation. Also better to explain model overfitting prevention techniques in detail to demonstrate our expertise with statistical modelling.

Overfitting is a scenario, or rather a modeling error in statistics that occurs when a function is too closely aligned to a limited set of data points.

Some of the techniques used to prevent overfitting are:

Early stopping: It stops the training when parameter updates no longer yield improvements on a validation set
Cross validation: A statistical method of evaluating and comparing learning algorithms by dividing data into two segments, i.e. one used to learn or train a model and the other one that’s used to validate the model
Data augmentation: It is a set of methods or techniques that are used to increase the amount of data by adding slightly modified copies of existing data or newly created synthetic data from already existing data
Ensembling: Usage of multiple learning algorithms with intent to obtain better predictive performance, than that could be obtained from any of the constituent learning algorithms alone.

18.
What is skewness and what are left-skewed and right-skewed distributions ? Explain with real world examples.

Here again, the interviewer wants to see our practical knowledge, hence we need to explain skewness using some real world examples.

Skewness measures the lack of symmetry in data distribution.

A left-skewed distribution is one where the left tail is longer than the right tail, for example scores on an easy exam where most students score high. It is important to note that in this case, typically:

mean < median < mode

Similarly, a right-skewed distribution is one where the right tail is longer than the left one, for example household income. Here, typically:

mean > median > mode
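The right-skewed case can be illustrated numerically. The income figures below are invented for the example, and the skewness formula used is the simple moment-based coefficient.

```python
import numpy as np

# Invented monthly incomes: most are modest, one is very high (long right tail)
incomes = np.array([30, 32, 35, 35, 38, 40, 250])

mean = incomes.mean()
median = np.median(incomes)

# Moment-based skewness: average cubed deviation divided by the cubed standard deviation
skewness = np.mean((incomes - mean) ** 3) / incomes.std() ** 3
```

The single large value pulls the mean well above the median, and the skewness coefficient comes out positive, which is the signature of a right-skewed distribution.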

19.
What are a few important ways to improve the performance of tableau?

This is a very tricky question, in the sense for data visualization we would have used tableau as a tool. Interviewer wants to see how well we’ve used the tool and are aware of its cons.

Some of the ways to improve the performance of tableau are:

Use an Extract to make workbooks run faster
Reduce the scope of data to decrease the volume of data
Reduce the number of marks on the view to avoid information overload
Hide unused fields
Use Context filters
Use indexing in tables and use the same fields for filtering

20.
What is the difference between a treemap and heatmap?

This question is asked to assess our knowledge on Tableau. We need to explain the differences through practical knowledge rather than just theoretical definitions.

A heatmap is a two dimensional representation of information with the help of colors. Heatmaps can help the user visualize simple or complex information.

Treemaps are ideal for displaying large amounts of hierarchically structured (tree-structured) data. The space in visualization is split into rectangles that are sized and ordered by a quantitative variable.

21.
Explain what P-value tells about statistical significance?

This is mostly straight forward question asked to validate our depth in statistics as a data analyst. We should always include point about its range during our explanation.

The p-value of a statistical test helps to decide whether to reject or fail to reject the null hypothesis.

$0 \le p\text{-value} \le 1$

The p-value lies in the range $[0, 1]$. The significance threshold is commonly set to $0.05$. When the p-value is below $0.05$, the null hypothesis is rejected.

22.
Given a dataset of test scores, write Python code using Pandas library to return cumulative bucketed scores of <40, <70, <85, <100.

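import pandas as pd

def bucket_test_scores(df):
    # pd.cut intervals are right-inclusive by default, so a score of 40 lands in '<40'
    bins = [0, 40, 70, 85, 100]
    labels = ['<40', '<70', '<85', '<100']
    df['test score'] = pd.cut(df['test score'], bins, labels=labels, include_lowest=True)
    return df  # bucketed column; counts per bucket can then be accumulated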

23.
What are some of the limitations of Python?

This answer will give a view about your command over Python as a programming language which is must as a data analyst.

Python is limited in a few ways, including:

Memory consumption - Python is not great for memory intensive applications
Mobile development - Though Python is great for desktop and server applications, it is weaker for mobile development
Speed - Studies have shown that Python is slower than compiled languages like C++ and Java. However, there are options to make Python faster, like a custom runtime.
Version V2 vs V3 - Python 2 and Python 3 are incompatible

24.
Explain the Constraints in SQL.

This answer will give a view about your fluency over SQL as a query language which is absolutely necessary as a data analyst. Constraints in SQL are used to specify rules for data in the table.

NOT NULL: Ensures that a column cannot have a NULL value
UNIQUE: Ensures that all values in a column are different. It maintains the uniqueness of a column in a table. More than one UNIQUE column can be used in a table.
PRIMARY KEY: A combination of NOT NULL and UNIQUE that uniquely identifies each row in the table, thereby ensuring faster access to the table
FOREIGN KEY: This constraint creates a link between two tables by one specific column of both tables. This is used to uniquely identify row/record in another table
CHECK: This constraint controls the values in the associated column and ensures that all values in a column satisfy a specific condition
DEFAULT: Each column must contain a value (including a NULL). This constraint sets a default value for a column when no value is specified
INDEX: Used to create and retrieve the data from the database very quickly.

25.
A coin was flipped 1000 times, and 550 times it showed heads. Do you think the coin is biased?

Here rather than hurrying, we need to give ourselves some time and think about statistical methods that can be applied to be able to tell whether coin is biased as this question is about probability theory and statistical concepts.

To answer this question let's say X is the number of heads and let's assume that the coin is not biased. Since each individual flip is a Bernoulli random variable, we can assume it has a probability of showing up heads as p = 0.5, so this will lead to the following expected number of heads:
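One way to carry this calculation through is sketched below, assuming the normal approximation to the binomial distribution and the 0.05 significance threshold mentioned earlier. The variable names are chosen for this illustration.

```python
import math

n, heads, p = 1000, 550, 0.5  # figures from the question; p = 0.5 is the fair-coin assumption

expected = n * p                       # expected number of heads
std_dev = math.sqrt(n * p * (1 - p))   # standard deviation of the binomial count
z = (heads - expected) / std_dev       # how many standard deviations 550 is from 500

# Two-sided p-value under the normal approximation to the binomial
p_value = math.erfc(abs(z) / math.sqrt(2))
biased = p_value < 0.05
```

The expected count is 500 heads with a standard deviation of about 15.8, so 550 heads is a little over three standard deviations away. The resulting p-value is far below 0.05, so under these assumptions we would reject the hypothesis that the coin is fair.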

1. What is data? Explain with a real world example.

Here the interviewer wants to assess your basic knowledge of data and how well you understand its practical aspects. Hence we need to answer with our understanding of all kinds of data, along with a real world scenario, to showcase in-depth practical knowledge.

Data are collected observations or measurements represented as text, numbers or multimedia. Data can be field notes, photographs, documents, audio recordings, videos and transcripts.

Data differs depending on your area of work or research. If your objective is to find out the graduation rates of college students with faculty mentors, your data might be the number of graduates each year and the amount of time taken to graduate. Hence data will be different based on what you study.

2. What are the categories of data? Explain with examples.

Here they don’t expect just the theoretical definition of the categories of data; they also check whether you are aware of data’s application in the real world. We need to exhibit the same.

Data can be broadly categorized as qualitative and quantitative.

Quantitative Data: This data can be expressed as a number, counted or compared on numerical scale. Examples include number of attendees at an event, count of words in a book, temperatures observed, land measurements gathered and gradient scales from surveys.
Qualitative Data: This data is non-numerical or categorical in nature and describes the attributes or properties that an object possesses such as social class, marital status, method of treatment etc. Examples include maps, transcripts, pictures and textual descriptions.

3. Is data the same as statistics? What are the benefits of analyzing the data?

Data and statistics are sometimes used interchangeably, so here the interviewer wants to see that you clearly understand data and do not confuse it with statistics. We need to answer accordingly and also explain the outcome of data analysis.

Data is not the same as statistics. Statistics are the result of data analysis and interpretation, so we can’t use the two words interchangeably. Analyzing and interpreting the data can help you:

Identify patterns and trends
Offer solutions
Understand scientific phenomena

4. What are must-have data analyst skills?

Here the intent is to check your awareness of a data analyst’s skill set. It is better to answer with separate categories, so that this awareness is conveyed clearly. Must-have data analyst skills include both the soft skills and the hard skills needed to perform data analysis efficiently.

Soft Skills:

Communication
Critical Thinking
Story Telling
Decision making
Fast coding
Collaboration

Hard Skills:

Linear algebra and Calculus
SQL and NoSQL
Matlab, R and Python
Microsoft Excel
Data Visualization

5. What are the various steps involved in any analytics project?

Here the interviewer wants to evaluate your understanding of the entire data analysis process, that is, all the steps of any analytics project. Hence explain all the steps accordingly.

Understanding the problem – Identify the right question to solve, understand expectations
Data collection – This step is about gathering accurate data from all sources. Examples include opinion polls, sales records and surveys etc.
Data cleaning – This includes removal or fixing of incomplete, corrupted, incorrect, wrongly formatted or duplicate data
Data exploration and analysis (EDA) - Objective of this step is to analyze and investigate the data and summarize its main characteristics
Interpret the results – This involves employing data visualization techniques ultimately to discover trends, patterns or to cross check assumptions

6. Which technical tools have you used for analysis and presentation purposes?

Here you need to list all the tools and frameworks you have used to perform end-to-end data analysis. This should include programming languages, tools used for data cleaning and EDA, data visualization tools, query languages etc. As a data analyst, you are expected to have knowledge of the tools below for analysis and presentation purposes.

Python
Tableau
Microsoft Excel
MySQL
Microsoft PowerPoint or equivalent
Microsoft SQL Server
IBM SPSS

7. What is the objective and significance of EDA (Exploratory Data Analysis)?

As EDA is one of the most important steps in data analysis, the answer we give to this question shows our overall in-depth understanding of EDA and its contribution to the data analysis process. Exploratory Data Analysis, or EDA, is one of the important steps in the data processing life cycle: it is a data exploration technique used to understand the various aspects of the data. It also helps to spot redundancies in the data so that they can be filtered out.

Exploratory Data Analysis helps to understand the data better
It helps you to gain confidence in your data to a point where you are ready to apply a machine learning algorithm
It allows you to refine your selection of feature variables that will be used later for building the model
You can discover hidden patterns, trends and insights from the data

8. What are the best practices for data cleaning?

Here the interviewer wants to know whether you know the standard practices followed for data cleaning and also some of the preventive mechanisms. Some of the best practices used in data cleaning are:

Prepare a data cleaning plan by understanding where the common errors take place, and keep communications open
Identify and remove duplicates before working with the data. This ensures the data analysis process is effective
As a data analyst, it is our responsibility to focus on the accuracy of the data. Also maintain the value types of the data, provide mandatory constraints and set cross-field validation
Standardize the data at the point of entry so that it is less chaotic, leading to fewer errors on entry

9. What is data validation?

Here we need to explain our understanding of data validation, including the steps or processes involved. We should start with the formal definition and then talk about the processes involved. Data validation is the process of determining the accuracy of the data at hand and the quality of its sources. There are many processes in the data validation step, and the main ones are data screening and data verification.

Data Screening: This process makes use of a variety of models to ensure that the data under consideration is accurate and that no redundancies are present.
Data Verification: If a redundancy is found during screening, it is evaluated in multiple steps, and a decision is then taken on whether to keep the data item.

10. How can you handle missing values in a dataset?

This question is asked to assess our knowledge of corrective mechanisms for missing values in a dataset. Missing values are one of the big problems in real life scenarios. They arise when no information is provided for one or more items or for a whole unit. Some of the ways to handle missing values in a dataset are:

Listwise Deletion: In this method, an entire record is excluded from analysis if any single value is missing
Average Imputation: Use the average value of the responses from the other participants to fill in the missing value
Regression Substitution: One can use multiple-regression analysis to estimate a missing value
Multiple Imputation: It creates plausible values based on the correlations for the missing data and then averages the simulated datasets by incorporating random errors in your prediction
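
The first two methods can be sketched in a few lines of pandas. The tiny survey table below is made-up data used only for illustration, and regression substitution and multiple imputation are left out because they need a fitted model.

```python
import pandas as pd

# Hypothetical survey data with gaps (None becomes NaN).
df = pd.DataFrame({
    "age":    [25, None, 35, 45],
    "income": [30000, 42000, None, 60000],
})

# Listwise deletion: drop every record that has any missing value.
listwise = df.dropna()

# Average imputation: replace each missing value with its column mean.
mean_imputed = df.fillna(df.mean())

print(listwise)
print(mean_imputed)
```

Listwise deletion keeps only the two complete records, while average imputation keeps all four rows and fills each gap with the mean of the observed values in that column.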

11. What is data profiling?

This question is asked to see our overall understanding of data profiling, not just a brief bookish definition, so we need to answer accordingly. Data profiling is a methodology that involves analyzing, in depth, all the entities present across the data. The main objective of this process is to provide highly accurate information about the data and its attributes, such as the datatype, frequency of occurrence, and more. This process involves the following key steps:

Collect the data from one or more sources
Perform initial data cleansing work to consolidate and unify data sets from different sources
Eventually run profiling tasks to collect statistics on the data and at last identify characteristics and issues

12. What are some of the popular tools used in Big Data?

Hadoop – used to efficiently store and process large datasets
Spark - unified analytics engine for large-scale data processing
Hive - allows users to read, write, and manage petabytes of data using SQL-like query languages
Flume – intended for high volume data ingestion to Hadoop of event-based data
Mahout – designed to create scalable machine learning algorithms
Flink - unified stream-processing and batch-processing framework
Tableau - interactive data visualization software
Microsoft PowerBI - interactive data visualization software developed by Microsoft
QlikView - business analytics platform

13. What is time series analysis ? What are its components?

This question is asked to check our understanding not just of time series analysis but also of its various components. Time series analysis is a statistical method that deals with an ordered sequence of values of a variable at equally spaced time intervals. It is widely used for trend analysis and for time series data in particular. Components of TSA (Time Series Analysis) include:

Long-term movement or trend
Short-term movements - seasonal variations and cyclic variations
Random or irregular movements

14. What are outliers and what are different types of outliers?

This question is intended to assess our knowledge of outliers, including the different types. An outlier is a value in a dataset that lies far away from the mean of the characteristic feature of the dataset, i.e. a value that is much larger or smaller than the rest of the data. For example, in the following set of numbers, 2 and 98 are outliers:

2, 38, 40, 41, 44, 46, 98

There are two types of outliers:

Univariate – scenario that consists of an extreme value on one variable
Multivariate - combination of unusual scores on at least two variables

15. How are outliers detected?

This question is intended to assess our knowledge of outlier detection techniques, so we should talk about at least the two most widely used methodologies. Multiple methodologies can be used for detecting outliers, but the two most commonly used are as follows:

Standard Deviation Method: Here, a value is considered an outlier if it lies more than three standard deviations below or above the mean
Box Plot Method: Here, a value is considered an outlier if it lies more than 1.5 times the IQR (Interquartile Range) below the first quartile or above the third quartile
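
Both methods can be sketched with Python's standard statistics module, applied to the example set from the previous question. The function names are illustrative, and the quartiles use the "inclusive" interpolation method, which is one of several conventions.

```python
import statistics

data = [2, 38, 40, 41, 44, 46, 98]   # example set from the previous question


def iqr_outliers(values):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


def std_dev_outliers(values, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * sd]


print("Box plot method:", iqr_outliers(data))
print("Standard deviation method:", std_dev_outliers(data))
```

On a set this small the box plot rule flags 2 and 98, while the three standard deviation rule flags nothing, because the extreme values themselves inflate the standard deviation. That difference is worth mentioning when comparing the two methods.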

16. What is the K-means algorithm?

Along with the definition and how it works, be sure to talk about the meaning of the ‘K’ in the K-means algorithm. The K-means algorithm clusters data into different sets based on how close the data points are to each other. The number of clusters is indicated by ‘K’. The algorithm tries to maintain a good amount of separation between the clusters. However, since it is unsupervised, the clusters will not have any labels to work with.
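
To make the idea concrete, here is a minimal pure-Python sketch of the usual assign-then-update loop (Lloyd's algorithm). Seeding the centroids with the first K points and running a fixed number of iterations are simplifying assumptions; library implementations use better initialization and convergence checks.

```python
import math


def k_means(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points seed the centroids (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, assignment


# Two visibly separate groups of made-up 2D points.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

The returned labels are just cluster indices (0 and 1 here); they carry no meaning beyond group membership, which reflects the unsupervised nature of the algorithm.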

17. What is hypothesis testing and what are its different types? Explain with examples.

The main focus behind this question is not just our understanding of the types of hypothesis, but mainly our understanding of how they are used in real world scenarios. Hypothesis testing is the procedure used by statisticians and scientists to decide whether to reject a statistical hypothesis.

Null Hypothesis: It states that there’s no relation between the predictor and outcome variables in the population. It is denoted by H0. For example, there’s no association between a patient's BMI and diabetes.
Alternative Hypothesis: It states that there’s some relation between the predictor and outcome variables in the population. It is denoted by H1. For example, there is an association between a patient's BMI and diabetes.

18. What are the common problems that data analysts encounter during analysis?

Handling duplicate and missing values
Collecting the right, meaningful data at the right time
Making data secure and dealing with compliance issues
Handling data purging and storage problems

19. What are the uses of the Pivot table?

This question is asked to take a sneak peek into our depth of knowledge of Microsoft Excel. We need to highlight the uses, advantages and capabilities of pivot tables here. Uses of the Pivot table include:

Pivot tables are one of the most powerful key features of Excel
They are leveraged to see comparisons, patterns, and trends in your data
They are used to view and summarize large datasets in their entirety in a simple manner
They help with quick creation of reports through simple drag-and-drop operations

20. What are the ways to filter the data?

Here we need to list all the possible ways to filter the data in Excel.

As per management requirement
Period-wise filter - by week, month or year
Comparison of the current period with any other period (past/future) - this quarter’s performance vs last quarter’s performance
Filter for error summary - for example, if there’s an abnormal increase in raw material wastage, we can filter for the error summary and highlight it in the report for management
Filter for performance - for example, while analyzing the company’s growth

21. What is data security ? Why is it important?

Awareness of data security and of the measures taken to ensure it is equally important for data analysts, and this question checks for that. Data security safeguards digital data from unwanted access, corruption, or theft. It is critical to public and private sector organizations because there is a legal and moral obligation to protect users, and a data breach carries reputational risk. Protecting the data from internal or external corruption and illegal access helps protect an organization from reputational harm, financial loss, loss of consumer trust, and brand erosion.

22. What is a primary key and foreign key in SQL ? Explain their relation with Child and Parent tables.

Through this answer we need to convey, along with the formal definitions of primary key and foreign key, our practical knowledge of them in SQL.

A PRIMARY KEY is a column or a group of columns in a table that uniquely identifies the rows of data in that table.
A FOREIGN KEY is a column or group of columns in one table, that refers to the PRIMARY KEY in another table. It maintains referential integrity in the database.
Table with the FOREIGN KEY is called the child table, and a table with a PRIMARY KEY is called a reference or parent table.
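
The parent and child relationship can be illustrated with a short sketch using Python's built-in sqlite3 module. The customers and orders tables are hypothetical, and note that SQLite only enforces foreign keys once the foreign_keys pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when enabled

# Parent table: its PRIMARY KEY is what child rows point to.
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
# Child table: customer_id is a FOREIGN KEY referencing the parent.
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers (customer_id),
        amount      REAL
    )
""")

conn.execute("INSERT INTO customers VALUES (1, 'Asha')")
conn.execute("INSERT INTO orders VALUES (100, 1, 250.0)")      # parent exists: accepted

try:
    conn.execute("INSERT INTO orders VALUES (101, 99, 80.0)")  # no customer 99
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print("orphan order rejected:", orphan_rejected)
```

The second order is refused because it points to a customer that does not exist in the parent table, which is what referential integrity means in practice.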

23. What is the difference between data joining and data blending?

Here the interviewer would be happy if we explain the differences through examples. Data blending allows data from different data sources to be combined, whereas data joining works only with data from one and the same source. For example, if the data is from an Excel sheet and a SQL database, then data blending is the only option to combine the two. However, if the data is from two Excel sheets, you can use either data blending or data joining. Data blending is also the only choice available when ‘joining’ the tables is impractical. This happens when the dataset is huge, when joins might create duplicate data, or when using data sources such as Salesforce and Cubes, which do not support joins.

24. What are Eigenvectors and Eigenvalues?

At times we might ignore the theory of statistics and algebra involved in the data analysis process. Through this answer we need to demonstrate how well acquainted we are with the fundamentals.

Eigenvectors: Eigenvectors are used to understand linear transformations. In data analysis they are usually calculated for a correlation or covariance matrix. For definition purposes, you can say that eigenvectors are the directions along which a specific linear transformation acts by flipping, compressing or stretching.
Eigenvalues: Eigenvalues can be referred to as the strength of the transformation, or the factor by which the compression or stretching occurs in the direction of the eigenvectors.

Let A be an n × n matrix.

An eigenvector of A is a nonzero vector v in R^n such that Av = λv, for some scalar λ.
An eigenvalue of A is a scalar λ such that the equation Av = λv has a nontrivial solution.

If Av = λv for v ≠ 0, we say that λ is the eigenvalue for v, and that v is an eigenvector for λ.
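
The defining equation Av = λv can be checked numerically with a small sketch. The 2 × 2 symmetric matrix below is an arbitrary example, and the closed-form eigenvalue formula used here applies only to that special case; in practice a linear algebra library would be used.

```python
import math


def eigen_2x2_symmetric(a, b, d):
    """Eigenvalues of [[a, b], [b, d]] from the characteristic polynomial."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean - radius, mean + radius


def mat_vec(matrix, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


A = [[2.0, 1.0], [1.0, 2.0]]
small, large = eigen_2x2_symmetric(2.0, 1.0, 2.0)

v = [1.0, 1.0]        # candidate eigenvector for the larger eigenvalue
Av = mat_vec(A, v)    # should equal large * v

print("eigenvalues:", small, large)
print("A v =", Av, " lambda v =", [large * x for x in v])
```

Here A stretches the direction (1, 1) by the larger eigenvalue and leaves the direction (1, -1) scaled by the smaller one, which matches the description of eigenvectors as directions and eigenvalues as strengths.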

25.
What is hierarchical clustering?

Here, along with the definition and an understanding of clustering, let’s explain why it is done, i.e. its objective.

Hierarchical clustering or hierarchical cluster analysis, is an algorithm that groups similar objects into common groups called clusters.
The goal is to create a set of clusters, where each cluster is different from the other and, individually, they contain similar entities.

Advanced

1.
What are the criteria to say whether a developed data model is good or not?

A good model should be intuitive, insightful and self-explanatory
It should be derived from the correct data points and sources
The model developed should be easily consumable by the clients for actionable and profitable results
A good model should easily adapt to changes according to business requirements
If the data gets updated, the model should be able to scale according to the new data
A good model provides predictable performance
A good data model will display minimal redundancy with regard to repeated entry types, data redundancy, and many-to-many relationships

2.
What is the difference between WHERE clause and HAVING clause?

WHERE clause	HAVING clause
It works on row data	It works on aggregated data
In this clause, the filter occurs before any groupings are made	This is used to filter values from a group
SELECT column1, column2,.. FROM table_name WHERE condition;	SELECT column_name(s) FROM table_name WHERE condition GROUP BY column_name(s) HAVING condition ORDER BY column_name(s)
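
The difference is easiest to see by running both clauses on the same data. The sketch below uses Python's built-in sqlite3 module with a made-up sales table; the table and column names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 100), ("North", 300), ("South", 50), ("South", 60), ("East", 500)],
)

# WHERE filters individual rows before any grouping happens.
where_rows = conn.execute(
    "SELECT region, amount FROM sales WHERE amount > 90 ORDER BY region, amount"
).fetchall()

# HAVING filters whole groups after aggregation.
having_rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region "
    "HAVING SUM(amount) > 200 ORDER BY region"
).fetchall()

print(where_rows)
print(having_rows)
```

The WHERE query drops the two small South rows one by one, while the HAVING query drops the South group as a whole because its total is too low.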

3.
What is sampling? Explain with a real world example. What are the different types of sampling techniques used by data analysts?

Sampling is a statistical method of selecting a subset of data from an entire dataset (population) to estimate the characteristics of the whole population. For example, an opinion poll surveys a few thousand voters to estimate the preferences of the whole electorate.

Different types of sampling techniques:

Simple random sampling
Systematic sampling
Cluster sampling
Stratified sampling
Judgmental or purposive sampling
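
Three of these techniques can be sketched with Python's standard random module. The population of 100 numbered customers and the odd/even strata are invented for illustration, and the fixed seed is there only to make the run repeatable.

```python
import random

population = list(range(1, 101))    # hypothetical customer IDs 1..100
rng = random.Random(42)             # fixed seed so the run is repeatable

# Simple random sampling: every member has the same chance of selection.
simple = rng.sample(population, 10)

# Systematic sampling: pick a random start, then take every k-th member.
step = len(population) // 10
start = rng.randrange(step)
systematic = population[start::step]

# Stratified sampling: sample separately from each subgroup (stratum).
strata = {
    "odd": [x for x in population if x % 2 == 1],
    "even": [x for x in population if x % 2 == 0],
}
stratified = {name: rng.sample(members, 5) for name, members in strata.items()}

print(simple)
print(systematic)
print(stratified)
```

Cluster sampling and judgmental sampling are not shown: the first needs naturally occurring groups to pick from, and the second depends on an expert's choice rather than on a random draw.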

4.
Why is Naive Bayes called ‘naive’?

5.
What are the disadvantages of data analytics?

This tricky question is asked to check whether you see the other side of things and are aware of the demerits of analytics.

Compared to the many advantages that data analytics offers, there are very few disadvantages or demerits.

Data analytics can lead to a breach of customer privacy, exposing information such as transactions, subscriptions and purchases
Some of the tools used for data analytics are a bit complex and might require prior training
Selecting the right analytics tool can be tricky, as it takes a lot of skill and expertise

6.
Explain the limitation of context filters in tableau

7.
What is statistical analysis and which statistical methods have you used in data analysis?

Statistical methods used in data analysis:

Mean
Standard Deviation
Regression
Variance
Sample size
Descriptive and inferential statistics

8.
How to find duplicates in a column in Microsoft Excel?

Use Conditional Formatting to highlight duplicate values. Alternatively, use the COUNTIF function as shown below. For example, suppose the values are stored in cells D4:D7.

=COUNTIF($D$4:$D$7,D4)

The absolute reference ($D$4:$D$7) keeps the range fixed when the formula is copied down the column. Apply a filter on the column where you entered the COUNTIF formula and select values greater than 1.

9.
Write the Python code to load the data using Pandas, fetch basic information about the data.

# Load the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
# Load the data
df = pd.read_csv('titanic.csv')
# View the data
df.head()

The df.info() method gives us basic information about the dataset.

# Basic information
df.info()
# Describe the data
df.describe()

Using df.info(), you can see the number of non-null values, data types, and memory usage, while df.describe() adds descriptive statistics.

10.
Find the number of unique values in the column in the above mentioned data.

You can list the unique values in a particular column using the unique() method in pandas; the nunique() method returns how many there are.

#unique values  
df['Pclass'].unique() 
df['Survived'].unique() 
df['Sex'].unique()

array([3, 1, 2], dtype=int64) 
array([0, 1], dtype=int64) 
array(['male', 'female'], dtype=object)

The unique() method has returned the unique values present in each column.

11.
What is CDA and what are the different steps involved?

When it comes to data analysis, more often than not we talk about EDA. This question is asked to see our in-depth knowledge of data analysis, as CDA is less well known than EDA.

Confirmatory Data Analysis, i.e. CDA, is the process of evaluating your evidence using traditional statistical tools such as significance, inference, and confidence.

Confirmatory Data Analysis involves various steps including: testing hypotheses, producing estimates with a specified level of precision, RA (regression analysis), and variance analysis.

Different steps involved in the CDA process include:

Defining each individual construct.
Developing the overall measurement model theory.
Designing a study with the intent to produce empirical results.
Assessing the measurement model validity.

12.
What is an N-gram?

An N-gram is a contiguous sequence of N items from a given text or speech. An N-gram model is a probabilistic language model that predicts the next item in a sequence from the previous N-1 items.
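
Extracting N-grams from a tokenized sentence takes only a few lines. This is a minimal sketch with an illustrative helper name and a made-up sentence; it shows the sequences themselves, not the probabilistic model built on top of them.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = "the quick brown fox jumps".split()
bigrams = ngrams(words, 2)     # N = 2
trigrams = ngrams(words, 3)    # N = 3

print(bigrams)
print(trigrams)
```

A language model would then count how often each N-gram occurs in a large corpus and use those counts to estimate which item is most likely to come next.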

13.
How should you tackle multi-source problems?

This question is asked to get your ideas about multi-source data analysis.

We should start with an explanation of multi-source data and then go on to how we would tackle multi-source problems.
Multi-source data is characteristically dynamic, heterogeneous, complex, distributed and very large.
When it comes to multi-source problems, each source might contain bad or dirty data, and the data in the sources might be represented differently, contradict or overlap.
To tackle multi-source problems, you need to identify similar data records, and combine them into one record that will contain all the useful attributes minus the redundancy.

14.
What are the generally observed missing patterns?

Missing patterns include:

Missing at random
Missing that depends on an unobserved input variable
Missing that depends on the missing value itself

15.
Is it possible to highlight Cells Containing Negative Values in an Excel Sheet? If yes, how?

This question assesses your practical knowledge of Excel. Hence we need to explain the steps required to meet the given objective.

Yes, it is possible to highlight cells with negative values in Microsoft Excel. Steps to do that are as follows:

In the Excel menu, go to the Home option and click on Conditional Formatting.
Within the Highlight Cells Rules option, click on Less Than.
In the dialog box that opens, enter 0 as the value below which cells should be highlighted.
You can choose the highlight color in the dropdown menu.
Hit OK.

16.
What is a collision in a hash table and how can it be avoided?

The interviewer’s objective here is to assess your knowledge of data structures through a discussion of hash tables. An explanation with a diagram is advisable.

In a hash table, a collision occurs when two keys are hashed to the same index. Since every slot in a hash table is supposed to store a single element, collisions are a problem.

Chaining is a technique used for handling collisions in hash tables: each slot holds a list of all the entries whose keys hash to that index.
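
A minimal sketch of chaining is shown below. The class name, the table size of 4 and the integer keys are assumptions chosen so that a collision is easy to see; a production hash table would also resize itself as it fills up.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining (a list per slot)."""

    def __init__(self, size=4):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:          # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))          # new key (possibly a collision): chain it

    def get(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)


table = ChainedHashTable(size=4)
table.put(1, "one")
table.put(5, "five")     # 1 % 4 == 5 % 4, so both keys land in slot 1

print(table.get(1), table.get(5))
print(table.buckets[1])
```

Both keys hash to the same slot, yet both values remain retrievable because the slot stores a small list of key and value pairs instead of a single element.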

17.
When you are creating a statistical model, what is overfitting and how do you prevent it?

Overfitting is a scenario, or rather a modeling error in statistics that occurs when a function is too closely aligned to a limited set of data points.

Some of the techniques used to prevent overfitting are:

Early stopping: Stop the training when parameter updates no longer yield improvements on a validation set
Cross validation: A statistical method of evaluating and comparing learning algorithms by dividing data into two segments, i.e. one used to learn or train a model and the other one that’s used to validate the model
Data augmentation: It is a set of methods or techniques that are used to increase the amount of data by adding slightly modified copies of existing data or newly created synthetic data from already existing data
Ensembling: Usage of multiple learning algorithms with the intent to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

18.
What is skewness and what are left-skewed and right-skewed distributions ? Explain with real world examples.

Here again, the interviewer wants to see our practical knowledge, hence we need to explain skewness using some real world examples.

Skewness measures the lack of symmetry in a data distribution.

A left-skewed distribution is one where the left tail is longer than the right tail, for example scores on an easy exam, where most students score high and a few score very low. In this case, typically:

mean < median < mode

Similarly, a right-skewed distribution is one where the right tail is longer than the left one, for example household income, where most households earn modest amounts and a few earn far more. Here, typically:

mean > median > mode
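
The ordering for a right-skewed distribution can be checked on a small made-up sample with Python's standard statistics module. The income figures below are invented purely to produce a long right tail.

```python
import statistics

# Hypothetical incomes (in thousands): mostly modest values plus one very large one.
incomes = [20, 20, 20, 25, 25, 30, 35, 40, 60, 200]

mean = statistics.fmean(incomes)      # pulled upward by the long right tail
median = statistics.median(incomes)   # middle of the sorted values
mode = statistics.mode(incomes)       # most frequent value

print(f"mean={mean}, median={median}, mode={mode}")
```

The single large value drags the mean well above the median, while the mode stays at the most common low value, giving mean > median > mode for this sample.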

19.
What are a few important ways to improve the performance of Tableau?

This is a tricky question, in the sense that we would have used Tableau as a data visualization tool. The interviewer wants to see how well we’ve used the tool and whether we are aware of its cons.

Some of the ways to improve the performance of Tableau are:

Use an Extract to make workbooks run faster
Reduce the scope of data to decrease the volume of data
Reduce the number of marks on the view to avoid information overload
Hide unused fields
Use Context filters
Use indexing in tables and use the same fields for filtering

20.
What is the difference between a treemap and heatmap?

This question is asked to assess our knowledge of Tableau. We need to explain the differences through practical knowledge rather than just theoretical definitions.

A heatmap is a two dimensional representation of information with the help of colors. Heatmaps can help the user visualize simple or complex information.

21.
What does the P-value tell us about statistical significance?

This is a mostly straightforward question asked to validate our depth in statistics as a data analyst. We should always include a point about its range in our explanation.

The P-value of a statistical test helps to decide whether or not to reject the null hypothesis.

$0 \le p\text{-value} \le 1$

The P-value lies in the range $[0, 1]$. The threshold for the P-value is commonly set to $0.05$. When the value is below $0.05$, the null hypothesis is rejected.

22.
Given a dataset of test scores, write Python code using Pandas library to return cumulative bucketed scores of <40, <70, <85, <100.

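import pandas as pd

def bucket_test_scores(df):
    bins = [0, 40, 70, 85, 100]
    labels = ['<40', '<70', '<85', '<100']
    # Bucket each score (right bin edge inclusive), then accumulate the counts in bin order
    buckets = pd.cut(df['test score'], bins, labels=labels, include_lowest=True)
    return buckets.value_counts(sort=False).cumsum()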

23.
What are some of the limitations of Python?

This answer shows your command of Python as a programming language, which is a must for a data analyst.

Python is limited in a few ways, including:

Memory consumption - Python is not great for memory intensive applications
Mobile development - Though Python is great for desktop and server applications, it is weaker for mobile development
Speed - Python is generally slower than compiled languages like C++ and Java. However, there are options to make Python faster, such as an alternative runtime like PyPy.
Version V2 vs V3 - Python 2 and Python 3 are incompatible, so code written for one often needs changes to run on the other

24.
Explain the Constraints in SQL.

This answer shows your fluency in SQL as a query language, which is essential for a data analyst. Constraints in SQL are used to specify rules for the data in a table.

NOT NULL: Ensures that a column cannot have a NULL value
UNIQUE: Ensures that all values in a column are different. It maintains the uniqueness of a column in a table. More than one UNIQUE column can be used in a table.
PRIMARY KEY: A combination of NOT NULL and UNIQUE. It uniquely identifies each row in the table, thereby ensuring faster access to the table
FOREIGN KEY: This constraint creates a link between two tables by one specific column of both tables. This is used to uniquely identify row/record in another table
CHECK: This constraint controls the values in the associated column and ensures that all values in a column satisfy a specific condition
DEFAULT: Each column must contain a value (including NULL). This constraint sets a default value for a column when no value is specified
INDEX: Used to retrieve data from the database very quickly. Strictly speaking, an index is a separate database object rather than a constraint, but it is often listed alongside them.
