Here's a list of top, expert-curated Data Analyst Interview questions and answers which will help you competently crack the data analysis, data analytics or BI(Business Intelligence) analysis related job interviews. With the drafted answers, you can confidently face questions related to job positions like Data Analyst, Business Intelligence Analyst, Data Analytics Consultant, Quantitative Analyst, Senior Data Analyst, Marketing Analyst, Lead Data Analyst, Operations Analyst, BI Consultant and Analytics Manager. Prepare well with these Data Analyst interview questions and answers and ace your next interview at organizations like Accenture, Oracle, Mu Sigma, Fractal Analytics, TCS, Cognizant, IBM, Infosys, Amazon, Capgemini, Genpact, JPMorgan Chase & Co. and Ernst &Young.
Here interviewer wants to assess your basic knowledge of data and how well you understand its practical aspects.
Hence we need to answer it with our understanding of all kind of data along with real world scenario to showcase in-depth practical knowledge.
Data are collected observations or measurements represented as text, numbers or multimedia. Data can be field notes, photographs, documents, audio recordings, videos and transcripts.
Data is different depending upon your area of work or research. If your objective is to find out graduation rates of college students with faculty mentors, your data might be the number of graduates each year and amount of time taken to complete the graduation. Hence data will be different based on what you study.
Sharpen your DevOps skills with these DevOps online classes.
Here they don’t expect you to just give the theoretical definition for categories of data, rather check whether you’re also aware of data’s application in real world.
We need to exhibit the same.
Data can be broadly categorized as qualitative and quantitative.
We have seen that sometimes data and statistics are used interchangeably, hence here interviewer wants to see your clear understanding of data and not be confused with statistics. We need to answer it accordingly. Also, we need to explain the outcome of data analysis.
Data is not the same as statistics. Statistics are the result of data analysis and interpretations, so we can’t use the two words interchangeably.
Analyzing and interpreting the data can help you:
Here intent would be to see the awareness about skillset of data analyst. Better to answer it with separate categories, so that awareness about skillset is conveyed clearly.
Must-have data analyst skills include both soft-skills and hard-skills to be able perform data analysis efficiently.
Here interviewer wants to evaluate your understanding of the entire data analysis process or all the steps of any analytics project. Hence explain all the steps accordingly.
Here you need to list down all possible tools and frameworks you would have used to perform end to end data analysis. This should include programming languages you might have used, tools used for data cleaning and EDA, data visualization tools, query languages etc.
Being a data analyst, you are expected to have knowledge of the below tools for analysis and presentation purposes. Attend this KnowledgeHut DevOps online classes to master DevOps skills.
As we know EDA is one of the very important step in data analysis, answer we give to this question depicts our overall in-depth understanding of EDA and it’s contribution towards data analysis process.
Exploratory Data Analysis or EDA is one of the important steps in the data processing life cycle and it is nothing but a data exploration technique to understand the various aspects of the data. It is basically used to filter the data from redundancies.
Here interviewer wants to know whether you know standard practices followed for data cleaning and also some of the preventive mechanisms. Some of the best practices used in data cleaning are:
Here we need to explain our understanding of data validation including the steps or processed involved.
We need to start with formal definition and then talk about processed involved.
Data validation is the process that involves validating or determining the accuracy of the data at hand and also the quality of the sources.
There are many processes in data validation step, and the main ones include data screening and data verification.
This question is asked to assess our knowledge of corrective mechanisms to address missing values in dataset.
Missing values in a dataset is one of the big problems in real life scenarios. This situation will arise when no information is provided for one or more items or for a whole unit.
Some of the ways to handle missing values in a dataset are:
This question asked to see our overall understanding of data profiling, not just brief bookish definition. Hence we need to proceed accordingly.
Data profiling is a methodology which involves analyzing entire set of entities present across data to a greater depth. The main objective of this process is to provide highly accurate information based on the data and its attributes such as the datatype, frequency or occurrence, and more.
This process involves following key steps:
This question is asked to take a look at our depth towards awareness of the all the tools, frameworks, technologies used for big data and relevant processes. It would be ideal to brief about what they’re used for along with listing down the tools.
This question is asked not just to check our understanding of Time Series Analysis but also its various components.
Time series analysis is a statistical method that deals with an ordered sequence of values of a variable at equally spaced time intervals. It is a widely used technique while working with trend analysis and time series data in particular.
Components of TSA (Time Series Analysis) include:
This question is intended to assess our knowledge towards the topic of outliers including the different types.
An outlier is a value in a dataset which is considered to be far away from the mean of the characteristic feature of the dataset. I.e. a value that is much larger or smaller in a set of data.
For example – in following set of numbers 2 and 98 are outliers
2, 38, 40, 41, 44, 46, 98
There are two types of outliers:
This question is intended to assess our knowledge of outlier detection
techniques, so accordingly we should be talking about at least two most widely used and popular methodologies.
Multiple methodologies can be used for detecting outliers, but two most commonly used methods are as follows:
Along with the definition and how it works ensure to talk about the meaning behind ‘K’ used in K-means algorithm.
K-means algorithm clusters data into different sets based on how close the data points are to each other. The number of clusters is indicated by ‘K’ in the K-means algorithm.
It tries to maintain a good amount of separation between each of the clusters. However, since it works in an unsupervised nature, the clusters will not have any sorts of labels to work with.
So main focus behind this question would be not just to see our understanding of types of hypothesis testing, but mainly to see our understanding towards how they’re used in real world scenario.
Hypothesis testing is the procedure used by statisticians and scientists to accept or reject statistical hypotheses.
Null Hypothesis: It states that there’s no relation between predictor and outcome variables in the population. It is denoted by HO. For example there’s no association between a patient's BMI and diabetes.
Alternative Hypothesis: It states that there’s some relation between the predictor and outcome variables in the population. It is denoted by H1. Example to this is there could be no association between a patient's BMI and diabetes.
Objective behind this question would be see how well we understand dataanalysis process end to end including problems faced on daily basis and some of the commonly faced problems by data analysts across the world.
Some of the common problems that data analysts encounter:
This question is asked to take a sneak peek into our depth of knowledge when it comes to Microsoft Excell sheets. We need to highlight uses, advantages and capabilities of pivot table here.
Uses of the Pivot table include:
Here we need to list down all possible ways to filter the data in Excel.
Awareness and knowledge about data security and measures taken to ensure that are equally important for data analysts. This question is intended to check for that aspect.
Data Security safeguards digital data from unwanted access, corruption, or theft. Data security is critical to public and private sector organizations because there’s legal and moral obligation to protect users and a reputational risk of a data breach.
Protecting the data from internal or external corruption and illegal access helps to protect an organization from reputational harm, financial loss, consumer trust degradation, and brand erosion.
Through this answer we need to convey along with formal definitions of primary key and foreign key, our practical knowledge about them when we speak of SQL.
Here interviewer would be happy to listen if we explain the differences through examples.
Data blending allows a combination of data from different data sources to be linked. Whereas, Data Joining works only with data from one and the same source.
For example: If the data is from an Excel sheet and a SQL database, then Data Blending is the only option to combine the two types of data. However if the data is from two excel sheets, you can use either data blending or data joining.
Data blending is also the only choice available when ‘joining’ the tables is impractical. This impracticality occurs when the dataset is humongous. When joins might create duplicate data or when using databases such as Salesforce and Cubes which do not support joins.
At times we might ignore theory of statistics and algebra involved during data analysis process. Through this answer we need to demonstrate how wellacquainted are we in terms of fundamentals of statistics.
Eigenvectors: Eigenvectors are basically used to understand linear transformations. These are calculated for a correlation or a covariance matrix.
For definition purposes, you can say that Eigenvectors are the directions along which a specific linear transformation acts either by flipping, compressing or stretching.
Eigenvalues: Eigenvalues can be referred to as the strength of the transformation or the factor by which the compression occurs in the direction of eigenvectors.
Let A be a n × n matrix.
If Av = λ v for v
0, we say that λ is the eigenvalue for v ,and that v is an eigenvector for λ .
Here along with definition and understanding of clustering, let’s explain why is it done, it’s objective.
|WHERE clause||HAVING clause|
It works on row data
It works on aggregated data
In this clause, the filter occurs before any groupings are made
This is used to filter values from a group
SELECT column1, column2,.. FROM table_name WHERE condition;
SELECT column_name(s) FROM table_name WHERE condition GROUP BY column_name(s) HAVING condition ORDER BY column_name(s)
Sampling is a statistical method to select a subset of data from an entire dataset (population) to estimate the characteristics of the whole population.
For example, if you are researching the opinions of students in your university, you could survey a sample of 100 students. In statistics, sampling allows you to test a hypothesis about the characteristics of a population.
Different types of sampling techniques:
It is called naive because it makes a general assumption that all the data present are unequivocally important and independent of each other. This is not true and won’t hold good in a real world scenario.
This tricky question is asked to check whether you see other side of things and whether you’re aware of any demerits of analytics.
Ensure that you talk about disadvantages in such a way that you’re aware of them and can take care of them. Answer shouldn’t sound like you’re throwing lot of complaints and don’t make unjustified claims.
Compared to N number of advantages that data analytics offers, there are a very few disadvantages or demerits.
Whenever we set a context filter, Tableau generates a temp table that needs to refresh each and every time, whenever the view is triggered. So, if the context filter is changed in the database, it needs to recompute the temp table, so the performance will be decreased.
Statistical analysis is a scientific tool that helps collect and analyze large amounts of data to identify common patterns and trends to convert them into meaningful information. In simple words, statistical analysis is a data analysis tool that helps draw meaningful conclusions from raw and unstructured data.
Statistical methods used in data analysis:
Use CONDITIONAL formatting to highlight duplicate values. Alternatively, use the COUNTIF function as shown below. For example, values are stored in cells D4:D7.
Apply filter on the column wherein you applied the COUNTIF function and select values greater than 1.
#Load the required libraries import pandas as pd import numpy as np import seaborn as sns #Load the data df = pd.read_csv('titanic.csv') #View the data df.head()
The df.info() function will give us the basic information about the dataset.
#Basic information df.info() #Describe the data df.describe()
Using this function, you can see the number of null values, data types, and memory usage as shown in the above outputs along with descriptive statistics.
You can find the number of unique values in the particular column using the unique() function in python.
#unique values df['Pclass'].unique() df['Survived'].unique() df['Sex'].unique()
array([3, 1, 2], dtype=int64) array([0, 1], dtype=int64) array(['male', 'female'], dtype=object)
The unique() function has returned the unique values which are present in the data
When it comes to data analysis more often than not we talk about EDA. This question is thrown to see our in-depth knowledge in data analysis, as CDA is lesser known than EDA.
Confirmatory Data Analysis i.e. CDA, is the process that involves evaluation of your evidence using some of the traditional statistical tools such as significance, inference, and confidence.
Confirmatory Data Analysis involves various steps including: testing hypotheses, producing estimates with a specified level of precision, RA (regression analysis), and variance analysis.
Different steps involved in CDA process include:
This question is intended to test your knowledge on computational linguistics and probability. Along with a formal definition, it would be advisable to explain it with the help of an example to showcase your knowledge about it.
An N-Gram is a connected sequence of N items in a given text or speech. Precisely, an N-gram is a probabilistic language model used to predict the next item in a particular sequence as in N-1.
This question is asked to get your idea about multi-source data analysis.
Missing patterns include:
This question assesses your practical knowledge on Excel sheet. Hence we need to explain with appropriate steps required to meet the given objective.
Yes, it is possible to highlight cells with negative values in Microsoft Excel. Steps to do that are as follows:
Objective of the interviewer here would be to assess your knowledge on data structures by having a discussion about hash tables. Here explanation with diagrammatic representation would be advisable.
In a hash table, a collision occurs when two keys are hashed to the same index. Since every slot in a hash table is supposed to store a single element, collisions are a problem.
Chaining is a technique used for avoiding collisions in hash tables.
The hash table is an array of linked lists as per chaining approach i.e., each index has its own linked list. All key-value pairs mapping to the same index will be stored in the linked list of that index.
Here we need to talk about statistical model overfitting by making use of graphical representation. Also better to explain model overfitting prevention techniques in detail to demonstrate our expertise with statistical modelling.
Overfitting is a scenario, or rather a modeling error in statistics that occurs when a function is too closely aligned to a limited set of data points.
Some of the techniques used to prevent overfitting are:
Here again, interviewer wants to see our practical knowledge hence we need explain skewness by taking some real world examples.
Skewness measures the lack of symmetry in data distribution.
A left-skewed distribution is one where a left tail is longer than that of the right tail. It is important to note that in this case:
mean < median < mode
Similarly, the right-skewed distribution is the one where the right tail is longer than the left one. But here:
mean > median > mode
This is a very tricky question, in the sense for data visualization we would have used tableau as a tool. Interviewer wants to see how well we’ve used the tool and are aware of its cons.
Some of the ways to improve the performance of tableau are:
This question is asked to assess our knowledge on Tableau. We need to explain the differences through practical knowledge rather than just theoretical definitions.
A heatmap is a two dimensional representation of information with the help of colors. Heatmaps can help the user visualize simple or complex information.
Treemaps are ideal for displaying large amounts of hierarchically structured (tree-structured) data. The space in visualization is split into rectangles that are sized and ordered by a quantitative variable.
This is mostly straight forward question asked to validate our depth in statistics as a data analyst. We should always include point about its range during our explanation.
P value for a statistical test helps to decide whether to accept or reject the null hypothesis.
0 <= p_value <= 1
P-value range is between [0,1][0,1] The threshold for P-value is set to be 0.050.05. When the value is below 0.050.05, the null-hypothesis is rejected.
def bucket_test_scores(df): bins = [0, 40, 70, 85, 100] labels=['<40','<70','<85' , '<100'] df['test score'] = pd.cut(df['test score'], bins,labels=labels)
This answer will give a view about your command over Python as a programming language which is must as a data analyst.
Python is limited in a few ways, including:
This answer will give a view about your fluency over SQL as a query language which is absolutely necessary as a data analyst. Constraints in SQL are used to specify rules for data in the table.
Here rather than hurrying, we need to give ourselves some time and think about statistical methods that can be applied to be able to tell whether coin is biased as this question is about probability theory and statistical concepts.
To answer this question let's say X is the number of heads and let's assume that the coin is not biased. Since each individual flip is a Bernoulli random variable, we can assume it has a probability of showing up heads as p = 0.5, so this will lead to the following expected number of heads:
A Data Analyst is the one who interprets data and turns it into information which can offer ways to improve a business, thus affecting business decisions. Data Analyst gathers information from various sources and interprets patterns and trends. Roles and responsibilities of data analyst include to import, inspect, clean, transform, validate, collect, process or interpret collections of data to stay connected to customers, drive innovation and product development.
According to glassdoor.com, the average base pay for a Data Analyst is $72,306 per year. A few of these companies are Google, Amazon, Cisco Systems, Qualcomm, IBM, JPMorgan Chase & Co and Meta etc.
If you are determined to ace your next interview as a Data Analyst, these Data Analyst interview questions and answers will fast-track your career. To relieve you of the worry and burden of preparation for your upcoming interviews, we have compiled the above list of interview questions for Data Analyst with answers prepared by industry experts. Being well versed with these commonly asked Data Analyst or Data Analytics interview questions will be your very first step towards a promising career as an Data Analyst/Analytics Consultant.
You can opt for various job profiles after mastering data analyst interview questions, including SQL interview questions for data analyst, excel interview questions for data analyst and data analytics interview questions. A few are listed below:
Crack your Data Analyst interview with ease and confidence!