Category

Courses

- Agile Methodology
- Certified ScrumMaster (CSM) Certification
- Certified Scrum Product Owner (CSPO) Certification
- Leading SAFe 6.0 Certification
- Professional Scrum Master-Advanced™ (PSM-A) Training
- SAFe 6.0 Scrum Master (SSM) Certification
- Implementing SAFe 6.0 (SPC) Certification
- SAFe 6.0 Release Train Engineer (RTE) Certification
- SAFe 6.0 Product Owner Product Manager (POPM) Certification
- ICP-ACC Certification
- Agile Master's Program
- Agile Excellence Master's Program
- Kanban Management Professional (KMP I: Kanban System Design) Certification
- Professional Scrum Product Owner I (PSPO I) Training
- View All Courses

Accreditation Bodies

- Project Management
- Project Management Professional (PMP) Certification
- PRINCE2 Certification
- PRINCE2 Foundation Certification
- PRINCE2 Practitioner Certification
- Change Management Training
- Project Management Techniques Training
- Certified Associate in Project Management (CAPM) Certification
- Program Management Professional (PgMP) Certification
- Portfolio Management Professional (PfMP) Certification
- Oracle Primavera P6 Certification
- Project Management Master's Program
- Microsoft Project Training
- View All Courses

Accreditation Bodies

- Data Science
- Data Science Bootcamp
- Data Engineer Bootcamp
- Data Analyst Bootcamp
- AI Engineer Bootcamp
- Data Science with Python Certification
- Python for Data Science
- Machine Learning with Python
- Data Science with R
- Machine Learning with R
- Deep Learning Certification Training
- Natural Language Processing (NLP)
- View All Courses

- DevOps
Accreditation Bodies

- Cloud Computing
- AWS Certified Solutions Architect - Associate
- Multi-Cloud Engineer Bootcamp
- AWS Cloud Practitioner Certification
- Developing on AWS
- AWS DevOps Certification
- Azure Solution Architect Certification
- Azure Fundamentals Certification
- Azure Administrator Certification
- Azure Data Engineer Certification
- Azure Devops Certification
- AWS Cloud Architect Master's Program
- AWS Certified SysOps Administrator Certification
- Azure Security Engineer Certification
- Azure AI Solution Certification Training
- View All Courses

Career TrackSupercharge your career with our Multi-Cloud Engineer Bootcamp

KNOW MORE - Web Development
- Full-Stack Developer Bootcamp
- UI/UX Design Bootcamp
- Full-Stack [Java Stack] Bootcamp
- Software Engineer Bootcamp
- Software Engineer Bootcamp (with PMI)
- Front-End Development Bootcamp
- Back-End Development Bootcamp
- React Training
- Node JS Training
- Angular Training (Version 12)
- Javascript Training
- PHP and MySQL Training
- View All Courses

- IT Service Management
- ITIL 4 Foundation Certification
- ITIL Practitioner Certification
- ISO 14001 Foundation Certification
- ISO 20000 Certification
- ISO 27000 Foundation Certification
- ITIL 4 Specialist: Create, Deliver and Support Training
- ITIL 4 Specialist: Drive Stakeholder Value Training
- ITIL 4 Strategist Direct, Plan and Improve Training
- View All Courses

- Programming
- BI And Visualization
- Blockchain
- Big Data
- Mobile App Development
- Software Testing
- Selenium Certification Training
- ISTQB Foundation Certification
- ISTQB Advanced Level Security Tester Training
- ISTQB Advanced Level Test Manager Certification
- ISTQB Advanced Level Test Analyst Certification
- ISTQB Advanced Level Technical Test Analyst Certification
- Silk Test Workbench Training
- Automation Testing using TestComplete Training
- Cucumber Training
- Functional Testing Using Ranorex Training
- Teradata Certification Training
- View All Courses

- Business Management
- Quality Management
- IT Security
- Cyber Security Bootcamp
- Certified Ethical Hacker (CEH v12) Certification
- Certified Information Systems Auditor (CISA) Certification
- Certified Information Security Manager (CISM) Certification
- Certified Information Systems Security Professional (CISSP) Certification
- Cybersecurity Master's Program
- Certified Cloud Security Professional (CCSP) Certification
- Certified Information Privacy Professional - Europe (CIPP-E) Certification
- Control Objectives for Information and Related Technology (COBIT5) Foundation
- Payment Card Industry Security Standards (PCI-DSS) Certification
- Introduction to Forensic

- Digital Marketing
- Risk Management
- Finance
- Credit Risk Management
- Budget Analysis and Forecasting
- International Financial Reporting Standards (IFRS) for SMEs
- Diploma In International Financial Reporting
- Certificate in International Financial Reporting
- Corporate Governance
- Finance for Non-Finance Managers
- Financial Modeling with Excel
- Auditing and Assurance

- Database
- Soft Skills Training
- CompTIA
- Master of Business Administration
- Other
- MS Excel 2010
- Advanced Excel 2013
- IoT
- Certified Supply Chain Professional
- Software Estimation and Measurement Using IFPUG FPA
- Software Size Estimation and Measurement using IFPUG FPA & SNAP
- Leading and Delivering World Class Product Development Course
- Product Management and Product Marketing for Telecoms IT and Software
- Foundation Certificate in Marketing
- Flow Measurement and Custody Transfer Training Course
- View All Courses

- Home
- Data Science
- Data Analyst Interview Questions and Answers for 2024

- 4.7 Rating
- 50 Question(s)
- 25 Mins of Read
- 6055 Reader(s)

Here's a list of top expert-curated Data Analyst Interview questions and answers which will help you competently crack the data analysis, data analytics or BI(Business Intelligence) analysis related job interviews. With the drafted answers, you can confidently face questions related to job positions like Data Analyst, Business Intelligence Analyst, Data Analytics Consultant, Quantitative Analyst, Senior Data Analyst, Marketing Analyst, Lead Data Analyst, Operations Analyst, BI Consultant and Analytics Manager. Prepare well with these Data Analyst interview questions and answers and ace your next interview at organizations like Accenture, Oracle, Mu Sigma, Fractal Analytics, TCS, Cognizant, IBM, Infosys, Amazon, Capgemini, Genpact, JPMorgan Chase & Co. and Ernst &Young.

- 4.7 Rating
- 50 Question(s)
- 25 Mins of Read
- 6055 Reader(s)

Filter By

Clear all

Here interviewer wants to assess your basic knowledge of data and how well you understand its practical aspects.

Hence we need to answer it with our understanding of all kind of data along with real world scenario to showcase in-depth practical knowledge.

Data are collected observations or measurements represented as text, numbers or multimedia. Data can be field notes, photographs, documents, audio recordings, videos and transcripts.

Data is different depending upon your area of work or research. If your objective is to find out graduation rates of college students with faculty mentors, your data might be the number of graduates each year and amount of time taken to complete the graduation. Hence data will be different based on what you study.

Sharpen your DevOps skills with these DevOps online classes.

Here they don’t expect you to just give the theoretical definition for categories of data, rather check whether you’re also aware of data’s application in real world.

We need to exhibit the same.

Data can be broadly categorized as qualitative and quantitative.

- Quantitative Data: This data can be expressed as a number, counted or compared on numerical scale. Examples include number of attendees at an event, count of words in a book, temperatures observed, land measurements gathered and gradient scales from surveys.
- Qualitative Data: This data is non-numerical or categorical in nature and describes the attributes or properties that an object possesses such as social class, marital status, method of treatment etc. Examples include maps, transcripts, pictures and textual descriptions.

We have seen that sometimes data and statistics are used interchangeably, hence here interviewer wants to see your clear understanding of data and not be confused with statistics. We need to answer it accordingly. Also, we need to explain the outcome of data analysis.

Data is not the same as statistics. Statistics are the result of data analysis and interpretations, so we can’t use the two words interchangeably.

Analyzing and interpreting the data can help you:

- Identify patterns and trends
- Offer solutions
- Understand scientific phenomena

Here intent would be to see the awareness about skillset of data analyst. Better to answer it with separate categories, so that awareness about skillset is conveyed clearly.

Must-have data analyst skills include both soft-skills and hard-skills to be able perform data analysis efficiently.

**Soft Skills:**

- Communication
- Critical Thinking
- Story Telling
- Decision making
- Fast coding
- Collaboration

**Hard Skills:**

- Linear algebra and Calculus
- SQL and NoSQL
- Matlab, R and Python
- Microsoft Excel
- Data Visualization

Here interviewer wants to evaluate your understanding of the entire data analysis process or all the steps of any analytics project. Hence explain all the steps accordingly.

- Understanding the problem – Identify the right question to solve, understand expectations
- Data collection – This step is about gathering accurate data from all sources. Examples include opinion polls, sales records and surveys etc.
- Data cleaning – This includes removal or fixing of incomplete, corrupted, incorrect, wrongly formatted or duplicate data
- Data exploration and analysis (EDA) - Objective of this step is to analyze and investigate the data and summarize its main characteristics
- Interpret the results – This involves employing data visualization techniques ultimately to discover trends, patterns or to cross check assumptions

Here you need to list down all possible tools and frameworks you would have used to perform end to end data analysis. This should include programming languages you might have used, tools used for data cleaning and EDA, data visualization tools, query languages etc.

Being a data analyst, you are expected to have knowledge of the below tools for analysis and presentation purposes. Attend this KnowledgeHut DevOps online classes to master DevOps skills.

- Python
- Tableau
- Microsoft Excel
- MySQL
- Microsoft PowerPoint or equivalent
- Microsoft SQL Server
- IBM SPSS

As we know EDA is one of the very important step in data analysis, answer we give to this question depicts our overall in-depth understanding of EDA and it’s contribution towards data analysis process.

Exploratory Data Analysis or EDA is one of the important steps in the data processing life cycle and it is nothing but a data exploration technique to understand the various aspects of the data. It is basically used to filter the data from redundancies.

- Exploratory Data Analysis helps to understand the data better
- It helps you to gain confidence in your data to a point where you are ready to apply a machine learning algorithm
- It allows you to refine your selection of feature variables that will be used later for building the model
- You can discover hidden patterns, trends and insights from the data

Here interviewer wants to know whether you know standard practices followed for data cleaning and also some of the preventive mechanisms. Some of the best practices used in data cleaning are:

- Preparation of data cleaning plan by understanding where the common errors take place and we need to keep communications open
- We need to identify and remove duplicates before we work with the data. This will ensure data analysis process is effective
- As a data analyst it’s our responsibility to focus on the accuracy of the data. Also maintain the value types of data, provide mandatory constraints and set cross-field validation
- We need to standardize the data at entry point so that it is less chaotic and you will be able ensure that all the information is standardized, leading to fewer errors on entry

Here we need to explain our understanding of data validation including the steps or processed involved.

We need to start with formal definition and then talk about processed involved.

Data validation is the process that involves validating or determining the accuracy of the data at hand and also the quality of the sources.

There are many processes in data validation step, and the main ones include data screening and data verification.

- Data Screening: This process is all about making use of a variety of models to ensure that the data under consideration is accurate and there are no redundancies present.
- Data Verification: If there is a redundancy found at the screening process, then it is evaluated based on multiple steps and later a call is taken to ensure the presence of the data item.

This question is asked to assess our knowledge of corrective mechanisms to address missing values in dataset.

Missing values in a dataset is one of the big problems in real life scenarios. This situation will arise when no information is provided for one or more items or for a whole unit.

Some of the ways to handle missing values in a dataset are:

- Listwise Deletion: In this method, an entire record is excluded from analysis if any single value is missing
- Average Imputation: Use the average value of the responses from the other participants to fill in the missing value
- Regression Substitution: One can use multiple-regression analysis to estimate a missing value
- Multiple Imputation: It creates plausible values based on the correlations for the missing data and then averages the simulated datasets by incorporating random errors in your prediction

This question asked to see our overall understanding of data profiling, not just brief bookish definition. Hence we need to proceed accordingly.

Data profiling is a methodology which involves analyzing entire set of entities present across data to a greater depth. The main objective of this process is to provide highly accurate information based on the data and its attributes such as the datatype, frequency or occurrence, and more.

This process involves following key steps:

- Collect the data from one or more sources
- Perform initial data cleansing work to consolidate and unify data sets from different sources
- Eventually run profiling tasks to collect statistics on the data and at last identify characteristics and issues

This question is asked to take a look at our depth towards awareness of the all the tools, frameworks, technologies used for big data and relevant processes. It would be ideal to brief about what they’re used for along with listing down the tools.

- Hadoop – used to efficiently store and process large datasets
- Spark - unified analytics engine for large-scale data processing
- Hive - allows users to read, write, and manage petabytes of data using SQL-like query languages
- Flume – intended for high volume data ingestion to Hadoop of event-based data
- Mahout – designed to create scalable machine learning algorithms
- Flink - unified stream-processing and batch-processing framework
- Tableau - interactive data visualization software
- Microsoft PowerBI - interactive data visualization software developed by Microsoft
- QlikView - business analytics platform

This question is asked not just to check our understanding of Time Series Analysis but also its various components.

Time series analysis is a statistical method that deals with an ordered sequence of values of a variable at equally spaced time intervals. It is a widely used technique while working with trend analysis and time series data in particular.

Components of TSA (Time Series Analysis) include:

- Long-term movement or trend
- Short-term movements - seasonal variations and cyclic variations
- Random or irregular movements

This question is intended to assess our knowledge towards the topic of outliers including the different types.

An outlier is a value in a dataset which is considered to be far away from the mean of the characteristic feature of the dataset. I.e. a value that is much larger or smaller in a set of data.

For example – in following set of numbers 2 and 98 are outliers

2, 38, 40, 41, 44, 46, 98

There are two types of outliers:

- Univariate – scenario that consists of an extreme value on one variable
- Multivariate - combination of unusual scores on at least two variables

This question is intended to assess our knowledge of outlier detection

techniques, so accordingly we should be talking about at least two most widely used and popular methodologies.

Multiple methodologies can be used for detecting outliers, but two most commonly used methods are as follows:

- Standard Deviation Method: Here the value is considered as an outlier if the value is lower or higher than three standard deviations from the mean value
- Box Plot Method: Here, a value is considered to be an outlier if it is lesser or higher than 1.5 times the IQR (Interquartile Range)

Along with the definition and how it works ensure to talk about the meaning behind ‘K’ used in K-means algorithm.

K-means algorithm clusters data into different sets based on how close the data points are to each other. The number of clusters is indicated by ‘K’ in the K-means algorithm.

It tries to maintain a good amount of separation between each of the clusters. However, since it works in an unsupervised nature, the clusters will not have any sorts of labels to work with.

So main focus behind this question would be not just to see our understanding of types of hypothesis testing, but mainly to see our understanding towards how they’re used in real world scenario.

Hypothesis testing is the procedure used by statisticians and scientists to accept or reject statistical hypotheses.

Null Hypothesis: It states that there’s no relation between predictor and outcome variables in the population. It is denoted by HO. For example there’s no association between a patient's BMI and diabetes.

Alternative Hypothesis: It states that there’s some relation between the predictor and outcome variables in the population. It is denoted by H1. Example to this is there could be no association between a patient's BMI and diabetes.

Objective behind this question would be see how well we understand dataanalysis process end to end including problems faced on daily basis and some of the commonly faced problems by data analysts across the world.

Some of the common problems that data analysts encounter:

- Handling duplicate and missing values
- Collecting the meaningful right data and the right time
- Making data secure and dealing with compliance issues
- Handling data purging and storage problems

This question is asked to take a sneak peek into our depth of knowledge when it comes to Microsoft Excell sheets. We need to highlight uses, advantages and capabilities of pivot table here.

Uses of the Pivot table include:

- Pivot tables are one of the powerful and key features of Excel sheet
- They are leveraged to see comparisons, patterns, and trends in your data
- They are employed to be able to view and summarize entirety of large datasets in a simple manner
- They help with quick creation of reports through simple drag-and-drop operations

Here we need to list down all possible ways to filter the data in Excel.

- As per management requirement
- Period wise filter - week wise, month and year wise
- Current period comparison with any period (past/future) - this quarter’s performance Vs Last quarter’s performance
- Filter for error summary - for example there’s abnormal increase in raw material wastage, then we can filter for error summary and highlight it in the report for management
- Filter for performance - for example while analyzing company’s growth

Awareness and knowledge about data security and measures taken to ensure that are equally important for data analysts. This question is intended to check for that aspect.

Data Security safeguards digital data from unwanted access, corruption, or theft. Data security is critical to public and private sector organizations because there’s legal and moral obligation to protect users and a reputational risk of a data breach.

Protecting the data from internal or external corruption and illegal access helps to protect an organization from reputational harm, financial loss, consumer trust degradation, and brand erosion.

Through this answer we need to convey along with formal definitions of primary key and foreign key, our practical knowledge about them when we speak of SQL.

- A PRIMARY KEY is a column or a group of columns in a table that uniquely identifies the rows of data in that table.
- A FOREIGN KEY is a column or group of columns in one table, that refers to the PRIMARY KEY in another table. It maintains referential integrity in the database.
- Table with the FOREIGN KEY is called the child table, and a table with a PRIMARY KEY is called a reference or parent table.

Here interviewer would be happy to listen if we explain the differences through examples.

Data blending allows a combination of data from different data sources to be linked. Whereas, Data Joining works only with data from one and the same source.

For example: If the data is from an Excel sheet and a SQL database, then Data Blending is the only option to combine the two types of data. However if the data is from two excel sheets, you can use either data blending or data joining.

Data blending is also the only choice available when ‘joining’ the tables is impractical. This impracticality occurs when the dataset is humongous. When joins might create duplicate data or when using databases such as Salesforce and Cubes which do not support joins.

At times we might ignore theory of statistics and algebra involved during data analysis process. Through this answer we need to demonstrate how wellacquainted are we in terms of fundamentals of statistics.

**Eigenvectors:**** **Eigenvectors are basically used to understand linear transformations. These are calculated for a correlation or a covariance matrix.

For definition purposes, you can say that Eigenvectors are the directions along which a specific linear transformation acts either by flipping, compressing or stretching.

**Eigenvalues:**** **Eigenvalues can be referred to as the strength of the transformation or the factor by which the compression occurs in the direction of eigenvectors.

Let A be a n × n matrix.

- An eigenvector of A is a nonzero vector v in Rn such that Av = λ v , for some scalar λ .
- An eigenvalue of A is a scalar λ such that the equation Av = λ v has a nontrivial solution.

If Av = λ v for v

≠≠

0, we say that λ is the eigenvalue for v ,and that v is an eigenvector for λ .

Here along with definition and understanding of clustering, let’s explain why is it done, it’s objective.

- Hierarchical clustering or hierarchical cluster analysis, is an algorithm that groups similar objects into common groups called clusters.
- The goal is to create a set of clusters, where each cluster is different from the other and, individually, they contain similar entities.

- A good model should be intuitive, insightful and self-explanatory
- It should be derived from the correct data points and sources
- The model developed should be able to easily consumed by the clients for actionable and profitable results
- A good model should easily adapt to changes according to business requirements
- If the data gets updated, the model should be able to scale according to the new data
- A good model provides predictable performance
- A good data model will display minimal redundancy with regard to repeated entry types, data redundancy, and many-to-many relationships

WHERE clause | HAVING clause |
---|---|

It works on row data | It works on aggregated data |

In this clause, the filter occurs before any groupings are made | This is used to filter values from a group |

SELECT column1, column2,.. FROM table_name WHERE condition; | SELECT column_name(s) FROM table_name WHERE condition GROUP BY column_name(s) HAVING condition ORDER BY column_name(s) |

Sampling is a statistical method to select a subset of data from an entire dataset (population) to estimate the characteristics of the whole population.

For example, if you are researching the opinions of students in your university, you could survey a sample of 100 students. In statistics, sampling allows you to test a hypothesis about the characteristics of a population.

Different types of sampling techniques:

- Simple random sampling
- Systematic sampling
- Cluster sampling
- Stratified sampling
- Judgmental or purposive sampling

This tricky question is asked to check whether you see other side of things and whether you’re aware of any demerits of analytics.

Ensure that you talk about disadvantages in such a way that you’re aware of them and can take care of them. Answer shouldn’t sound like you’re throwing lot of complaints and don’t make unjustified claims.

Compared to N number of advantages that data analytics offers, there are a very few disadvantages or demerits.

- There’s a possibility of data analytics leading to a breach in customer privacy and thereby their information such as transactions, subscriptions and purchases etc.
- We should note that some of the tools used for data analytics are bit complex and might require prior training to enable their usage
- At times selection of right analytics tool can get tricky as it takes a lot of skills and expertise to select the right tool

Statistical analysis is a scientific tool that helps collect and analyze large amounts of data to identify common patterns and trends to convert them into meaningful information. In simple words, statistical analysis is a data analysis tool that helps draw meaningful conclusions from raw and unstructured data.

Statistical methods used in data analysis:

- Mean
- Standard Deviation
- Regression
- Variance
- Sample size
- Descriptive and inferential statistics

Use CONDITIONAL formatting to highlight duplicate values. Alternatively, use the COUNTIF function as shown below. For example, values are stored in cells D4:D7.

=COUNTIF(D4:D7,D4)

Apply filter on the column wherein you applied the COUNTIF function and select values greater than 1.

#Load the required libraries import pandas as pd import numpy as np import seaborn as sns #Load the data df = pd.read_csv('titanic.csv') #View the data df.head()

The df.info() function will give us the basic information about the dataset.

#Basic information df.info() #Describe the data df.describe()

Using this function, you can see the number of null values, data types, and memory usage as shown in the above outputs along with descriptive statistics.

You can find the number of unique values in the particular column using the unique() function in python.

#unique values df['Pclass'].unique() df['Survived'].unique() df['Sex'].unique()

array([3, 1, 2], dtype=int64) array([0, 1], dtype=int64) array(['male', 'female'], dtype=object)

The unique() function has returned the unique values which are present in the data

When it comes to data analysis more often than not we talk about EDA. This question is thrown to see our in-depth knowledge in data analysis, as CDA is lesser known than EDA.

Confirmatory Data Analysis i.e. CDA, is the process that involves evaluation of your evidence using some of the traditional statistical tools such as significance, inference, and confidence.

Confirmatory Data Analysis involves various steps including: testing hypotheses, producing estimates with a specified level of precision, RA (regression analysis), and variance analysis.

Different steps involved in CDA process include:

- Defining each and every individual constructs.
- Overall measurement model theory development
- Designing a study with the intent to produce the empirical results.
- Assessing the measurement model validity.

This question is intended to test your knowledge on computational linguistics and probability. Along with a formal definition, it would be advisable to explain it with the help of an example to showcase your knowledge about it.

An N-Gram is a connected sequence of N items in a given text or speech. Precisely, an N-gram is a probabilistic language model used to predict the next item in a particular sequence as in N-1.

This question is asked to get your idea about multi-source data analysis.

- We should start with explanation of multi-source data. Then go on about how would you tackle multi-source problems.
- Multi-source data by characteristics is dynamic, heterogeneous, complex, distributed and very large.
- When it comes multi-source problems, each source might contain bad or dirty data and the data in the sources might be represented differently, contradict or overlap.
- To tackle multi-source problems, you need to identify similar data records, and combine them into one record that will contain all the useful attributes minus the redundancy.

Missing patterns include:

- Missing at random
- Unobserved input variable missing
- Missing due to some particular missing value

This question assesses your practical knowledge on Excel sheet. Hence we need to explain with appropriate steps required to meet the given objective.

Yes, it is possible to highlight cells with negative values in Microsoft Excel. Steps to do that are as follows:

- In the Excel menu, go to the
option and click on*Home*.*Conditional Formatting* - Within the
option, click on*Highlight Cells Rules*.*Less Than* - In the dialog box that opens, select a value below which you want to highlight cells.
- You can choose the highlight color in the dropdown menu.
- Hit
.*OK*

Objective of the interviewer here would be to assess your knowledge on data structures by having a discussion about hash tables. Here explanation with diagrammatic representation would be advisable.

In a hash table, a collision occurs when two keys are hashed to the same index. Since every slot in a hash table is supposed to store a single element, collisions are a problem.

Chaining is a technique used for avoiding collisions in hash tables.

The hash table is an array of linked lists as per chaining approach i.e., each index has its own linked list. All key-value pairs mapping to the same index will be stored in the linked list of that index.

Here we need to talk about statistical model overfitting by making use of graphical representation. Also better to explain model overfitting prevention techniques in detail to demonstrate our expertise with statistical modelling.

Overfitting is a scenario, or rather a modeling error in statistics that occurs when a function is too closely aligned to a limited set of data points.

Some of the techniques used to prevent overfitting are:

- Early stopping: It helps to stop the training when parameter updates no longer begin to yield improves on a validation set
- Cross validation: A statistical method of evaluating and comparing learning algorithms by dividing data into two segments, i.e. one used to learn or train a model and the other one that’s used to validate the model
- Data augmentation: It is a set of methods or techniques that are used to increase the amount of data by adding slightly modified copies of existing data or newly created synthetic data from already existing data
- Ensembling: Usage of multiple learning algorithms with intent to obtain better predictive performance, than that could be obtained from any of the constituent learning algorithms alone.

Here again, interviewer wants to see our practical knowledge hence we need explain skewness by taking some real world examples.

Skewness measures the lack of symmetry in data distribution.

A left-skewed distribution is one where a left tail is longer than that of the right tail. It is important to note that in this case:

mean < median < mode

Similarly, the right-skewed distribution is the one where the right tail is longer than the left one. But here:

mean > median > mode

This is a very tricky question, in the sense for data visualization we would have used tableau as a tool. Interviewer wants to see how well we’ve used the tool and are aware of its cons.

Some of the ways to improve the performance of tableau are:

- Use an Extract to make workbooks run faster
- Reduce the scope of data to decrease the volume of data
- Reduce the number of marks on the view to avoid information overload
- Hide unused fields
- Use Context filters
- Use indexing in tables and use the same fields for filtering

This question is asked to assess our knowledge on Tableau. We need to explain the differences through practical knowledge rather than just theoretical definitions.

A heatmap is a two dimensional representation of information with the help of colors. Heatmaps can help the user visualize simple or complex information.

Treemaps are ideal for displaying large amounts of hierarchically structured (tree-structured) data. The space in visualization is split into rectangles that are sized and ordered by a quantitative variable.

This is mostly straight forward question asked to validate our depth in statistics as a data analyst. We should always include point about its range during our explanation.

P value for a statistical test helps to decide whether to accept or reject the null hypothesis.

`0 <= p_value <= 1`

P-value range is between [0,1][0,1] The threshold for P-value is set to be 0.050.05. When the value is below 0.050.05, the null-hypothesis is rejected.

This answer will give a view about your command over Python as a programming language which is must as a data analyst.

Python is limited in a few ways, including:

- Memory consumption - Python is not great for memory intensive applications
- Mobile development - Though Python is great for desktop and server applications, it is weaker for mobile development
- Speed - Studies have shown that Python is slower than object oriented languages like C++ and Java. However, there are options to make Python faster, like a custom runtime.
- Version V2 vs V3 - Python 2 and Python 3 are incompatible

This answer will give a view about your fluency over SQL as a query language which is absolutely necessary as a data analyst. Constraints in SQL are used to specify rules for data in the table.

- NOT NULL: Ensures that a column cannot have a NULL value
- UNIQUE: Ensures that all values in a column are different. It maintains the uniqueness of a column in a table. More than one UNIQUE column can be used in a table.
- PRIMERY KEY: A combination of NOT NULL and UNIQUE, and uniquely identifies each row in the table thereby ensuring faster access to the table
- FOREIGN KEY: This constraint creates a link between two tables by one specific column of both tables. This is used to uniquely identify row/record in another table
- CHECK: This constraint controls the values in the associated column and ensures that all values in a column satisfy a specific condition
- DEFAULT: Each column must contain a value ( including a NULL) .This constraint sets a default value for a column when no value is specified
- INDEX: Used to create and retrieve the data from the database very quickly.

Here rather than hurrying, we need to give ourselves some time and think about statistical methods that can be applied to be able to tell whether coin is biased as this question is about probability theory and statistical concepts.

To answer this question let's say X is the number of heads and let's assume that the coin is not biased. Since each individual flip is a Bernoulli random variable, we can assume it has a probability of showing up heads as p = 0.5, so this will lead to the following expected number of heads:

Here interviewer wants to assess your basic knowledge of data and how well you understand its practical aspects.

Hence we need to answer it with our understanding of all kind of data along with real world scenario to showcase in-depth practical knowledge.

Data are collected observations or measurements represented as text, numbers or multimedia. Data can be field notes, photographs, documents, audio recordings, videos and transcripts.

Data is different depending upon your area of work or research. If your objective is to find out graduation rates of college students with faculty mentors, your data might be the number of graduates each year and amount of time taken to complete the graduation. Hence data will be different based on what you study.

Sharpen your DevOps skills with these DevOps online classes.

Here they don’t expect you to just give the theoretical definition for categories of data, rather check whether you’re also aware of data’s application in real world.

We need to exhibit the same.

Data can be broadly categorized as qualitative and quantitative.

- Quantitative Data: This data can be expressed as a number, counted or compared on numerical scale. Examples include number of attendees at an event, count of words in a book, temperatures observed, land measurements gathered and gradient scales from surveys.
- Qualitative Data: This data is non-numerical or categorical in nature and describes the attributes or properties that an object possesses such as social class, marital status, method of treatment etc. Examples include maps, transcripts, pictures and textual descriptions.

We have seen that sometimes data and statistics are used interchangeably, hence here interviewer wants to see your clear understanding of data and not be confused with statistics. We need to answer it accordingly. Also, we need to explain the outcome of data analysis.

Data is not the same as statistics. Statistics are the result of data analysis and interpretations, so we can’t use the two words interchangeably.

Analyzing and interpreting the data can help you:

- Identify patterns and trends
- Offer solutions
- Understand scientific phenomena

Here intent would be to see the awareness about skillset of data analyst. Better to answer it with separate categories, so that awareness about skillset is conveyed clearly.

Must-have data analyst skills include both soft-skills and hard-skills to be able perform data analysis efficiently.

**Soft Skills:**

- Communication
- Critical Thinking
- Story Telling
- Decision making
- Fast coding
- Collaboration

**Hard Skills:**

- Linear algebra and Calculus
- SQL and NoSQL
- Matlab, R and Python
- Microsoft Excel
- Data Visualization

Here interviewer wants to evaluate your understanding of the entire data analysis process or all the steps of any analytics project. Hence explain all the steps accordingly.

- Understanding the problem – Identify the right question to solve, understand expectations
- Data collection – This step is about gathering accurate data from all sources. Examples include opinion polls, sales records and surveys etc.
- Data cleaning – This includes removal or fixing of incomplete, corrupted, incorrect, wrongly formatted or duplicate data
- Data exploration and analysis (EDA) - Objective of this step is to analyze and investigate the data and summarize its main characteristics
- Interpret the results – This involves employing data visualization techniques ultimately to discover trends, patterns or to cross check assumptions

Here you need to list down all possible tools and frameworks you would have used to perform end to end data analysis. This should include programming languages you might have used, tools used for data cleaning and EDA, data visualization tools, query languages etc.

Being a data analyst, you are expected to have knowledge of the below tools for analysis and presentation purposes. Attend this KnowledgeHut DevOps online classes to master DevOps skills.

- Python
- Tableau
- Microsoft Excel
- MySQL
- Microsoft PowerPoint or equivalent
- Microsoft SQL Server
- IBM SPSS

As we know EDA is one of the very important step in data analysis, answer we give to this question depicts our overall in-depth understanding of EDA and it’s contribution towards data analysis process.

Exploratory Data Analysis or EDA is one of the important steps in the data processing life cycle and it is nothing but a data exploration technique to understand the various aspects of the data. It is basically used to filter the data from redundancies.

- Exploratory Data Analysis helps to understand the data better
- It helps you to gain confidence in your data to a point where you are ready to apply a machine learning algorithm
- It allows you to refine your selection of feature variables that will be used later for building the model
- You can discover hidden patterns, trends and insights from the data

Here interviewer wants to know whether you know standard practices followed for data cleaning and also some of the preventive mechanisms. Some of the best practices used in data cleaning are:

- Preparation of data cleaning plan by understanding where the common errors take place and we need to keep communications open
- We need to identify and remove duplicates before we work with the data. This will ensure data analysis process is effective
- As a data analyst it’s our responsibility to focus on the accuracy of the data. Also maintain the value types of data, provide mandatory constraints and set cross-field validation
- We need to standardize the data at entry point so that it is less chaotic and you will be able ensure that all the information is standardized, leading to fewer errors on entry

Here we need to explain our understanding of data validation including the steps or processed involved.

We need to start with formal definition and then talk about processed involved.

Data validation is the process that involves validating or determining the accuracy of the data at hand and also the quality of the sources.

There are many processes in data validation step, and the main ones include data screening and data verification.

- Data Screening: This process is all about making use of a variety of models to ensure that the data under consideration is accurate and there are no redundancies present.
- Data Verification: If there is a redundancy found at the screening process, then it is evaluated based on multiple steps and later a call is taken to ensure the presence of the data item.

This question is asked to assess our knowledge of corrective mechanisms to address missing values in dataset.

Missing values in a dataset is one of the big problems in real life scenarios. This situation will arise when no information is provided for one or more items or for a whole unit.

Some of the ways to handle missing values in a dataset are:

- Listwise Deletion: In this method, an entire record is excluded from analysis if any single value is missing
- Average Imputation: Use the average value of the responses from the other participants to fill in the missing value
- Regression Substitution: One can use multiple-regression analysis to estimate a missing value
- Multiple Imputation: It creates plausible values based on the correlations for the missing data and then averages the simulated datasets by incorporating random errors in your prediction

This question asked to see our overall understanding of data profiling, not just brief bookish definition. Hence we need to proceed accordingly.

Data profiling is a methodology which involves analyzing entire set of entities present across data to a greater depth. The main objective of this process is to provide highly accurate information based on the data and its attributes such as the datatype, frequency or occurrence, and more.

This process involves following key steps:

- Collect the data from one or more sources
- Perform initial data cleansing work to consolidate and unify data sets from different sources
- Eventually run profiling tasks to collect statistics on the data and at last identify characteristics and issues

This question is asked to take a look at our depth towards awareness of the all the tools, frameworks, technologies used for big data and relevant processes. It would be ideal to brief about what they’re used for along with listing down the tools.

- Hadoop – used to efficiently store and process large datasets
- Spark - unified analytics engine for large-scale data processing
- Hive - allows users to read, write, and manage petabytes of data using SQL-like query languages
- Flume – intended for high volume data ingestion to Hadoop of event-based data
- Mahout – designed to create scalable machine learning algorithms
- Flink - unified stream-processing and batch-processing framework
- Tableau - interactive data visualization software
- Microsoft PowerBI - interactive data visualization software developed by Microsoft
- QlikView - business analytics platform

This question is asked not just to check our understanding of Time Series Analysis but also its various components.

Time series analysis is a statistical method that deals with an ordered sequence of values of a variable at equally spaced time intervals. It is a widely used technique while working with trend analysis and time series data in particular.

Components of TSA (Time Series Analysis) include:

- Long-term movement or trend
- Short-term movements - seasonal variations and cyclic variations
- Random or irregular movements

This question is intended to assess our knowledge towards the topic of outliers including the different types.

An outlier is a value in a dataset which is considered to be far away from the mean of the characteristic feature of the dataset. I.e. a value that is much larger or smaller in a set of data.

For example – in following set of numbers 2 and 98 are outliers

2, 38, 40, 41, 44, 46, 98

There are two types of outliers:

- Univariate – scenario that consists of an extreme value on one variable
- Multivariate - combination of unusual scores on at least two variables

This question is intended to assess our knowledge of outlier detection

techniques, so accordingly we should be talking about at least two most widely used and popular methodologies.

Multiple methodologies can be used for detecting outliers, but two most commonly used methods are as follows:

- Standard Deviation Method: Here the value is considered as an outlier if the value is lower or higher than three standard deviations from the mean value
- Box Plot Method: Here, a value is considered to be an outlier if it is lesser or higher than 1.5 times the IQR (Interquartile Range)

Along with the definition and how it works ensure to talk about the meaning behind ‘K’ used in K-means algorithm.

K-means algorithm clusters data into different sets based on how close the data points are to each other. The number of clusters is indicated by ‘K’ in the K-means algorithm.

It tries to maintain a good amount of separation between each of the clusters. However, since it works in an unsupervised nature, the clusters will not have any sorts of labels to work with.

So main focus behind this question would be not just to see our understanding of types of hypothesis testing, but mainly to see our understanding towards how they’re used in real world scenario.

Hypothesis testing is the procedure used by statisticians and scientists to accept or reject statistical hypotheses.

Null Hypothesis: It states that there’s no relation between predictor and outcome variables in the population. It is denoted by HO. For example there’s no association between a patient's BMI and diabetes.

Alternative Hypothesis: It states that there’s some relation between the predictor and outcome variables in the population. It is denoted by H1. Example to this is there could be no association between a patient's BMI and diabetes.

Objective behind this question would be see how well we understand dataanalysis process end to end including problems faced on daily basis and some of the commonly faced problems by data analysts across the world.

Some of the common problems that data analysts encounter:

- Handling duplicate and missing values
- Collecting the meaningful right data and the right time
- Making data secure and dealing with compliance issues
- Handling data purging and storage problems

This question is asked to take a sneak peek into our depth of knowledge when it comes to Microsoft Excell sheets. We need to highlight uses, advantages and capabilities of pivot table here.

Uses of the Pivot table include:

- Pivot tables are one of the powerful and key features of Excel sheet
- They are leveraged to see comparisons, patterns, and trends in your data
- They are employed to be able to view and summarize entirety of large datasets in a simple manner
- They help with quick creation of reports through simple drag-and-drop operations

Here we need to list down all possible ways to filter the data in Excel.

- As per management requirement
- Period wise filter - week wise, month and year wise
- Current period comparison with any period (past/future) - this quarter’s performance Vs Last quarter’s performance
- Filter for error summary - for example there’s abnormal increase in raw material wastage, then we can filter for error summary and highlight it in the report for management
- Filter for performance - for example while analyzing company’s growth

Awareness and knowledge about data security and measures taken to ensure that are equally important for data analysts. This question is intended to check for that aspect.

Data Security safeguards digital data from unwanted access, corruption, or theft. Data security is critical to public and private sector organizations because there’s legal and moral obligation to protect users and a reputational risk of a data breach.

Protecting the data from internal or external corruption and illegal access helps to protect an organization from reputational harm, financial loss, consumer trust degradation, and brand erosion.

Through this answer we need to convey along with formal definitions of primary key and foreign key, our practical knowledge about them when we speak of SQL.

- A PRIMARY KEY is a column or a group of columns in a table that uniquely identifies the rows of data in that table.
- A FOREIGN KEY is a column or group of columns in one table, that refers to the PRIMARY KEY in another table. It maintains referential integrity in the database.
- Table with the FOREIGN KEY is called the child table, and a table with a PRIMARY KEY is called a reference or parent table.

Here interviewer would be happy to listen if we explain the differences through examples.

Data blending allows a combination of data from different data sources to be linked. Whereas, Data Joining works only with data from one and the same source.

For example: If the data is from an Excel sheet and a SQL database, then Data Blending is the only option to combine the two types of data. However if the data is from two excel sheets, you can use either data blending or data joining.

Data blending is also the only choice available when ‘joining’ the tables is impractical. This impracticality occurs when the dataset is humongous. When joins might create duplicate data or when using databases such as Salesforce and Cubes which do not support joins.

At times we might ignore theory of statistics and algebra involved during data analysis process. Through this answer we need to demonstrate how wellacquainted are we in terms of fundamentals of statistics.

**Eigenvectors:**** **Eigenvectors are basically used to understand linear transformations. These are calculated for a correlation or a covariance matrix.

For definition purposes, you can say that Eigenvectors are the directions along which a specific linear transformation acts either by flipping, compressing or stretching.

**Eigenvalues:**** **Eigenvalues can be referred to as the strength of the transformation or the factor by which the compression occurs in the direction of eigenvectors.

Let A be a n × n matrix.

- An eigenvector of A is a nonzero vector v in Rn such that Av = λ v , for some scalar λ .
- An eigenvalue of A is a scalar λ such that the equation Av = λ v has a nontrivial solution.

If Av = λ v for v

≠≠

0, we say that λ is the eigenvalue for v ,and that v is an eigenvector for λ .

Here along with definition and understanding of clustering, let’s explain why is it done, it’s objective.

- Hierarchical clustering or hierarchical cluster analysis, is an algorithm that groups similar objects into common groups called clusters.
- The goal is to create a set of clusters, where each cluster is different from the other and, individually, they contain similar entities.

- A good model should be intuitive, insightful and self-explanatory
- It should be derived from the correct data points and sources
- The model developed should be able to easily consumed by the clients for actionable and profitable results
- A good model should easily adapt to changes according to business requirements
- If the data gets updated, the model should be able to scale according to the new data
- A good model provides predictable performance
- A good data model will display minimal redundancy with regard to repeated entry types, data redundancy, and many-to-many relationships

WHERE clause | HAVING clause |
---|---|

It works on row data | It works on aggregated data |

In this clause, the filter occurs before any groupings are made | This is used to filter values from a group |

SELECT column1, column2,.. FROM table_name WHERE condition; | SELECT column_name(s) FROM table_name WHERE condition GROUP BY column_name(s) HAVING condition ORDER BY column_name(s) |

Sampling is a statistical method to select a subset of data from an entire dataset (population) to estimate the characteristics of the whole population.

For example, if you are researching the opinions of students in your university, you could survey a sample of 100 students. In statistics, sampling allows you to test a hypothesis about the characteristics of a population.

Different types of sampling techniques:

- Simple random sampling
- Systematic sampling
- Cluster sampling
- Stratified sampling
- Judgmental or purposive sampling

This tricky question is asked to check whether you see other side of things and whether you’re aware of any demerits of analytics.

Ensure that you talk about disadvantages in such a way that you’re aware of them and can take care of them. Answer shouldn’t sound like you’re throwing lot of complaints and don’t make unjustified claims.

Compared to N number of advantages that data analytics offers, there are a very few disadvantages or demerits.

- There’s a possibility of data analytics leading to a breach in customer privacy and thereby their information such as transactions, subscriptions and purchases etc.
- We should note that some of the tools used for data analytics are bit complex and might require prior training to enable their usage
- At times selection of right analytics tool can get tricky as it takes a lot of skills and expertise to select the right tool

Statistical analysis is a scientific tool that helps collect and analyze large amounts of data to identify common patterns and trends to convert them into meaningful information. In simple words, statistical analysis is a data analysis tool that helps draw meaningful conclusions from raw and unstructured data.

Statistical methods used in data analysis:

- Mean
- Standard Deviation
- Regression
- Variance
- Sample size
- Descriptive and inferential statistics

Use CONDITIONAL formatting to highlight duplicate values. Alternatively, use the COUNTIF function as shown below. For example, values are stored in cells D4:D7.

=COUNTIF(D4:D7,D4)

Apply filter on the column wherein you applied the COUNTIF function and select values greater than 1.

#Load the required libraries import pandas as pd import numpy as np import seaborn as sns #Load the data df = pd.read_csv('titanic.csv') #View the data df.head()

The df.info() function will give us the basic information about the dataset.

#Basic information df.info() #Describe the data df.describe()

Using this function, you can see the number of null values, data types, and memory usage as shown in the above outputs along with descriptive statistics.

You can find the number of unique values in the particular column using the unique() function in python.

#unique values df['Pclass'].unique() df['Survived'].unique() df['Sex'].unique()

array([3, 1, 2], dtype=int64) array([0, 1], dtype=int64) array(['male', 'female'], dtype=object)

The unique() function has returned the unique values which are present in the data

When it comes to data analysis more often than not we talk about EDA. This question is thrown to see our in-depth knowledge in data analysis, as CDA is lesser known than EDA.

Confirmatory Data Analysis i.e. CDA, is the process that involves evaluation of your evidence using some of the traditional statistical tools such as significance, inference, and confidence.

Confirmatory Data Analysis involves various steps including: testing hypotheses, producing estimates with a specified level of precision, RA (regression analysis), and variance analysis.

Different steps involved in CDA process include:

- Defining each and every individual constructs.
- Overall measurement model theory development
- Designing a study with the intent to produce the empirical results.
- Assessing the measurement model validity.

This question is intended to test your knowledge on computational linguistics and probability. Along with a formal definition, it would be advisable to explain it with the help of an example to showcase your knowledge about it.

An N-Gram is a connected sequence of N items in a given text or speech. Precisely, an N-gram is a probabilistic language model used to predict the next item in a particular sequence as in N-1.

This question is asked to get your idea about multi-source data analysis.

- We should start with explanation of multi-source data. Then go on about how would you tackle multi-source problems.
- Multi-source data by characteristics is dynamic, heterogeneous, complex, distributed and very large.
- When it comes multi-source problems, each source might contain bad or dirty data and the data in the sources might be represented differently, contradict or overlap.
- To tackle multi-source problems, you need to identify similar data records, and combine them into one record that will contain all the useful attributes minus the redundancy.

Missing patterns include:

- Missing at random
- Unobserved input variable missing
- Missing due to some particular missing value

This question assesses your practical knowledge on Excel sheet. Hence we need to explain with appropriate steps required to meet the given objective.

Yes, it is possible to highlight cells with negative values in Microsoft Excel. Steps to do that are as follows:

- In the Excel menu, go to the
option and click on*Home*.*Conditional Formatting* - Within the
option, click on*Highlight Cells Rules*.*Less Than* - In the dialog box that opens, select a value below which you want to highlight cells.
- You can choose the highlight color in the dropdown menu.
- Hit
.*OK*

Objective of the interviewer here would be to assess your knowledge on data structures by having a discussion about hash tables. Here explanation with diagrammatic representation would be advisable.

In a hash table, a collision occurs when two keys are hashed to the same index. Since every slot in a hash table is supposed to store a single element, collisions are a problem.

Chaining is a technique used for avoiding collisions in hash tables.

The hash table is an array of linked lists as per chaining approach i.e., each index has its own linked list. All key-value pairs mapping to the same index will be stored in the linked list of that index.

Here we need to talk about statistical model overfitting by making use of graphical representation. Also better to explain model overfitting prevention techniques in detail to demonstrate our expertise with statistical modelling.

Overfitting is a scenario, or rather a modeling error in statistics that occurs when a function is too closely aligned to a limited set of data points.

Some of the techniques used to prevent overfitting are:

- Early stopping: It helps to stop the training when parameter updates no longer begin to yield improves on a validation set
- Cross validation: A statistical method of evaluating and comparing learning algorithms by dividing data into two segments, i.e. one used to learn or train a model and the other one that’s used to validate the model
- Data augmentation: It is a set of methods or techniques that are used to increase the amount of data by adding slightly modified copies of existing data or newly created synthetic data from already existing data
- Ensembling: Usage of multiple learning algorithms with intent to obtain better predictive performance, than that could be obtained from any of the constituent learning algorithms alone.

Here again, interviewer wants to see our practical knowledge hence we need explain skewness by taking some real world examples.

Skewness measures the lack of symmetry in data distribution.

A left-skewed distribution is one where a left tail is longer than that of the right tail. It is important to note that in this case:

mean < median < mode

Similarly, the right-skewed distribution is the one where the right tail is longer than the left one. But here:

mean > median > mode

This is a very tricky question, in the sense for data visualization we would have used tableau as a tool. Interviewer wants to see how well we’ve used the tool and are aware of its cons.

Some of the ways to improve the performance of tableau are:

- Use an Extract to make workbooks run faster
- Reduce the scope of data to decrease the volume of data
- Reduce the number of marks on the view to avoid information overload
- Hide unused fields
- Use Context filters
- Use indexing in tables and use the same fields for filtering

This question is asked to assess our knowledge on Tableau. We need to explain the differences through practical knowledge rather than just theoretical definitions.

A heatmap is a two dimensional representation of information with the help of colors. Heatmaps can help the user visualize simple or complex information.

Treemaps are ideal for displaying large amounts of hierarchically structured (tree-structured) data. The space in visualization is split into rectangles that are sized and ordered by a quantitative variable.

This is mostly straight forward question asked to validate our depth in statistics as a data analyst. We should always include point about its range during our explanation.

P value for a statistical test helps to decide whether to accept or reject the null hypothesis.

`0 <= p_value <= 1`

P-value range is between [0,1][0,1] The threshold for P-value is set to be 0.050.05. When the value is below 0.050.05, the null-hypothesis is rejected.

This answer will give a view about your command over Python as a programming language which is must as a data analyst.

Python is limited in a few ways, including:

- Memory consumption - Python is not great for memory intensive applications
- Mobile development - Though Python is great for desktop and server applications, it is weaker for mobile development
- Speed - Studies have shown that Python is slower than object oriented languages like C++ and Java. However, there are options to make Python faster, like a custom runtime.
- Version V2 vs V3 - Python 2 and Python 3 are incompatible

This answer will give a view about your fluency over SQL as a query language which is absolutely necessary as a data analyst. Constraints in SQL are used to specify rules for data in the table.

- NOT NULL: Ensures that a column cannot have a NULL value
- UNIQUE: Ensures that all values in a column are different. It maintains the uniqueness of a column in a table. More than one UNIQUE column can be used in a table.
- PRIMERY KEY: A combination of NOT NULL and UNIQUE, and uniquely identifies each row in the table thereby ensuring faster access to the table
- FOREIGN KEY: This constraint creates a link between two tables by one specific column of both tables. This is used to uniquely identify row/record in another table
- CHECK: This constraint controls the values in the associated column and ensures that all values in a column satisfy a specific condition
- DEFAULT: Each column must contain a value ( including a NULL) .This constraint sets a default value for a column when no value is specified
- INDEX: Used to create and retrieve the data from the database very quickly.

Here rather than hurrying, we need to give ourselves some time and think about statistical methods that can be applied to be able to tell whether coin is biased as this question is about probability theory and statistical concepts.

To answer this question let's say X is the number of heads and let's assume that the coin is not biased. Since each individual flip is a Bernoulli random variable, we can assume it has a probability of showing up heads as p = 0.5, so this will lead to the following expected number of heads:

A Data Analyst is the one who interprets data and turns it into information which can offer ways to improve a business, thus affecting business decisions. Data Analyst gathers information from various sources and interprets patterns and trends. Roles and responsibilities of data analyst include to import, inspect, clean, transform, validate, collect, process or interpret collections of data to stay connected to customers, drive innovation and product development.

According to glassdoor.com, the average base pay for a Data Analyst is $72,306 per year. A few of these companies are Google, Amazon, Cisco Systems, Qualcomm, IBM, JPMorgan Chase & Co and Meta etc.

If you are determined to ace your next interview as a Data Analyst, these Data Analyst interview questions and answers will fast-track your career. To relieve you of the worry and burden of preparation for your upcoming interviews, we have compiled the above list of interview questions for Data Analyst with answers prepared by industry experts. Being well versed with these commonly asked Data Analyst or Data Analytics interview questions will be your very first step towards a promising career as an Data Analyst/Analytics Consultant.

You can opt for various job profiles after mastering data analyst interview questions, including SQL interview questions for data analyst, excel interview questions for data analyst and data analytics interview questions. A few are listed below:

- Data Analyst
- Data Analytics Consultant
- Data Visualization Engineer
- Business Intelligence Consultant
- Senior Data Analyst
- Business Intelligence Analyst
- Analytics Manager
- Quantitative Analyst
- Marketing Analyst
- Operations Analyst

Crack your Data Analyst interview with ease and confidence!

Submitted questions and answers are subjecct to review and editing,and may or may not be selected for posting, at the sole discretion of Knowledgehut.

Get a 1:1 Mentorship call with our Career Advisor

- Get Job-Ready Digital Skills
- Experience Outcome-Based Immersive Learning
- Get Trained by a Stellar Pool of Industry Experts
- Best-In-Class Industry-Vetted Curriculum

By tapping submit, you agree to KnowledgeHut Privacy Policy and Terms & Conditions