Big Data Interview Questions and Answers

Big data refers to data sets that are very large, very complex, and difficult to process conventionally. Big data engineers develop, maintain, test, analyze, and evaluate this data for stakeholders to use and make business decisions. Needless to say, skilled big data engineers draw a huge salary for doing this. We have listed the most frequently asked big data interview questions to help you grab any upcoming exciting opportunities in the field of big data. This comprehensive list of questions with answers is meant for freshers as well as intermediate and expert data engineers looking to ace their next interview. Make sure you go through questions on topics like different big data processing tools and platforms, big data development, big data models, the ELT process, common big data operations, big data visualization, model evaluation and optimization, big data integration, graph analytics, data governance, the big data maturity model and more for better clarity. With this extensive list of Big Data Interview Questions, you will be well-prepared for your next interview. If you are looking to advance your career in big data, use this guide as a handy resource for your next interview.


Intermediate

Big Data has the potential to significantly transform any business. It has patterns, trends, and insights hidden in it. These insights, when discovered, help a business formulate its current and future strategies.

It helps to reduce unnecessary expenses and losses and to increase efficiency.

By exploiting Big Data, you can understand the market in general and your customers in particular in a very personalized way and accordingly customize your offerings. The chances of conversion and adoption increase manyfold.

The use of Big Data reduces marketing effort and budget and, in turn, increases revenue. It gives businesses an added advantage and an extra edge over their competitors.

If you do not harness the potential of Big Data, you may be thrown out of the market.

As Big Data offers an extra competitive edge over competitors, a business can decide to tap its potential as per its requirements and streamline its various business activities in line with its objectives.

So the approaches to deal with Big Data are to be determined as per your business requirements and the available budgetary provisions.

First, you have to identify the business concerns you have right now: What questions do you want your data to answer? What are your business objectives, and how do you want to achieve them?

As far as the approaches regarding Big Data processing are concerned, we can do it in two ways:

  1. Batch processing
  2. Stream processing

As per your business requirements, you can process the Big Data in batches daily or after a certain duration. If your business demands it, you can process it in a streaming fashion, every hour or even every 15 seconds or so.

It all depends on your business objectives and the strategies you adopt.
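
As a rough sketch of the difference, the snippet below uses PySpark (one common choice, not mandated above) to process the same hypothetical source once as a batch job and continuously as a stream with a 15-second trigger; the path and trigger interval are illustrative assumptions.

```python
# A minimal sketch contrasting batch and stream processing in PySpark.
# The input path "/data/sales" and the 15-second trigger are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch processing: read everything accumulated so far and process it once,
# e.g. as a daily job.
batch_df = spark.read.json("/data/sales")          # hypothetical path
batch_df.groupBy("region").count().show()

# Stream processing: continuously process new records as they arrive,
# triggering a micro-batch every 15 seconds.
stream_df = (spark.readStream
             .schema(batch_df.schema)              # streaming file sources need a schema
             .json("/data/sales"))
query = (stream_df.groupBy("region").count()
         .writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="15 seconds")
         .start())
query.awaitTermination()                           # runs until stopped
```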

There are various platforms available for Big Data. Some of these are open source and the others are license based.

In open source, we have Hadoop as the biggest Big Data platform. Another alternative is HPCC, which stands for High-Performance Computing Cluster.

In a licensed category, we have Big Data platform offerings from Cloudera(CDH), Hortonworks(HDP), MapR(MDP), etc. (Hortonworks is now merged with Cloudera.)

  • For Stream processing, we have tools like - Storm.
  • The Big Data platforms landscape can be better understood if we consider it usage-wise.
  • For example, in the data storage and management category, we have big players like Cassandra, MongoDB, etc.
  • In data cleaning category we have tools like OpenRefine, DataCleaner, etc.
  • In data mining category we have IBM SPSS, RapidMiner, Teradata, etc.
  • In the data visualization category, the tools are Tableau, SAS, Spark, Chartio, etc.

Features and specialities of these Big Data platforms/tools are as follows:

1) Hadoop: 

  • Open Source
  • Highly Scalable
  • Runs on Commodity Hardware
  • Has a good ecosystem

2) HPCC: 

  • Open Source
  • Good Alternative to Hadoop
  • Parallelism at Data, Pipeline and System Level 
  • High-Performance Online Query Applications

3) Storm: 

  • Open Source                
  • Distributed Stream Processing               
  • Log Processing            
  • Real-Time Analytics

4) CDH: 

  • Licence based (Limited Free Version available)
  • Cloudera Manager for easy administration
  • Easy implementation
  • More Secure

5) HDP: 

  • Licence based (Limited Free Version available)             
  • Dashboard with Ambari UI            
  • Data Analytics Studio          
  • HDP Sandbox available for VirtualBox, VMware, Docker

6) MapR: 

  • Licence based (Limited Free Version available)               
  • On-premise and cloud support           
  • Features AI and ML            
  • Open APIs

7) Cassandra: 

  • Open Source                     
  • NoSQL Database                 
  • Log-Structured Storage            
  • Includes Cassandra Query Language (CQL)

8) MongoDB: 

  • Licence based (also Open Source)           
  • NoSQL Database      
  • Document Oriented           
  • Aggregation Pipeline etc.

All projects that involve a lot of data crunching (mostly unstructured) are good candidates for Big Data projects. Thus Telecom, Banking, Healthcare, Pharma, e-commerce, Retail, Energy, Transportation, etc. are the major sectors playing big with Big Data. Apart from these, any business or sector that deals with a lot of data is a good candidate for implementing Big Data projects. Even manufacturing companies can utilize Big Data for product improvement, quality improvement, inventory management, reducing expenses, improving operations, predicting equipment failures, etc. Big Data is being used in the educational field as well. The education industry generates a lot of data related to students, courses, faculties, results, and so on. If this data is properly analyzed and studied, it can provide many useful insights that can be used to improve the operational efficiency and the overall working of educational entities.

By harnessing the potential of Big Data in the Educational field, we can expect the following benefits:

  1. Customized Contents
  2. Dynamic Learning Programs
  3. Enhanced Grading system
  4. Flexible Course Materials
  5. Success Prediction
  6. Better Career Options

Healthcare is one of the biggest domains that makes use of Big Data. Better treatment can be given to patients, as patient-related data gives us the necessary details about the patient's history. It helps you to perform only the required tests, so the costs related to diagnosis get reduced. Outbreaks of epidemics can be better predicted, and hence the necessary steps for prevention can be taken early. Some diseases can be prevented, or their severity reduced, by taking preventive steps and early medication.

Following are the observed benefits of using Big Data in Healthcare:

  1. Better Prediction
  2. Enhanced Treatment
  3. Only the necessary tests performed
  4. Reduced Costs
  5. Increased Care

Another area/project which is suitable for the implementation of Big Data is 'Welfare Schemes'. It assists in making informed decisions about various welfare schemes. We can identify those areas of concern that need immediate attention. National challenges like unemployment, health concerns, depletion of energy resources, exploration of new avenues for growth, etc. can be better understood and accordingly dealt with. Cyber security is another area where we can apply Big Data for the detection of security loopholes, identifying cyber crimes, illegal online activities or transactions, etc. Not only can we detect such activities, but we can also predict them in advance and have better control over such fraudulent activities.

Some of the benefits of using Big Data in Media and Entertainment Industry can be as given below:

  1. On-demand content delivery.
  2. Predicting the preferences and interests of the audience.
  3. Insights from reviews of the customers.
  4. Targeted Advertisements etc.

The projects related to Weather Forecasting, Transportation, Retail, Logistics, etc. can also be good players for Big Data.

Many sectors are harnessing the power of Big Data. However, the top 3 domains, as per market understanding, that are utilizing the power of Big Data are:

  1. Financial institutions
  2. Manufacturing
  3. Healthcare

These are followed by energy and utilities, media and entertainment, government, logistics, telecom and many more. How Big Data offers value addition to different enterprises can be seen as follows.

Financial Institutions:

Big Data insights have the potential to drive innovation in the financial sector.

There are certain challenges that financial institutions have to deal with. Some of these challenges are as follows:

  1. Fraudulent transactions
  2. Trade visibility
  3. Archival of audit trails
  4. Card-fraud detection
  5. Reporting of enterprise credit-risk
  6. Transformation of customer data
  7. Trade analytics
  8. Regulations and compliance analytics etc.

Big Data can provide better solutions to deal with such issues. There are Big Data solution providers that cater specifically to the financial sector. Some of the Big Players are:

Panopticon Software, Nice Actimize, Streambase Systems, Quartet FS, etc.

Manufacturing:

The manufacturing industry is another big user of Big Data. In manufacturing, a lot of data is generated continuously, and there are enormous benefits to be gained by utilizing Big Data in this sector.

Some of the major use cases are:

  1. Supply chain planning
  2. Defects tracking
  3. Product quality improvement
  4. Output forecasting
  5. Simulation and Testing of new methods
  6. Increasing energy efficiency
  7. Enhanced Manufacturing
  8. Tracking daily production
  9. Predicting equipment failure etc.

Healthcare:

The volume of data that is being generated in healthcare systems is very large. Previously due to a lack of consolidated and standardized data, the healthcare sector was not able to process and analyse this data. Now, by leveraging Big Data, the Healthcare sector is gaining various benefits such as Better disease Prediction, Enhanced Treatment, Reduced Costs, Increased Patients Care, etc.

Some of the major Big Data Solution Providers in the Healthcare industry are:

  1. Humedica
  2. Recombinant Data
  3. Cerner
  4. Explorys etc.

There are various frameworks for Big Data processing.

One of the most popular is MapReduce. It consists mainly of two phases, the Map phase and the Reduce phase, with an intermediate Shuffle phase between them. A given job is divided into two types of tasks:

  • Map tasks
  • Reduce tasks.

The input is divided into splits of fixed size. Each input split is then given to a mapper. The mappers run in parallel, so the execution time is drastically reduced and we get the output very fast.

The input to the mapper is a key-value pair. The output of mappers is another key-value pair. This intermediate result is then shuffled and given to reducers. The output of reducers is your desired output.
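
To make the flow concrete, here is a toy, single-process Python simulation of the Map, Shuffle and Reduce phases using word count; real MapReduce runs these tasks distributed across a cluster, so this only illustrates the data flow.

```python
# A minimal pure-Python simulation of the MapReduce flow described above:
# map -> shuffle (group by key) -> reduce, illustrated with word count.
from collections import defaultdict

def map_phase(split):
    """Mapper: takes one input split and emits (word, 1) key-value pairs."""
    for word in split.split():
        yield (word.lower(), 1)

def shuffle_phase(mapped_pairs):
    """Shuffle: group all intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: aggregate the values for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# The input is divided into fixed-size splits; each split goes to a mapper.
splits = ["big data is big", "data is everywhere"]
mapped = [pair for split in splits for pair in map_phase(split)]
print(reduce_phase(shuffle_phase(mapped)))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```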

There are many Big Data tools available in the market today. Some provide both storage and processing facilities, some provide only storage along with various APIs for processing, and some provide analytical facilities. Out of these, Hadoop, Spark, HPCC, CDH, etc. are the most widely used Big Data tools.

  • Hadoop is an open-source Big Data platform from the Apache Foundation. Its beauty is that it can run on commodity hardware. Spark is another tool from the Apache Foundation. It adds stream processing capability and also offers in-memory data processing, so it is much faster.
  • HPCC stands for High-Performance Computing Cluster. It is a supercomputing platform and is highly scalable.
  • CDH is Cloudera Hadoop Distribution. It is an enterprise-level/class Big Data tool.

Hadoop is probably the very first open-source Big Data platform. It is highly scalable and runs on commodity hardware. It includes HDFS which is Hadoop Distributed File System. It can store a very large amount of unstructured data in a distributed fashion.

Hadoop also includes MapReduce which is a data processing framework. It processes data in a highly parallel fashion.

For a large quantity of data, the processing time is drastically reduced. There are many APIs and other tools available that can be integrated with Hadoop; these further extend its usefulness, enhance its capability and make it more suitable for Big Data.

The Hadoop framework lets the user write and test distributed systems quickly.

It is fault-tolerant and automatically distributes the data across the cluster of machines. It makes use of massive parallelism. To provide high availability and fault-tolerance, Hadoop does not depend on the underlying hardware.

At the application layer itself, it provides such support. We can add or remove nodes as per our requirements. You are not required to make any changes to the application.

Apart from being open-source, the other big advantage of Hadoop is its compatibility with almost all platforms. The amount of data being generated is increasing by a very large quantity day by day, so the need for data storage and processing will increase accordingly. The best part of Hadoop is that by adding more commodity machines you can increase the storage and processing power of Hadoop without any other investment in software or other tools.

Thus, just by adding more machines, we can accommodate the ever-increasing volume of data. Due to the fault-tolerance feature of Hadoop, both the data and the application processing are protected against any hardware failure.

If a particular node goes down, the jobs are redirected automatically to other nodes. This ensures that 'Distributed Computing' does not fail.

Hadoop automatically stores multiple copies (by default 3) of the data.

Hadoop provides more flexibility in terms of data capture and storage. You can capture any data, in any format, from any source into Hadoop and store it as it is without any kind of preprocessing, whereas in traditional systems, you are required to pre-process the data before storage.

So, in Hadoop, you can store any data and then later process it as per your requirements.

The ecosystem around Hadoop is very strong. There are so many tools available for different needs. We have tools for automatic data extraction, storage, transformation, processing, analysis etc.

There are a variety of cloud options available for Hadoop. So, you have a choice to use on-premise as well as cloud-based features/tools as per your requirements.

Thus, by considering all these features that Hadoop provides and the robustness, cost-effectiveness it offers and also by taking into consideration the nature of Big Data, we can say that Hadoop is more suitable for Big Data.

There are various things to be considered very carefully before going for Big Data deployment. First, the business objectives and the requirements for Big Data solutions have to be well understood and written down. What kind of insights are needed out of the Big Data needs to be clearly defined.

Then you have to find out the various sources of data collection. You have to decide the data extraction strategy. Find out the various architectures and tools for Big Data deployment. Compare and decide the best fit depending on your requirements and the drafted policy.

You also have to take into consideration the data ingestion policy, storage, and processing requirements for the Big Data deployment. You can either manually deploy the solution or you can choose to automate the process by using the automated deployment tools.

The power of Big Data can be exploited in various domains and at various levels. You can utilize it to have an overall view of the customer, Fraud Detection and prevention, Intelligent Security, Personalized Recommendations, Operational Efficiency, Customized offerings, etc.

Let's consider the use case of the customer view. A business can build a dashboard that can show the overall view of the customer such as all the demographic details of the customer, customer habits, browsing patterns, interests, probability of purchasing a particular item, liking a product, recommending it to someone else etc.

For example, a washing machine sales company may have a Big Data system that includes such a dashboard. The company may collect all the data about probable customers from all possible sources. These may be internal sources or external sources such as social media channels. The company can collect probable customers' interactions with these channels and decide whether a customer is likely to buy a product or not.

The dashboard can be made capable enough to give all these details about the customer and can also predict the likelihood of purchasing a product or losing the customer to its competitor.

Thus Big Data assists companies in identifying potential customers and offering them personalized offerings based on their preferences, social media chats, browsing patterns, etc.

Making a business decision involves a lot of factors. One wrong decision can ruin the whole business. Big Data contains a lot of information that, when used wisely, can benefit a business immensely.

It can transform any business that is willing to exploit its potential.

It contains patterns, trends, and value hidden in it. This information when discovered can help any business to make its decision based on actual data and not just human instinct. It assists in formulating various strategies about marketing, production or inventory management.

It can increase efficiency at all levels and drastically reduce the overall costs.

A business that is not harnessing the potential of Big Data may miss the opportunity and lag behind its competitor. It may make some incorrect decisions by not considering market trends and customer concerns. As Big Data can provide valuable feedback and market sentiments, it can immensely help a Business make wise, correct, and timely decisions by providing great business insights.

The choice of language for a particular Big Data project depends on the kind of solution we want to develop. For example, if we want to do data manipulation, certain languages are good at the manipulation of data.

If we are looking for Big Data Analytics, we see another set of languages that should be preferred.

As far as R and Python are concerned, both of these languages are preferred choices for Big Data. When we are looking into the visualization aspect of Big Data, R language is preferred as it is rich in tools and libraries related to graphics capabilities.

When we are into Big Data development, Model building, and testing, we choose Python.

R is more popular among statisticians, whereas developers prefer Python.

Next, we have Java as a popular language in the Big Data environment, as the most preferred Big Data platform, Hadoop, is itself written in Java. Other languages are also popular, such as Scala, SAS, and MATLAB.

There is also a community of Big Data people who prefer to use both R and Python. So we see that there are ways we can use a combination of both of these languages such as PypeR, PyRserve, rPython, rJython, PythonInR etc.

Thus, it is up to you to decide which one or a combination will be the best choice for your Big Data project.

Yes, to some extent the day-to-day business operations will get affected during the initial phases of Big Data deployment. We are required to integrate the various data sources. The policies regarding data collection, extraction, storage as well as processing are bound to change.

The various data points are bound to have different formats, architectures, tools and technologies, protocols of data transfer, etc. So deciding to capture and use Big Data for your business will involve integrating these various data points and making some changes to the formats, usage, security, etc., which will have some impact on the overall day-to-day operations of the business.

There are various benefits we get by implementing Big Data such as:

  1. Process Automation
  2. In-depth Insights
  3. Better, Faster  and Informed Decision Making
  4. Improved Operations and Efficiency

From making use of sensors to track the performance of machines to optimising operations, and from recruiting top talent to measuring employees' performance, Big Data has the potential to bring an improvement in the overall efficiency and business operations at all levels.

Now, data is not just a matter for the IT department but is becoming an essential element of every department in an enterprise.

Thus, we can conclude that the adoption of Big Data would have an impact on the Day to Day operations of the business.

It has the potential to transform every element of a business. When matched with the business objectives, it will have value added to the whole enterprise. The customers will get enhanced product offerings and improved service. It will improve the overall operational efficiency and let you have an edge over your competitors.

Data science is a broad spectrum of activities involving analysis of Big Data, finding patterns, trends in data, interpreting statistical terms and predicting future trends. Big Data is just one part of Data Science. Though Data Science is a broad term and very important in the overall Business operations, it is nothing without Big Data.

All the activities we perform in Data Science are based on Big Data. Thus Big Data and Data Science are interrelated and can not be seen in isolation.

Big Data insight means the discovery of information, patterns, and trends that were hidden in Big Data. Big Data insights assist in making critical business decisions and give us the direction to formulate future strategies. They also give you a competitive edge over your competitors and help you understand your customers better so that you can give them personalized offerings. Neglecting Big Data insights may leave you lagging behind the market and throw you out of the competition. So, to remain competitive in the market, businesses must harness the inherent potential of Big Data.

Businesses can take fast and informed decisions based on the insights obtained from Big Data. It is an outcome of comprehensive data processing and analytics.

To obtain insight from Big Data, we are required to process and analyze lots of data. We try to find out various patterns such as market trends, customer spending habits, financial understandings, etc. which may assist businesses to formulate their business strategies accordingly.

Big Data insights give us the answers to our various questions. These questions are related to:

  1. market trends
  2. customer habits
  3. customer preferences
  4. customer likings
  5. what may work and what may not
  6. customised offerings
  7. personalised recommendations
  8. quality aspects
  9. efficiency improvement
  10. cost reduction
  11. revenue growth
  12. where to spend and where not,
  13. hiring top talent etc.

These are some of the questions to which only the data can give better answers.

There are many tools available that assist you in getting the required insights out of Big Data.

Nowadays Big Data has become a business norm. One can not continue in the business and remain competitive by neglecting Big Data. Big Data offers you the insights which otherwise you may not be able to discover.

These insights help you to decide your inventory management, production, marketing, service offerings, etc. which are directly related to business revenue. Big Data helps you to increase the efficiency at every stage of a business and thus, in turn, reduces the overall expenses making you more competitive and profitable.

To increase business revenue, you have various options such as:

  1. Increase sales
  2. Reduce costs
  3. Increase efficiency etc.

Increasing sales is not an easy task. It depends on the market demands and customer preferences. How will you come to know about market demands and what the customer wants? You can get proper answers to such questions by analyzing Big Data. Big Data contains valuable information and insights that need to be discovered and utilized as per your requirements. By analyzing Big Data, you can uncover various patterns, trends, customer insights, etc. Such insights will assist you in formulating your business strategies accordingly, increase the chances of customer conversion, and ultimately increase your revenues.

Big Data also helps you to reduce costs by having proper inventory management, streamlining operations and increasing efficiency at all levels. You can consolidate the data from various departments and a variety of sources to collectively analyze it and get the proper answers to your questions and the various business concerns.

Thus by harnessing the inherent potential of Big Data, you can increase efficiency, reduce costs and in turn increase revenues and the overall business growth.

Special emphasis needs to be given when building Big Data models, because Big Data itself is less predictable compared to traditional kinds of data. It is a somewhat complex process as it involves reorganizing and rearranging the business data by the business processes.

To support the business objectives, the data models need to be designed to have logical inter-relationships among the various business data.

Then these logical designs need to be translated into the corresponding physical models.

Since Big Data is significantly different from traditional data, the old data modelling techniques no longer apply to it. So you are required to apply different approaches for modelling Big Data.

The data interfaces should be designed to incorporate elasticity and openness due to the unpredictable nature of the Big Data to accommodate future changes.

Here the focus should not be on a schema but on designing a system. We should also take into consideration the various Big Data modeling tools out there. Not all the Big Data present there should be considered for modeling. Only the data appropriate to your business concerns should be selected to build models around.

ETL stands for Extract-Transform-Load. Big Data is mostly unstructured, comes in very large quantities, and accumulates at a very fast pace.

So, at the time of extraction, it becomes very difficult to transform it because of its sheer volume, velocity, and variety. Also, we cannot afford to lose Big Data. So, it needs to be stored as it is and can then be transformed and analysed in the future as per the business requirements.

The process of extraction of Big Data involves the retrieval of data from various data sources.

The enterprises extract data for various reasons such as:

  1. For further processing
  2. Migrate it to some other data repository such as a data warehouse/data lake
  3. For analyzing etc.

Sometimes, while extracting the data, it may be desired to add some additional information to the data, depending on the business requirements. This additional information can be something like geolocation data, timestamps, etc. This is called data enrichment.

Sometimes it may be required to consolidate the data with some other data in the target datastore. These different processes are collectively known as ETL, i.e. Extract-Transform-Load.

In ETL, Extraction is the very first step.

The Big Data tools for data extraction assist in collecting the data from a variety of different data sources. The functionalities of these tools can be as mentioned below:

  1. Extract the data from various homogeneous/heterogeneous sources.
  2. Transform it to store in a proper format/structure for further processing and querying.
  3. Load the data in the target store such as data mart, an operational data store, or a data warehouse.

In ETL tools, it is usual for these three steps to be executed in parallel. As data extraction takes a long time, the transformation process starts on the already pulled data and prepares it for loading while extraction continues.

As the data becomes ready for loading into the target store, loading starts immediately, without waiting for the previous steps to complete.

ETL for Structured Data:

If the data under consideration is structured, then the extraction process is performed generally within the source system itself.

Following extraction strategies may be used:

  1. Full Extraction: In the full extraction method, the data is extracted completely from the source. Tracking changes is not required. The logic here is simpler, but the load on the system is greater.
  2. Incremental extraction: In the incremental extraction method, the changes occurring in the source data are tracked from the last successful data extraction. This way, you are not required to extract all the data every time a change occurs.

For this, a change table is created to track the changes. In some data warehouses, a special functionality known as 'CDC' (Change Data Capture) is built in.

The logic required for incremental data extraction is a little bit more complex but the load on the system is reduced.
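
Below is a minimal sketch of the idea, using an in-memory SQLite table and a hand-rolled "last extracted" watermark; the table, columns and dates are made up, and a real warehouse would typically rely on its built-in CDC functionality instead.

```python
# Full vs. incremental extraction, sketched with sqlite3 and a watermark column.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 120.0, "2024-01-01"), (2, 75.5, "2024-01-15"), (3, 60.0, "2024-02-03")],
)

last_extracted = "2024-01-10"   # watermark saved from the previous successful run

# Full extraction: pull everything, no change tracking needed.
full = conn.execute("SELECT * FROM orders").fetchall()

# Incremental extraction: pull only rows changed since the last successful run.
incremental = conn.execute(
    "SELECT * FROM orders WHERE updated_at > ?", (last_extracted,)
).fetchall()

print(len(full), len(incremental))   # 3 2
conn.close()
```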

ETL for Unstructured Data:

When the data under consideration is unstructured, a major part of the work goes into preparing the data so that it can be extracted. In most cases, such data is stored in data lakes until it is required to be extracted for some kind of processing, analysis or migration.

The data is cleaned up by removing the so-called 'noise' from it.

It is done in the following ways:

  1. Removing whitespaces/symbols
  2. Removing duplicate results
  3. Handling missing values.
  4. Removing outliers etc.
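
A small pandas sketch of these clean-up steps on a made-up dataset (column names and thresholds are illustrative assumptions):

```python
# Noise removal on a toy dataset: whitespace, duplicates, missing values, outliers.
import pandas as pd

df = pd.DataFrame({
    "name": ["  Alice ", "Bob", "Bob", None, "Eve"],
    "age":  [29, 34, 34, 41, 300],          # 300 is an implausible value
})

df["name"] = df["name"].str.strip()          # 1. remove whitespace/symbols
df = df.drop_duplicates()                    # 2. remove duplicate records
df = df.dropna(subset=["name"])              # 3. handle missing values (drop here)
df = df[df["age"].between(0, 120)]           # 4. remove implausible outliers

print(df)
```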

There are some challenges in the ETL process. When you are consolidating data from one system to the other system, you have to ensure that the combination is good/successful. A lot of strategic planning is required. The complexity of planning increases manyfold when the data under consideration is both structured and unstructured. The other challenges include maintaining the security of the data intact and complying with the various regulations.

Thus performing ETL on Big Data is a very important and sensitive process that is to be done with the utmost care and strategic planning.

There are numerous tools available for Big Data extraction. For example, Flume, Kafka, Nifi, Sqoop, Chukwa, Talend, Scriptella, Morphlines, etc. Apart from data extraction, these tools also assist in modification and formatting the data.

The Big Data extraction can be done in various modes:

  1. Batched
  2. Continuous
  3. Real-time
  4. Asynchronous

There are other issues that also need to be addressed. The source and destination systems may have different I/O formats, different protocols, scalability and security issues, etc. So data extraction and storage need to be handled accordingly.

Open source tools: Open source tools can be more suitable for budget-constrained users.

Such users are expected to have a sufficient knowledge base and the required supporting infrastructure in place. Some vendors do offer light or limited versions of their tools as open source.

  • Batch processing tools: Existing legacy data extraction tools combine/consolidate the data in batches. This is generally done in off-hours to have minimum impact on the working systems.

For on-premise, closed environments, a batch extraction seems to be a good approach.

  • Cloud-based tools: These are the new generation of data extraction tools. Here, the emphasis is on the real-time extraction of the data.

These tools offer the added advantage of data security and also take care of any data compliance issues, so an enterprise need not worry about these things.

'Talend Open Studio' is one of the good tools which offers data extraction as one of its features. It is one of the 'most powerful Data Integration' tools out there in the market.

  • It is a set of versatile open-source products that can be used in developing, testing, deploying as well as administering various data management applications and other integration projects.

'Scriptella' is an open-source ETL tool released under the Apache License. It has various features related to data extraction, transformation, loading, database migration, etc.

It can also execute JavaScript, SQL, Velocity and JEXL scripts. It also has interoperability with JDBC, LDAP, XML, and many other data sources. It is a very popular tool due to its ease of use and simplicity.

Another good open-source tool is 'KETL'. It is best suited for data warehousing and is built on an open, multi-threaded, Java-oriented, XML-based architecture. The major features of KETL are integration with security and data management tools, scalability across multiple servers, etc.

'Kettle' is the Pentaho Data Integrator. It is the default tool in the 'Pentaho' Business Intelligence Suite.

There are other tools also such as Jaspersoft ETL, Clover ETL, Apatar ETL, GeoKettle, Jedox, etc.

To query Big Data, there are various languages available. Some of these languages are either functional, dataflow, declarative, or imperative. Querying Big Data often involves certain challenges. For example:

  1. Unstructured data
  2. Latency
  3. Fault tolerance etc.
  • By 'unstructured data’ we mean that the data, as well as the various data sources, do not follow any particular format or protocol.
  • By 'latency’ we mean the time taken by certain processes such as Map-Reduce to produce the result.
  • By 'fault tolerance’ we mean the steps in the analysis that support partial failures, rolling back to previous results, etc.

To query Big Data, there are various tools available. You have to decide which one to use as per your infrastructural requirements. The following are some of the tools/languages to query the Big Data: HiveQL, Pig Latin, Scriptella, BigQuery, DB2 Big SQL, JAQL, etc.

Tools such as Flume and Pig are based on the concept of an explicit processing pipeline. The other approach is to translate SQL into an equivalent construct in Big Data.

For example, HiveQL, Drill, Impala, Dremel, etc. follow this approach.

It is always desirable from a user perspective to use the second approach based on SQL. It is easy to follow and widely known. The query optimization part is left for the tool/system to perform.
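
For illustration, here is a minimal Spark SQL sketch of this SQL-based approach; the sample data and table name are made up, and query optimization and distribution are left to the engine.

```python
# Querying data with plain SQL through Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-big-data").getOrCreate()

sales = spark.createDataFrame(
    [("US", 120.0), ("IN", 75.5), ("US", 60.0)],
    ["country", "amount"],
)
sales.createOrReplaceTempView("sales")

# The user writes familiar SQL; the engine handles distribution and optimization.
spark.sql("""
    SELECT country, SUM(amount) AS total
    FROM sales
    GROUP BY country
    ORDER BY total DESC
""").show()
```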

The major limitation of such query languages is that their built-in operators are very limited. The dataflow languages such as Flume and Pig are designed to incorporate user-specified operators.

Therefore such languages are easily extensible. The need to construct processing pipelines explicitly, however, is a major limitation of such languages.

'Presto' is a good example of a distributed SQL query engine that is also open source. It can run interactive analytical queries over various data stores.

One of the features of Presto which is worth mentioning is its ability to combine data from multiple stores by a single query. Thus it allows you to perform analytics across the entire organization.

Feature selection is a process of extracting only the required features from the given Big Data. Big Data may contain a lot of features that may not be needed at a particular time during processing, so we are required to select only the features in which we are interested and do further processing.

There are several methods for features selection:

  1. Filters method
  2. Wrappers method
  3. Embedded method

Filters Method:

In this method, the selection of features does not depend on the designated classifier. A variable ranking technique is used to order the variables for selection.

In the technique of variable ranking, we take into consideration the importance and usefulness of a feature for classification. In the filter method, we can apply the ranking method before classification to filter out the less relevant features.

Some of the examples of filters method are:

  1. Chi-Square Test
  2. Variance Threshold
  3. Information Gain etc.
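
As an illustration, a small scikit-learn sketch of the filter approach; the dataset and thresholds are chosen only for demonstration.

```python
# Filter methods: features are scored independently of any particular classifier.
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Variance Threshold: drop near-constant features.
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)

# Chi-Square Test: keep the 2 features most associated with the target.
X_chi2 = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

print(X.shape, X_var.shape, X_chi2.shape)   # fewer columns after each filter
```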

Wrappers method:

In the wrappers method, the algorithm for feature subset selection exists as a 'wrapper' around an algorithm known as the 'induction algorithm'.

The induction algorithm is treated as a 'black box'. It is used to produce a classifier that will then be used for classification.

Heavy computation is required to obtain the subset of features, which is considered a drawback of this technique.

Some of the examples of Wrappers Method are:

  1. Genetic Algorithms
  2. Recursive Feature Elimination
  3. Sequential Feature Selection
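
A small scikit-learn sketch of the wrapper approach using Recursive Feature Elimination, with logistic regression as the "black box" induction algorithm (the choices here are illustrative):

```python
# Wrapper method: RFE repeatedly fits the estimator and drops the weakest features.
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

estimator = LogisticRegression(max_iter=1000)      # the induction algorithm
selector = RFE(estimator, n_features_to_select=2)  # the wrapper around it
selector.fit(X, y)

print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # 1 = selected; higher ranks were eliminated earlier
```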

Embedded Method:

This method combines the efficiencies of the Filters method and the Wrappers method.

It is generally specific to a given learning machine. The selection of variables is usually done in the training process itself. What is learned by this method is the set of features that contributes most to the accuracy of the model.

Some of the examples of Embedded Method are:

  1. L1 Regularisation Technique (such as LASSO)
  2. Ridge Regression (also known as L2 Regularisation)
  3. Elastic Net etc.
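
A small scikit-learn sketch of the embedded approach using L1 (LASSO) regularisation, where the selection happens during training itself; the dataset and alpha value are illustrative.

```python
# Embedded method: L1 regularisation drives some coefficients to exactly zero,
# and the surviving features are the selected ones.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=0.5).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)

print(X.shape, selector.transform(X).shape)   # fewer columns after selection
```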

The process of feature selection simplifies machine learning models. So, it becomes easier to interpret them. It eliminates the burden of dimensionality. The generality of the model is enhanced by this technique. So, the overfitting problem gets reduced.

Thus, we get various benefits by using Feature Selection methods. Following are some of the obvious benefits:

  1. A better understanding of data
  2. Improved prediction performance
  3. Reduced computation time
  4. Reduced space etc.

Tools such as SAS, MATLAB and Weka also include methods/tools for feature selection.

Overfitting refers to a model that is fitted too tightly to the data. It is a modeling error that occurs when a modeling function fits a limited data set too closely. Here the model is made too complex in order to explain the peculiarities or individuality of the data under consideration.

The predictivity of such models gets reduced due to overfitting. The generalization ability of such models also gets affected. Such models generally fail when applied to outside data, i.e. data that was not part of the sample data.

There are several methodologies to avoid overfitting. These are:

  1. Cross-validation
  2. Early stopping
  3. Pruning
  4. Regularization etc.

Overfitting seems to be a common problem in the world of data science and machine learning. Such a model learns noise also along with the signal. It proves to be a poor fit when applied to new data sets.

A model should be considered overfitted when it performs well on the training set but poorly on the test set. Following is a description of the most widely used cross-validation method:

The cross-validation method is considered to be one of the powerful techniques for the prevention of overfitting. Here, the training data is used to obtain multiple small test sets. These small test sets should be used to tune the model.

In 'k-fold cross-validation' method, the data is partitioned into 'k' subsets. These subsets are called folds. The model is then trained on 'k-1' folds and the remaining fold is used as the test set. It is also called the 'holdout fold'.

This method allows us to keep the test set as an unseen dataset and lets us select the final model.
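
A minimal scikit-learn sketch of k-fold cross-validation as described above; the model and k=5 are illustrative choices.

```python
# k-fold cross-validation: train on k-1 folds, evaluate on the held-out fold, k times.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

print(scores)          # one accuracy score per fold
print(scores.mean())   # average performance across the k folds
```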

Missing values refer to the values that are not present for a particular column. If we do not take care of the missing values, it may lead to erroneous data and in turn incorrect results. So before processing the Big Data, we are required to properly treat the missing values so that we get the correct sample. There are various ways to handle missing values.

We can either drop the data or decide to replace it using data imputation.

If the number of missing values is small, then the general practice is to leave them as they are. If the number of cases is larger, then data imputation is done.

There are certain techniques in statistics to estimate the so-called missing values:

  1. Regression
  2. Maximum Likelihood Estimation,
  3. Listwise/pairwise Deletion
  4. Multiple data imputation etc.
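
For illustration, here is a small sketch of the two common options, dropping versus imputing, using pandas and scikit-learn on made-up data.

```python
# Handling missing values: drop the rows, or impute (here with the column mean).
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 41, 37], "income": [50, 62, np.nan, 58]})

# Option 1: drop rows containing missing values (fine when they are few).
dropped = df.dropna()

# Option 2: impute, e.g. replace each missing value with the column mean.
imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns
)

print(dropped)
print(imputed)
```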

Outliers are data points/values that are very far from the group. These do not belong to any particular group/cluster.

The presence of outliers may affect the behavior of the model. So proper care is to be taken to identify and properly treat the outliers.

The outliers may contain valuable and often useful information. So they should be handled very carefully. Most of the time, they are considered to be bad data points but their presence in the data set should also be investigated.

Outliers present in the input data may skew the result. They may mislead the process of training of machine learning algorithms. This results in:

  1. Longer Training Time
  2. Less Accurate Models
  3. Poor Results.

It is observed that many machine learning models are sensitive to:

  1. The range of attribute values
  2. Distribution of attribute values

The presence of outliers may create misleading representations. This will lead to misleading interpretations of the collected data.

As in descriptive statistics, the presence of outliers may skew the mean and standard deviation of the attribute values. The effects can be observed in plots such as scatterplots and histograms.

For some problems, outliers can be more relevant, for example anomalies in:

  1. Fraud detection
  2. Computer security.

Some of the outlier detection methods are:

  1. Extreme Value Analysis: Here we determine the statistical tails of the distribution of data. For example, Statistical methods like 'z-scores' on univariate data.
  2. Probabilistic and Statistical Models: Here we determine the 'unlikely instances' from a 'probabilistic model' of data. For example, the optimization of 'Gaussian mixture' models using 'expectation-maximization'.
  3. Linear Models: Using the linear correlations, the data is modeled into lower dimensions. For example, Data having large residual errors can be outliers.
  4. Proximity-based Models: Here, the data instances which are isolated from the group or mass of the data are determined by Cluster, Density or by the Nearest Neighbor Analysis.
  5. Information-Theoretic Models: Here the outliers can be detected as data instances that increase the complexity of the dataset (minimum code length).
  6. High-Dimensional Outlier Detection: In this method, we search subspaces for the outliers based on distance measures in higher dimensions.
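
As a simple illustration of Extreme Value Analysis, the sketch below flags points whose z-score exceeds 3 on synthetic univariate data.

```python
# Extreme Value Analysis with z-scores: points more than 3 standard deviations
# from the mean are flagged as outliers.
import numpy as np
from scipy.stats import zscore

rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=10, scale=0.5, size=50), 55.0)  # 55.0 is injected

z = zscore(values)
outliers = values[np.abs(z) > 3]

print(outliers)   # [55.]
```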

Without having a proper visualization tool, the Big Data will be of little use. These tools help the user to visualize the data in a visually interactive format. You can interact with data using data visualization tools.

Going beyond analysis, these big data visualizations bring the presentation to life. They get the user interested in the insights, asking more questions and getting detailed answers.

It takes a lot of manual effort to get the data prepared and organized to have easy viewing.

But a proper visualization tool should provide features to do all this automatically without hard manual effort. We can manipulate reams of spreadsheet data into charts, but it will not make sense until it is crunched and presented in a proper visualization format. Otherwise, it will be just a heap of numbers.

High-quality data visualization tools are considered as crucial for a successful data analytics strategy. Data visualization tools present the given data in a pictorial format. They represent the image of the data as a whole giving various insights.

To summarize the data visually, it can be presented in a variety of ways such as Graphs, Histograms, Pie charts, Heat maps, etc. This will let you better understand the meaning and the patterns conveyed by the data.

Thus for insightful analytics, it is imperative to have a good visualization tool.
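
As a tiny illustration of summarising data visually, the sketch below plots a histogram with matplotlib, a general-purpose plotting library used here only for demonstration; the data is synthetic.

```python
# Summarising a numeric column visually with a histogram.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
order_values = rng.lognormal(mean=3, sigma=0.5, size=1000)  # made-up sales data

plt.hist(order_values, bins=30)
plt.title("Distribution of order values")
plt.xlabel("Order value")
plt.ylabel("Number of orders")
plt.show()
```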

There are many tools available for Big Data visualization. Some of the prominent Big Data visualization tools are Tableau, Google Chart, SAS, SPSS, Microsoft Power BI, QlikView, Fusion Chart, Tibco Spotfire, Cognos, etc.

Depending on your business as well as infrastructural requirements and the budgetary provisions, you have to decide which visualization tool will be the best fit for all of your Big Data insight needs.

Choosing the right tool for your data visualization needs is not an easy task. There are so many factors and features to be considered before making a selection for the right data visualization tool.

Following are some of the most sought after features you should consider:

  1. Dashboard: Customizable, Clear and Concise  
  2. Embeddability: It should seamlessly integrate the visual reports with other specific applications.
  3. Interactive Reporting: You should be able to drill down into the details.
  4. Data Collection and Sharing: Importing the available data into the visualization tool and exporting the reports/visualizations into other applications or formats.
  5. Location Intelligence: If your business demands 'geolocation' details, then the tool should have a 'location intelligence' feature.
  6. Data Mining: It should be capable enough to perform the desired data mining operations etc.

There are some other popular tools also in the market such as:

Infogram, Sisense, Datawrapper, ChartBlocks, Domo, RAW, Klipfolio, Visual.ly, Plotly, Ember Charts, Geckoboard, NVD3, Chartio, FusionCharts, HighCharts, D3.js, Chart.js, Chartist.js, Processing.js, Polymaps, Leaflet, n3-charts, Sigma JS, etc.

Choosing the right tools for all of your data visualization needs is a big and very strategic decision. They are not just costly but play a very crucial role in making strategic and informed decisions.

So you have to choose wisely the right data visualization tool depending on your business requirements and the kind of visualizations and reports you desire.

Advanced

By model optimization, we mean building/refining the model in such a way that it is as realistic as it can be. It should reflect the real-life situation as closely as possible. When we apply a model to real-world data, it should give the expected results, so optimization is required. This is achieved by capturing some significant or key components from the dataset.

There are some tools available in the market for optimizing the models. One such tool is the ‘TensorFlow Model Optimization Toolkit’. There are three major components in model optimization:

  1. An objective function.
  2. Decision Variables
  3. Constraints

An objective function is the function that we need to optimize. The solution to a given optimization problem is the set of values of the decision variables for which our objective function reaches its expected optimal value. The values of the decision variables are restricted by the constraints.

The classification of optimization problems is based on the nature of our objective function and the nature of given constraints. In an unconstrained optimization problem, there are no constraints and our objective function can be of any kind - linear/nonlinear. In the linear optimization problem, our objective function is linear in variables and the given constraints are also linear.

In a quadratic optimization problem, our objective function is quadratic in the variables and the given constraints are linear. In a nonlinear optimization problem, our objective function is an arbitrary nonlinear function of the given decision variables.

The given constraints can be linear or they can be nonlinear. The objective of model optimization is to find the optimal values of the given decision variables.
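
As a concrete illustration of these three components, the sketch below uses scipy.optimize with a made-up quadratic objective, two decision variables and linear constraints.

```python
# Decision variables (x, y), a quadratic objective, and linear constraints.
from scipy.optimize import minimize

def objective(v):
    x, y = v
    return (x - 1) ** 2 + (y - 2.5) ** 2           # objective function to minimise

constraints = [
    {"type": "ineq", "fun": lambda v: 3 - (v[0] + v[1])},  # x + y <= 3
]
bounds = [(0, None), (0, None)]                    # x >= 0, y >= 0

result = minimize(objective, x0=[0.0, 0.0], bounds=bounds, constraints=constraints)
print(result.x)   # optimal decision variables, roughly [0.75, 2.25]
```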

Two methods are used to evaluate models:

  1. Hold-out
  2. Cross-validation

We use a test data set to evaluate the performance of the model. This test data set should not be part of the training of the model; otherwise, the model will suffer from overfitting. In the hold-out method, the given data set is divided randomly into three sets:

  1. Training set
  2. Validation set
  3. Test set.

When the data available is limited, we use the cross-validation method. Here, the data set is divided into 'k' equal subsets, and we build a model for each set. It is also known as k-fold cross-validation. The categories of models under supervised learning are:

  1. Regression
  2. Classification.

The corresponding methods for evaluation of these models are also categorized as:

  1. Evaluation of Regression Models
  2. Evaluation of Classification Models.

In the evaluation of regression models, we are concerned with continuous values and measure the error between the actual and the predicted values, whereas in the evaluation of classification models, our concern is the number of correctly and incorrectly classified data points. We compute the confusion matrix and plot the ROC curve to help us better in model evaluation.

Confusion matrix:

From the confusion matrix we find out the following: True Positives, True Negatives, False Positives and False Negatives.

ROC Curve:

ROC stands for Receiver Operating Characteristic. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various thresholds. For evaluating a model using the ROC curve, we measure the area under the curve (AUC).

There are some other evaluation methods also for the evaluation of classification models such as:

  1. Gain and Lift charts
  2. Gini coefficient etc.

The often-used methods are the confusion matrix and the ROC curve.
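
A small scikit-learn sketch of these two methods on a sample dataset; the dataset and model are illustrative choices.

```python
# Evaluating a classifier with a confusion matrix and the area under the ROC curve.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_scores = model.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred))   # [[TN, FP], [FN, TP]]
print(roc_auc_score(y_test, y_scores))    # area under the ROC curve
```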

In Big Data integration, we are required to integrate the various data sources and systems. The policies regarding data collection, extraction, storage as well as processing are bound to change. The various data points have different formats, architectures, tools and technologies, protocols of data transfer, etc. So deciding to capture and use Big Data for your business will involve integrating these various data points and making some changes to the formats, usage, security, etc. It will have some impact on the overall day-to-day operations of the business.

There are several issues in Big Data integration that need to be addressed before going ahead with the process of integration. Some of the issues are:

  1. Consolidation of different verticals
  2. Change in business practices
  3. Change in culture
  4. Initial capital investment
  5. Change in operations etc.

It is likely that many businesses have already deployed their IT infrastructure based on their requirements. So when deciding to have Big Data integration in place, businesses are required to rethink their IT strategies and make the necessary provisions for capital investments.

So initially, while planning for Big Data adoption, we see some reluctance in the organization, as it requires drastic changes at various levels.

In many enterprises, traditionally, the data is stored in silos. The integration of these different data silos is not an easy task as they have different structures and formats.

So, when we are planning for the Big Data integration, the focus should be on long term requirements of the overall Big Data infrastructure and not just the present integration needs.

The traditional platforms for data storage and processing are insufficient to accommodate  Big Data. So, now if you are looking to tap the potential of Big Data, you are required to integrate the various data systems. Here, you are not just integrating among the various Big Data tools and technologies but also with the traditional non-Big Data systems.

Big Data systems are also required to be integrated with other new kinds of data sources, such as streaming data, IoT data, etc. In simpler terms, we can say that Big Data integration combines data originating from a variety of data points, sources and formats, and then provides the user with a unified and translated view of the combined data.

There are some obvious challenges in Big Data integration, such as syncing across various data sources, uncertainty, data management, finding insights, selection of proper tools, skills availability, etc. When you aspire for Big Data integration, attention should also be given to data governance, performance, scalability and security. Big Data integration should start with logical integration, taking into consideration all the aspects and needs of the business as well as the regulatory requirements, and end with the actual physical deployment.

Tools: iWay Big Data Integrator is one example. Hadoop can also play a very big role in Big Data integration. As Hadoop is open source and runs on commodity hardware, enterprises can expect a lot of savings with regard to data storage and processing. You can integrate data systems of various kinds with Hadoop. There are also many open-source tools available such as Flume, Kafka, Sqoop, etc.

In graph analytics of Big Data, we try to model the given problem into a graph database and then perform analysis over that graph to get the required answers to our questions. There are several types of graph analytics used, such as:

  1. Path Analysis
  2. Connectivity Analysis
  3. Community Analysis
  4. Centrality Analysis

Path Analysis is generally used to find out the shortest distance between any two nodes in a given graph.

Route optimization is the best example of Path Analysis. It can be used in applications such as supply chain, logistics, traffic optimization, etc. Connectivity Analysis is used to determine the weaknesses in a network, for example a utility power grid.

The connectivity across a network can also be determined using Connectivity Analysis. Community Analysis is based on density and distance. It can be used to identify the different groups of people in a social network. Centrality Analysis enables us to determine the most 'influential people' in a social network.

Using this analysis, we can find out the web pages that are highly accessed. Various algorithms make use of graph analytics, for example PageRank, Eigenvector Centrality, Closeness Centrality, Betweenness Centrality, etc.

Graphs are made up of nodes/vertices and edges. When applied to real-life examples, 'people' can be considered as nodes, for example customers, employees, social groups, companies, etc. There can be other examples of nodes as well, such as buildings, cities and towns, airports, bus depots, distribution points, houses, bank accounts, assets, devices, policies, products, grids, web pages, etc.

Edges represent relationships, for example social-network likes and dislikes, emails, payment transactions, phone calls, etc. Edges can be directed, non-directed or weighted. Examples of directed edges are 'John transferred money to Smith' and 'Peter follows David on some social platform'. An example of a non-directed edge is 'Sam likes America'. An example of a weighted edge is something like 'the number of transactions between two accounts' or 'the time required to travel between two stations or locations'. In a Big Data environment, we can do graph analytics using Apache Spark 'GraphX' by loading the given data into memory and then running the graph analysis in parallel.

There is also an interface called 'Tinkerpop' that can be used to connect Spark with other graph databases. By this process, you can extract the data out of any graph database and load it into memory for faster graph analysis. For analyzing graphs, we can also use tools such as Neo4j, GraphFrames, etc. GraphFrames is massively scalable.

Graph analytics can be applied to detect fraud, financial crimes, identifying social media influencers,  route optimization, network optimization, etc.
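
As a lightweight illustration of these analyses, the sketch below uses networkx on a tiny made-up "follows" graph; networkx is a single-machine library used here only for demonstration, whereas GraphX or GraphFrames would be used at Big Data scale.

```python
# Path analysis and centrality analysis on a toy social graph.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("John", "Smith"), ("Peter", "David"), ("John", "David"),
    ("Smith", "David"), ("David", "Eve"),
])

# Path analysis: shortest path between two nodes.
print(nx.shortest_path(G, "John", "Eve"))

# Centrality analysis: which nodes are the most "influential"?
print(nx.pagerank(G))
print(nx.betweenness_centrality(G))
```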

In the early days of Big Data, it was accessible only to big businesses. As the technologies related to Big Data were costly, small businesses were not able to make use of them. But with the growth of cloud and allied technologies, now even small enterprises are tapping the potential of Big Data and making the most of it.

More and more businesses are now turning to predictive analytics to drive sales and growth. There is also an increase in the number of connected devices. So, a large amount of data is being generated which contains insights that when harnessed can prove to be a boon for the enterprises. The trend is now to make use of machine learning and AI to gain an extra edge and remain competitive in the market. 

The trend is now shifting from on-premise processing to online/cloud processing. It relieves businesses from heavy upfront investments. They are now able to make use of the latest technologies and tools at a minimal/affordable cost. Because of this pay-per-use trend, nowadays even small enterprises have access to Big Data tools and technologies and are increasing efficiency across all levels.

Data preparation involves collecting, combining, organizing and structuring data so that it can be analyzed for patterns, trends, and insights. Big Data needs to be preprocessed, cleansed, validated and transformed. For this, the required data is pulled in from different sources, internal or external. One of the major focuses of data preparation is that the data under consideration for analysis is consistent and accurate, because only accurate data will produce valid results.

When the data is collected, it is not complete. It may have some missing values, outliers, etc. Data preparation is a major and very important activity in any Big Data project. Only good data will produce good results. Most of the time, the data resides in silos, in different databases, and in different formats, so it needs to be reconciled. There are five D's associated with the process of data preparation. These are:

  1. Discover
  2. Detain
  3. Distill
  4. Document
  5. Deliver

The process of data preparation can be largely automated. Various machine learning algorithms can be used in data preparation, for tasks like filling missing values, renaming fields, ensuring consistency, removing redundancy, etc. There are various terminologies related to the process of data preparation, such as data cleansing, transforming variables, removing outliers, data curation, data enrichment, data structuring and modeling, etc. These terminologies are actually the various processes or activities that are carried out under the umbrella of data preparation.

It is seen that the time spent on data preparation is generally more than the time required for data analysis.

Though the methods used for data preparation are largely automated, it still takes a lot of time to prepare the data, as the volume of data is very large and it tends to grow continuously.
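As a simple illustration of the preparation steps mentioned above (renaming fields, removing redundancy, filling missing values, ensuring consistency), here is a minimal pandas sketch; the column names and data are hypothetical:

```python
import pandas as pd

# Hypothetical raw customer extract pulled from two source systems.
df = pd.DataFrame({
    "cust_id": [1, 2, 2, 3],
    "AGE": [34, None, None, 51],
    "city": ["Pune", "pune", "pune", "Delhi"],
})

df = df.rename(columns={"AGE": "age"})            # field renaming
df = df.drop_duplicates(subset="cust_id")         # removing redundancy
df["age"] = df["age"].fillna(df["age"].median())  # filling missing values
df["city"] = df["city"].str.title()               # ensuring consistency

print(df)
```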

By data cleansing, you can identify which of your data records or entries are incomplete, inaccurate, incorrect or irrelevant. In other words, we can say that data cleansing is nothing but identifying the inaccuracies and redundancies in the dataset. There remain certain issues with the data we collect. These issues must be resolved or rectified before we can apply any kind of processing or analysis to the data. If the data remains unclean, it will give the wrong insights. To have good results, the input data must also be good. For this, data cleansing is required; it is a very important and necessary step in any Big Data project.

Without cleansing the data, you should not proceed further; otherwise, you may end up with incorrect information. The various issues that our input dataset may contain are outlined as follows:

  1. Invalid Values
  2. Different Formats
  3. Attribute Dependencies
  4. Uniqueness
  5. Missing Values
  6. Misspellings
  7. Wrongly Classified Values etc

There are various methods to identify these issues:

  1. Visualization
  2. Outlier Analysis
  3. Validation Code

By the Visualization method, we mean taking a random sample of the data and inspecting whether it is correct or not.

By the Outlier Analysis method, we mean finding extreme or odd values that are not expected in a particular feature. For example, in the 'age' column, we cannot expect a value like 200 or 350.

By the Validation Code method, we mean writing code that can identify whether the data or values under consideration are right or not. Once we have identified the issues, we can apply the corresponding methods to correct them.

Cleansing Big Data can become a time-consuming and cumbersome process, so it is always suggested to start with a small random sample of the data.

Developing rules on a small, valid sample of the data speeds up the time required to get the insights you need, because it reduces the latency associated with the iterative analysis/exploration of Big Data.
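As an illustration of the Visualization and Validation Code methods, here is a minimal pandas sketch; the rules (an age range and a simple e-mail pattern) and the data are hypothetical:

```python
import pandas as pd

# Hypothetical records pulled in for cleansing.
records = pd.DataFrame({
    "age": [29, 350, 41, -3],
    "email": ["a@x.com", "b@x", "c@x.com", "d@x.com"],
})

# Visualization method: inspect a small random sample by eye.
print(records.sample(frac=0.5, random_state=42))

# Validation-code method: simple, illustrative rules applied to the full data.
bad_age = ~records["age"].between(0, 120)                       # outlier ages
bad_email = ~records["email"].str.match(r"[^@]+@[^@]+\.[^@]+")  # malformed emails
print(records[bad_age | bad_email])    # rows failing at least one rule
```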

One obvious question is- why do we need data transformation?

Several reasons make it compulsory to transform the data. In a Big Data kind of environment, we need to make use of every type of data available and from every possible source to draw useful insights out of it that will help the business to grow. These reasons can be:

  1. Making it compatible with the other data
  2. Moving it to the other systems
  3. Joining it with other data
  4. Aggregating the information present in the data.

Several steps can be followed to have a successful data transformation. These steps are:

  1. Data Interpretation or Data Discovery
  2. Data Quality Check - Pre-Translation
  3. Data Translation or Data Mapping
  4. Data Quality Check - Post-Translation.

There are many ways that you can perform the data transformation.

  • You can use scripting to transform the data, i.e. you manually write code to perform the required transformation.
  • You can use automation tools on-premises.
  • Or you can opt for cloud-based automation tools.

The process of data transformation tends to be slow, costly and time-consuming. You have to design an optimized strategy to have a successful data transformation to take place considering all the aspects, business needs, objectives, data governance,  regulatory requirements, security, scalability, etc.

The different methods that can be used for data transformation are:

  • Data binning: Also called data bucketing, this is a data pre-processing technique that reduces the effect of small observational errors. In binning, the sample values are divided into intervals and then replaced by categorical values (illustrated in the sketch after this list).
  • Indicator variables: This technique converts categorical data into Boolean values by creating indicator variables. If a categorical variable has n distinct values, we create n-1 indicator columns.
  • Centering & Scaling: The values of a feature can be centred by subtracting the mean of all its values. To scale the data, the centred feature is divided by the standard deviation.
  • Other techniques: We can use other techniques for data transformation as well, such as grouping outliers under the same value, or replacing a value with the number of times it appears in the column.
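A minimal pandas sketch of the first three techniques (binning, indicator variables, centering and scaling), using hypothetical columns and data:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [22, 35, 58, 71],
    "segment": ["gold", "silver", "bronze", "gold"],
    "income": [30_000, 52_000, 61_000, 45_000],
})

# Data binning: replace a numeric column with categorical intervals.
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 60, 100],
                       labels=["young", "middle", "senior"])

# Indicator variables: n categories -> n-1 Boolean columns.
df = pd.get_dummies(df, columns=["segment"], drop_first=True)

# Centering and scaling: subtract the mean, divide by the standard deviation.
df["income_scaled"] = (df["income"] - df["income"].mean()) / df["income"].std()

print(df)
```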

Dimensionality reduction means reducing the number of dimensions or variables that are under consideration. Big Data contains a large number of variables, and most of the time some of these variables are correlated. So there is always room to select only the major/distinct variables that contribute the most to the result. Such variables are also called Principal Components.

In most cases, some features are redundant. We can always reduce the features where we observe a high correlation. The Dimensionality Reduction technique is also known as 'Low-Dimensional Embedding'.

When the number of variables is huge, it becomes difficult to draw inferences from the given data set.  Visualization also becomes too difficult. So, it is always desirable in such situations to reduce the number of features and utilize only the more significant features. Thus the technique of  Dimensionality Reduction helps a lot in such situations by allowing us to reduce the number of dimensions and speed up our analytics. There are several obvious advantages of Dimensionality Reduction such as:

  1. Reduced storage due to data compression.
  2. Reduced computation time.
  3. Removal of redundant features
  4. Visualization becomes easier.

Dimensionality Reduction may cause some loss of information, but the advantages gained generally outweigh it.

There are two approaches to do Dimensionality Reduction:

  1. Feature Selection
  2. Feature Extraction

The following are the different ways in which we can perform 'Feature Selection' (the Filter Method is sketched in code after this list):

  1. Filter Method
  2. Wrapper Method
  3. Embedded Method
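As an illustration of the Filter Method, here is a minimal sketch using scikit-learn's SelectKBest on synthetic data (assuming scikit-learn is available):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, only a few of which are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Filter method: score each feature independently (ANOVA F-test here)
# and keep only the k highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=3)
X_reduced = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # indices of the selected features
print(X_reduced.shape)                     # (200, 3)
```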

In 'Feature Extraction' we reduce the data from a 'high dimensional space' to a lesser number of dimensions or 'lower-dimensional space'. The process of 'Dimensionality Reduction' can be linear or nonlinear. Several methods are used with  Dimensionality Reduction.

Some of these are:

  1. PCA (Principal Component Analysis)
  2. LDA (Linear Discriminant Analysis)
  3. GDA (Generalized Discriminant Analysis)

When we are using 'Principal Component Analysis', the requirement is that when the data is mapped from a 'higher-dimensional space' to a 'lower-dimensional space', the variance of the data in the 'lower-dimensional space' should be maximum. The following steps are followed in the process of Principal Component Analysis (a short code sketch follows the list):

  1. Constructing the Covariance Matrix of the given data.
  2. Computing the EigenVectors of the computed matrix.
  3. Projecting the data onto the eigenvectors corresponding to the largest eigenvalues, so that most of the variance of the original data is retained.
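A minimal NumPy sketch of these three steps on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # hypothetical data: 100 samples, 5 features
X_centered = X - X.mean(axis=0)

# 1. Covariance matrix of the (centered) data.
cov = np.cov(X_centered, rowvar=False)

# 2. Eigenvectors and eigenvalues of that matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Project onto the eigenvectors with the largest eigenvalues,
#    retaining most of the original variance in fewer dimensions.
order = np.argsort(eigvals)[::-1]
top2 = eigvecs[:, order[:2]]
X_reduced = X_centered @ top2

print(X_reduced.shape)                    # (100, 2)
```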

By using 'Linear Discriminant Analysis' we try to find such a linear combination of features that can separate the two or more classes of objects/events.

The 'Generalized Discriminant Analysis' method is used to provide a mapping of the given 'input vectors' into a 'high-dimensional feature space'.

Big Data is not a business jargon now. It is becoming a necessity. To remain competitive in the market, you have to make use of Big Data. Now, you can not ignore it. So, we see a positive upward trend in the adoption of Big Data across different verticals.

The use cases may be different for different industries. Previously, it was mainly private industries that were utilizing the power of Big Data, but nowadays we see that government organizations are also adopting it. Big Data adoption in different enterprises happens for different reasons.

Some of these are mentioned below:

  1. To improve control over waste management, fraud detection, and abuse
  2. To make efficient resource management
  3. To formulate future strategies and budgeting
  4. To predict equipment failures etc.

There are certain challenges in the adoption of Big Data that need to be properly addressed. Some of these challenges are:

  1. Initial capital investment
  2. Change in culture
  3. Change in business practices
  4. Change in operations etc.

Many government as well as private organizations had already invested heavily in their IT infrastructure before the emergence of Big Data, so they are reluctant to make a sudden move to Big Data. Many organizations have their data stored in silos. There were also no strategies and established practices for the extraction and processing of that data, so making full use of the data was not possible due to different formats and protocols. There was little awareness regarding Big Data integration, and the reluctance to change was high. Most of them wanted to adopt a wait-and-watch strategy. So, initially, we saw slow adoption of Big Data and allied technologies. But nowadays the trend has shifted, and we see that most enterprises want some kind of Big Data adoption.

The Big Data solutions that are available also vary widely. A specific Big Data solution that is suitable for one enterprise may be completely unsuitable for the other. So considering these various challenges, it becomes imperative to see the best practices in the adoption of Big Data solutions. Certain factors are conducive to the adoption of Big Data:

  1. Leadership that is open-minded and holistic
  2. Business needs and objectives must be clearly stated before the implementation of any Big Data technology.
  3. Specific business use cases should be identified and aligned with the business objectives.
  4. A clear strategy and plan to utilize existing IT resources.
  5. A holistic view of the integration plan with the legacy systems.
  6. Emphasis on data governance.
  7. Employees Awareness and training etc.

Attention should be given to the overall Big Data infrastructure and not just the presently required application. While planning for the adoption of Big Data, care should also be taken regarding the proper size of the cluster, the requirement of good commodity hardware, storage and network architecture, and security and compliance considerations.

The overall management of data, including its availability, integrity, usability, security, etc., is termed data governance. For effective data governance, there should be a data governance council, well-defined procedures and an effective plan for the implementation of those procedures and practices.

When the integrity and trustworthiness of the given data are ensured, we get the expected business benefits out of that data. As businesses depend more and more on data for making business decisions, data governance becomes more important and more critical.

Uniform and consistent data access across different business applications should be ensured. A data governance team is responsible for implementing these policies and defined procedures regarding the handling of data. This team may include data managers, business managers and other staff related to the handling of data. Some associations are dedicated to the promotion of best practices in data governance. These are:

  1. Data Governance Institute
  2. Data Management Association
  3. Data Governance Professionals Organization etc.

There can be many use cases where data governance plays a crucial role. Some of these use cases are listed below:

  1. Mergers and Acquisitions,
  2. Business Process Management,
  3. Modernization of legacy systems
  4. Financial and Regulatory Compliance,
  5. Credit Risk Management
  6. Business Intelligence Applications etc.

Various strategies and steps need to be incorporated to have good data governance in place:

  1. You need to decide the data ownership.
  2. Define the policies regarding data storage, availability, backup, security, etc.
  3. Define the standard procedures for authentication and usage of data by the different users in the enterprise.
  4. Ensure good policies regarding data audit and various government compliances.
  5. Ensure data consistency at various levels and across various departments and applications within the enterprise.

Thus, by implementing data governance, we ensure data integrity, consistency, accuracy, accessibility and quality. Defining data ownership is considered the first step in data governance. Then the different processes regarding data storage, back-up, archival, security, etc. need to be defined. Procedures and standards regarding data access and authorization need to be defined. A set of policies and audit controls regarding compliance with the different regulations and company policies should be defined.

Data stewardship means owning accountability for data availability, accessibility, accuracy, consistency, etc. To ensure data stewardship, a team of different people is formed. This team includes data managers, data engineers, business analysts, policymakers, etc. A Data Steward is responsible for data proficiency and the management of an organization's data. He is also expected to handle almost everything related to data policies, processing and data governance, and to look after the organization's information assets in compliance with the different policies and other regulatory obligations.

A Data Steward is supposed to answer the following questions:

  1. What is the importance of this particular data to an organization?
  2. How long should the data be stored?
  3. What could be the improvements in the quality of the data insights?

He also looks after data protection, data authorization and access depending upon the defined roles. Any breach would be immediately noted and brought to the notice of the management. He has to ensure that the practices regarding data retention, archival and disposal comply with the organizational policies and the various regulations in place. While ensuring transparency, he also has to check that data privacy and security are not breached. A Data Steward should ensure the quality of data and should take different measures to keep it intact, in consultation with the various stakeholders.

He acts as an intermediary between the business side/management of the organization and the IT department. Depending on the company culture and the kind of Big Data project concerned, we can have different models or forms of Data Stewards. There can be:

  1. Subject Area wise Data Stewards
  2. Functional Data Stewards
  3. Process wise Data Stewards
  4. System Data Stewards
  5. Project Data Stewards

As far as skills are concerned, a Data Steward should have the following skills:

  1. Programming skills
  2. Database & Warehousing Proficiency
  3. Technical Writing
  4. Business Acumen & Foresight etc.

These are models that are designed to measure an organization's maturity with respect to Big Data. Big Data Maturity Models provide the required tools for an enterprise to assess its Big Data capabilities. They assist an enterprise in formulating its goals and strategies concerning Big Data. Using a Big Data Maturity Model, an enterprise can have clear communication about its Big Data strategy and policy among the various departments and at various levels within the enterprise.

A Big Data Maturity Model can also be used as a means to monitor the progressive journey of an enterprise into the world of Big Data. It also helps in identifying the weak areas and the areas that require more attention to fit into the Big Data arena. A Big Data Maturity Model gives a direction as to how an organization can make the efficient use of its Big Data to achieve the anticipated benefits out of the Big Data. It can also be implied from the use of the Big Data Maturity Model that the more mature a model, the more benefits/revenue an organization can expect. It also helps in curtailing the overall operational expenses.

Yes, there are several categories of Big Data Maturity Models. Mostly they are categorized into three main types:

  1. Descriptive Models
  2. Comparative Models
  3. Prescriptive Models.

The descriptive model helps in assessing the maturity level of an enterprise at various stages. It is described in qualitative terms. It does not provide any recommendation regarding improving the maturity of an organization's Big Data capability. However, it helps you understand the value that is generated out of your investments in Big Data, and from it you can identify the probable steps that can be taken to improve your Big Data potential. The following maturity levels are described in the descriptive model:

  1. Ad-hoc
  2. Foundational
  3. Competitive Differentiating
  4. Breakaway.

The Comparative model gives an idea of the status of your organization relative to your competitors as far as Big Data capability is concerned. It provides a kind of benchmarking that you can use to know your position in the Big Data market.

It consists of quantitative as well as qualitative information to gauge your status/position when compared to your peers. Comparative models consist of various stages/levels in terms of maturity. These are as follows:

  1. Nascent
  2. Pre-adoption
  3. Early-adoption
  4. Corporate adoption
  5. Mature/Visionary

The Prescriptive model first assesses the present situation and the various stages, and then suggests the probable paths by which improvements can be made to increase your Big Data capability/maturity. The prescriptive model also has several phases, which are mentioned as follows:

  1. Undertake Big Data Education
  2. Measure Big Data Readiness
  3. Identify a significant Big Data use case
  4. Structure a Proof of Concept Big Data Project.

The criteria to evaluate Big Data Maturity Model can be stated as follows:

  1. Model structure completeness
  2. The quality of development and evaluation of the model
  3. Ease of use
  4. Value creation

By model structure completeness we mean the completeness of the model in all respect taking into consideration the Big Data needs of the organization. The model should also exhibit consistency across all levels. The second criterion evaluates the model in terms of quality of development and evaluation. The model should exhibit trustworthiness. It should also be stable. In the third criterion, we see the ease of use of the model. The model should be easily applicable to the situations under consideration. It should also be comprehensible.

The fourth criterion states the value creation of the model concerning Big Data. We evaluate whether the model brings any value addition to the enterprise as far as Big Data initiatives are concerned. We also see how close the model is to reality, and we evaluate its relevance and performance.

We evaluate these Big Data Maturity Models taking into consideration the various aspects of the business. These aspects can be stated as follows:

  1. Business Strategy
  2. Information
  3. Analytics
  4. Organizational Culture and Execution
  5. Architecture
  6. Governance

A Big Data Maturity Model assists in a big way in planning and going ahead with Big Data initiatives. First, it helps in formulating the organizational goals and strategies concerning Big Data. It equips the organization with the necessary tools to assess its Big Data capabilities and plan accordingly. Big Data Maturity Models enable an organization to gauge its maturity level concerning Big Data.

An organization can better assess itself on various aspects using a Big Data Maturity Model. It will help an organization to have clear communication with all the staff and across all the domains about the strategies, policies, and initiatives concerning Big Data. An organization can also monitor its implementation of Big Data initiatives and compare itself with the other players in the market who are in the Big Data space. It will help an organization to revisit its goals and make corresponding changes in implementation and strategy as far as the adoption of Big Data is concerned.

By using the Big Data Maturity Model, an organization can identify the areas where improvements are needed to have uniformity and cohesion in the implementation of its Big Data initiatives. In this way, by ensuring the correctness of strategy and guiding as well as gauging the implementation details of the Big Data initiatives, the Big Data Maturity Model plays a vital role not just in planning but also in monitoring the overall Big Data journey of an enterprise.

Yes, there are certain tools to assess the Big Data Maturity Model. We have tools like:

  1. IBM's 'Big Data & Analytics Maturity Model'
  2. 'TDWI Big Data Maturity Model Assessment Tool'
  3. 'Knowledgent Big Data Maturity Assessment'
  4. 'Info-Tech Big Data Maturity Assessment Tool'
  5. 'CSC Big Data Maturity Tool'
  6. 'Radcliffe Big Data Maturity Model'
  7. 'Booz & Company's Model'
  8. 'Van Veenstra's Model'

These tools give benchmarks based on surveys. A survey typically contains around 50 questions across the various aspects of an organization. These aspects are:

  1. Business Strategy
  2. Information
  3. Analytics
  4. Organizational Culture and Execution
  5. Architecture
  6. Governance

These are also called dimensions. A range of benchmark scores determines the maturity level.

For example, the following table gives a glimpse of the maturity level Benchmarking:

Scores per dimension and the corresponding maturity level:

  • Below 15: Nascent
  • 16–25: Pre-Adoption
  • 26–35: Early Adoption
  • 36–45: Corporate Adoption
  • 46–49: Mature
  • 50: Visionary

Its interpretation can be as follows: if you get a score of 19 for a particular aspect/dimension, it means you are in the pre-adoption level/stage for that dimension. These assessment tools guide you regarding the progress of your Big Data journey. This progress may not be uniform across all the dimensions; a variation in score may be observed for different dimensions.

You may, for instance, be in the corporate adoption stage for data management but, as far as analytics is concerned, in the pre-adoption stage. Thus the tools to assess the Big Data Maturity Model are very helpful and give you an understanding of your maturity in the Big Data space.
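As a small illustration of how such a benchmark can be interpreted, here is a sketch of a function that maps a per-dimension score to the levels in the table above (the boundary score of 15, which the published ranges leave unassigned, is treated as Nascent here):

```python
def maturity_level(score: int) -> str:
    """Map a per-dimension survey score to the benchmark levels listed above."""
    if score <= 15:
        return "Nascent"
    if score <= 25:
        return "Pre-Adoption"
    if score <= 35:
        return "Early Adoption"
    if score <= 45:
        return "Corporate Adoption"
    if score <= 49:
        return "Mature"
    return "Visionary"

print(maturity_level(19))   # 'Pre-Adoption', matching the interpretation above
```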

There are various options to use as a messaging system with Big Data. These kinds of systems are widely known as 'Distributed Commit Log' technologies. Apart from delivery guarantees and flexibility, these messaging systems are oriented towards throughput and scalability. Some of the most used Big Data messaging systems are:

  1. Apache Kafka
  2. Microsoft Event Hubs
  3. Amazon Kinesis
  4. Google Pub/Sub
  5. RabbitMQ

'Apache Kafka' is a messaging system that is distributed in nature. It allows publish-subscribe style messaging in a data pipeline. It is a highly scalable and fast messaging system that centralizes communication between large data systems, and it is mostly used as a 'Central Messaging System'. Kafka is fault-tolerant and therefore more reliable, and it gives high performance. It was originally developed at LinkedIn. 'Microsoft Event Hubs' are event ingestors; they receive and process millions of events per second, with producers sending events to an event hub over AMQP/HTTPS. 'Amazon Kinesis' is a cloud-based messaging service that processes data in real time. 'Google Pub/Sub' is also a cloud-based messaging service: 'Consumers' subscribe to a topic and 'Publishers' send messages to a topic. 'RabbitMQ' is a message queuing system that serves as middleware between the 'Producers' and the 'Consumers'.
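As a small illustration of the publish-subscribe pattern, here is a minimal sketch using the kafka-python client (an assumption; the broker address and the topic name 'events' are hypothetical):

```python
# Minimal publish/subscribe sketch with the kafka-python client
# (assumes a broker reachable at localhost:9092 and a topic named 'events').
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b'{"user": "john", "action": "payment"}')
producer.flush()

consumer = KafkaConsumer("events",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)
```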

Lambda architecture is a Big Data processing architecture. To handle the enormous quantities of data, the lambda architecture makes use of batch as well as stream processing methods. It is a fault-tolerant architecture and achieves a balance between latency and throughput. Lambda architecture makes use of the model of data that has an append-only, immutable data source which serves as a system of record.

In this architecture, new events are appended to the existing events. The new events do not overwrite existing events. The lambda architecture is designed for ingesting and the processing of timestamp-based events. The state can be determined from the 'natural', 'time-based' ordering of the data.

In Lambda architecture, we have a system that consists of three layers:

  1. Batch processing
  2. Real-time processing
  3. Serving layer

The third layer is to respond to queries. The data is ingested to the processing layers from a master copy of the entire data set. This master copy is immutable.  The real-time processing layer processes the data streams in real-time. It does not require completeness or any fix-ups.

This layer provides real-time views on the most recent data, so the latency is minimized but the throughput is sacrificed. Real-time processing is also termed speed processing.

As there is a lag before the batch layer can provide views on the most recent data, we can say that the speed layer does the work of filling this gap. The benefit we get from the speed layer is that a view is available immediately once we receive the data. This view may not be as complete as the view generated by the batch layer. However, you always have the choice to replace the view produced by the speed layer with the batch layer's view once that data is made available to the batch layer. The output obtained from the batch layer and the speed layer is stored in the serving layer. In response to ad-hoc queries, this serving layer returns the pre-computed views or builds views from the processed data.
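As a toy illustration of the three layers, here is a Python sketch in which in-memory lists stand in for the immutable master dataset and the recent event stream, and per-user event counts serve as the 'view':

```python
from collections import Counter

# Immutable, append-only master dataset (input to the batch layer).
master_events = [("john", 1), ("smith", 1), ("john", 1)]

def compute_view(events):
    """Count events per user -- the 'view' both layers produce."""
    view = Counter()
    for user, n in events:
        view[user] += n
    return view

# Batch layer: complete but slow; periodically recomputes over all history.
batch_view = compute_view(master_events)

# Speed layer: fast but partial; covers only events since the last batch run.
recent_events = [("john", 1), ("peter", 1)]
speed_view = compute_view(recent_events)

# Serving layer: answers ad-hoc queries by merging the pre-computed views.
def query(user):
    return batch_view[user] + speed_view[user]

print(query("john"))   # 3 = 2 from the batch view + 1 from the speed view
```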

Data enrichment is a process to improve, refine or enhance the data. It is something like adding some additional details to the existing data. It also includes adding external data from some trusted sources to the existing data. Data enrichment helps you to have complete and accurate data. More informed decisions can be made by having enriched data. As data is the most valuable asset in the Big Data world, it must be ensured that the data is in good condition. It should not be incomplete, missing, redundant or inaccurate. If we do not have good data, we can not expect good results out of it. 

What we mean by good data is that it should be complete and accurate. The process of data enrichment helps us to add more details to the existing data so that it becomes complete.

Incomplete or scanty data cannot give a bigger or complete picture of your customers. If you have insufficient information about your customers, you may not be able to give the expected service or customized offerings. This affects the business conversion rate and, ultimately, the business revenue. So having data in a good and complete condition is a must for Big Data analytics to give correct insights and hence produce the expected results. Data enrichment involves refining data that may be insufficient, inaccurate or contain small errors. Extrapolating data is also a kind of data enrichment; here we produce more data from the available raw data. There are several types of data enrichment methods. Out of these, the two significant methods are listed below (demographic enrichment is sketched in code after the list):

  1. Demographic Data Enrichment
  2. Geographic Data Enrichment
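For example, demographic enrichment often amounts to joining a trusted external reference table onto the existing records. A minimal pandas sketch with hypothetical column names:

```python
import pandas as pd

# Existing customer records (incomplete on their own).
customers = pd.DataFrame({
    "cust_id": [101, 102, 103],
    "zip_code": ["411001", "560001", "110001"],
})

# External, trusted reference data keyed by zip code (hypothetical source).
demographics = pd.DataFrame({
    "zip_code": ["411001", "560001", "110001"],
    "city": ["Pune", "Bengaluru", "Delhi"],
    "median_income": [48_000, 55_000, 60_000],
})

# Demographic/geographic enrichment: left-join the external attributes on.
enriched = customers.merge(demographics, on="zip_code", how="left")
print(enriched)
```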

It is up to you to decide what kind of data enrichment you need depending on your business requirements and objectives. Data enrichment is not a one time process, it is to be done continuously because the customer data tends to change with time. There are several data enrichment tools available. Some of these are:

  1. Clearbit
  2. Datanyze
  3. LeadGenius
  4. Leadspace
  5. Reachforce
  6. FullContact
  7. RingLead
  8. DemandGen
  9. ZoomInfo etc.

Outliers are observations that appear far away from the rest of the group. They diverge from the overall pattern outlined by the given sample. Due to the presence of outliers in the dataset, we can observe a drastic change in the results. There are various unfavourable effects of outliers in a data set. Some of the impacts can be stated as follows:

  1. It may increase the error variance.
  2. Normality may get decreased.
  3. It may decrease the power of various statistical tests.
  4. We may get biased estimates.

Outliers must not be ignored and should be properly treated as their presence may change the basic assumptions in statistical modelling. The results may get skewed due to the presence of outliers. Before applying procedures to deal with the outliers, we should always try to reason out the presence of outliers.

If we know the reason for the presence of outliers in our dataset, we can use the methods accordingly, to deal with the outliers. The reasons for having outliers in the dataset can be as follows:

  1. Non-natural (Data Errors)
  2. Natural (True Outliers)

The non-natural reasons for outliers can be :

  1. Data Entry Errors
  2. Measurement Error
  3. Sampling Error
  4. Experimental Error
  5. Data Processing Error etc.

Natural or true outliers can be originally present in the dataset. To deal with outliers, the following approaches can be used:

  1. Deleting observations
  2. Transformation
  3. Binning
  4. Imputing values
  5. Treating as a separate group
  6. Other statistical methods.

Trimming can also be used at both extremes to remove outliers. Weights can also be assigned to different observations, and the mean, mode or median can be used to replace outliers. Before imputing values, we should analyze whether an outlier is natural or artificial. If the outliers are significantly large in number, it is advisable to treat them as a separate group; we can then build corresponding models for both groups and combine the outputs.
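As an illustration, here is a minimal pandas sketch that detects outliers with the common 1.5×IQR rule and then shows two of the treatments mentioned above, capping (a form of trimming) and imputing with the median:

```python
import pandas as pd

ages = pd.Series([23, 27, 31, 29, 35, 28, 350])      # 350 is an obvious outlier

q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (ages < lower) | (ages > upper)
print(ages[is_outlier])                               # detected outliers

# Treatment option 1: cap (trim) values at the IQR fences.
capped = ages.clip(lower=lower, upper=upper)

# Treatment option 2: impute outliers with the median of the remaining values.
imputed = ages.mask(is_outlier, ages[~is_outlier].median())
print(capped.tolist(), imputed.tolist())
```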

Using the cloud for Big Data development is a good choice. It helps businesses to increase their operational efficiency with a minimal initial investment; they pay only for the facilities they use. Furthermore, they can upgrade or downgrade the facilities as per changing business requirements.

For some enterprises, deploying Big Data technologies on their own premises proves to be a costly affair. Most of the time, they do not possess the required expertise to deal with Big Data deployment. Furthermore, the initial investments in these technologies are high, their business requirements keep changing with market conditions, and the tools and technologies related to Big Data also tend to evolve with the changing requirements. So keeping up to date with the latest versions/tools proves to be costly for these enterprises, and the cloud seems to be a better alternative to start a Big Data initiative with. There are several players in the cloud space. The major Big Data cloud providers are:

  1. Amazon Web Services
  2. Microsoft Azure
  3. Google Cloud Platform
  4. Rackspace
  5. Qubole etc.

Developing a cloud-based solution for Big Data requires a lot of awareness regarding the various Big Data offerings of different cloud providers. First of all, a business should be very clear about its requirements regarding Big Data. These requirements can be something like:

  1. Kind of insights needed
  2. Data Sources
  3. Storage needs
  4. Processing requirements (batch/real-time) etc.

Once you are clear in your Big Data requirements and strategies for future developments, you can choose a better combination of the storage solutions, processing platforms and the analytical tools to get the required results from your Big Data initiatives.

Depending on the business constraints and the state regulations, you can decide to opt for some Big Data solutions from the cloud while some tools can be employed within the enterprise to have a better tradeoff. This way you can ensure conformance with the various regulations as well as make efficient use of the available resources and budgetary provisions.

In Big Data projects, one of the greatest concerns is data availability and accessibility. The cloud providers assure 99.9% uptime. They also employ various data checking and security mechanisms to ensure data availability all the time. Making such provisions at an enterprise level requires heavy investment, not just in capital but also in tackling the operational challenges. It is mostly observed that, despite sound planning, demand is difficult to anticipate; this results in under- or over-allocation of resources, which ultimately affects your investments. The cloud, in contrast, enables new services, products and projects to start on a small scale with minimal or very low costs, which gives a lot of room for innovation. So, opting for the cloud seems to be a better choice as far as the initial journey into the world of Big Data is concerned.

Description

Big Data is an expression related to an extensive amount of both structured and unstructured data, so large that it is tough to process using traditional database and software techniques. In the majority of enterprise scenarios, the volume of data is too big, it moves too fast or it exceeds current processing capacity. Professionals having big data training usually make use of big data in enterprise scenarios.

Big Data is capable of helping companies improve operations and make faster and more rational judgments. The data is collected from a host of sources that includes emails, mobile devices, applications, databases, servers, and other sources. When captured, this data is formatted, manipulated, stored and then analyzed. It can benefit a business to achieve valuable insight to increase revenues, acquire or maintain customers and develop operations.

When specifically used by vendors, the term 'Big Data' may apply to the technology, including the tools and processes, that a company needs to manage large amounts of data and storage facilities. Considered to have originated with web search companies, Big Data is for those who need to address queries over very large, distributed aggregations of loosely structured data.

Big Data is required in multiple industries globally, such as Government bodies, International development, Manufacturing, Healthcare, Education, Media, Insurance, Internet of Things (IoT) and Information Technology. You can work on similar industry-grade case studies with a big data and hadoop course.

Big Data professionals are among the most sought-after by top companies like Google, Apple, NetApp, Qualcomm, Intuit, Adobe, Salesforce, FactSet, and GE, which treat Big Data as one of their most important technologies.

Interviews are never a walk in the park for anybody. One requires a systematic approach to clear any interview. Here is where we come to your rescue with these Big Data interview questions for experienced professionals and freshers. You will need to respond promptly and efficiently to the questions asked by employers. These interview questions on Big Data are fairly standard, so your prospective recruiters will anticipate that you can answer them. These Big Data interview questions and answers will give you the needed confidence to ace the interview.

These Big Data programming interview questions are relevant to your job roles like Data Science, Machine Learning or just Big Data coding. Suggested by experts, these Big Data developer interview questions have proven to be of great value. These Big Data basic interview questions are helpful to both the job aspirants and even the recruiters who need to know the appropriate questions that they need to ask to assess a candidate.

It's time to act and make a mark in your career with the next Big Data interview. Build your future and all the best!
