Search

What is Big Data — An Introductory Guide

The massive world of Big DataIf one strolls around any IT office premises, over every decade (nowadays time span is even lesser, almost every 3-4 years) one would overhear professionals discussing new jargons from the hottest trends in technology. Around 5 -6 years ago, one such word has started ruling IT services is ‘BIG data’ and still has been interpreted by a layman to tech geeks in various ways.Although services industries started talking about big data solutions widely from 5-6 years, it is believed that the term was in use since the 1990s by John Mashey from Silicon Graphics, whereas credit for coining the term ‘big data’ aligning to its modern definition goes to Roger Mougalas from O’Reilly Media in 2005.Let’s first understand why everyone going gaga about ‘BIG data’ and what are the real-world problems it is supposed to solve and then we will try to answer what and how aspects of it.Why is Big Data essential for today’s digital world?Pre smart-phones era, internet and web world were around for many years, but smart-phones made it mobile with on-the-go usage. Social Media, mobile apps started generating tons of data. At the same time, smart-bands, wearable devices ( IoT, M2M ), have given newer dimensions for data generation. This newly generated data became a new oil to the world. If this data is stored and analyzed, it has the potential to give tremendous insights which could be put to use in numerous ways.You will be amazed to see the real-world use cases of BIG data. Every industry has a unique use case and is even unique to every client who is implementing the solutions. Ranging from data-driven personalized campaigning (you do see that item you have browsed on some ‘xyz’ site onto Facebook scrolling, ever wondered how?) to predictive maintenance of huge pipes across countries carrying oils, where manual monitoring is practically impossible. To relate this to our day to day life, every click, every swipe, every share and every like we casually do on social media is helping today’s industries to take future calculated business decisions. How do you think Netflix predicted the success of ‘House of Cards’ and spent $100 million on the same? Big data analytics is the simple answer.Talking about all this, the biggest challenge in the past was traditional methods used to store, curate and analyze data, which had limitations to process this data generated from newer sources and which were huge in volumes generated from heterogeneous sources and was being generated  really fast(To give you an idea, roughly 2.5 quintillion data is generated per day as on today – Refer infographic released by Domo called “Data Never Sleeps 5.0.” ), Which given rise to term BIG data and related solutions.Understanding Big Data: Experts’ viewpoint BIG data literally means Massive data (loosely > 1TB) but that’s not the only aspect of it. Distributed data or even complex datasets which could not be analyzed through traditional methods can be categorized into ‘Big data’ and hence Big data theoretical definition makes a lot of sense with this background:“Gartner (2012) defines, Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”Generic data possessing characteristics of big data are 3Vs namely Variety, Velocity, and VolumeBut due to the changing nature of data in today’s world and to gain most insights of it, 3 more Vs are added to the definition of BIG DATA, namely Variability, Veracity and Value.The diagram below illustrates each V in detail:Diagram: 6 V’s of Big DataThis 6Vs help understanding the characteristics of “BIG Data” but let’s also understand types of data in BIG Data processing.  “Variety” of above characteristics caters to different types of data can be processed through big data tools and technologies. Let’s drill down a bit for understanding what those are:Structured ex. Mainframes, traditional databases like Teradata, Netezza, Oracle, etc.Unstructured ex. Tweets, Facebook posts, emails, etc.Semi/Multi structured or Hybrid ex. E-commerce, demographic, weather data, etc.As the technology is advancing, the variety of data is available and its storage, processing, and analysis are made possible by big data. Traditional data processing techniques were able to process only structured data.Now, that we understand what big data and limitations of old traditional techniques are of handling such data, we could safely say, we need new technology to handle this data and gain insights out of it. Before going further, do you know, what were the traditional data management techniques?Traditional Techniques of Data Processing are:RDBMS (Relational Database Management System)Data warehousing and DataMartOn a high level, RDBMS catered to OLTP needs and data warehousing/DataMart facilitated OLAP needs. But both the systems work with structured data.I hope. now one can answer, ‘what is big data?’ conceptually and theoretically both.So, it’s time that we understand how it is being done in actual implementations.only storing of “big data” will not help the organizations, what’s important is to turn data into insights and business value and to do so, following are the key infrastructure elements:Data collectionData storageData analysis andData visualization/outputAll major big data processing framework offerings are based on these building blocks.And in an alignment of the above building blocks, following are the top 5 big data processing frameworks that are currently being used in the market:1. Apache Hadoop : Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.First up is the all-time classic, and one of the top frameworks in use today. So prevalent is it, that it has almost become synonymous with Big Data.2 Apache Spark : unified analytics engine for large-scale data processing.Apache Spark and Hadoop are often contrasted as an "either/or" choice,  but that isn't really the case.Above two frameworks are popular but apart from that following 3 are available and are comparable frameworks:3. Apache Storm : free and open source distributed real-time computation system. You can also take up Apache Storm training to learn more about Apache Storm.4. Apache Flink : streaming dataflow engine, aiming to provide facilities for distributed computation over streams of data. Treating batch processes as a special case of streaming data, Flink is effectively both batch and real-time processing framework, but one which clearly puts streaming first.5. Apache Samza : distributed Stream processing framework.Frameworks help processing data through building blocks and generate required insights. The framework is supported by the whopping number of tools providing the required functionality.Big Data processing frameworks and technology landscapeBig data tools and technology landscape can be better understood with layered big data architecture. Give a good read to a great article by Navdeep singh Gill on XENONSTACK for understanding the layered architecture of big data.By taking inspiration from layered architecture, different available tools in the market are mapped to layers to understand big data technology landscape in depth. Note that, layered architecture fits very well with infrastructure elements/building blocks discussed in the above section.Few of the tools are briefed below for further understanding:  1. Data Collection / Ingestion Layer Cassandra: is a free and open-source, distributed, wide column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failureKafka: is used for building real-time data pipelines and streaming apps. Event streaming platformFlume: log collector in HadoopHBase: columnar database in Hadoop2. Processing Layer Pig: scripting language in the Hadoop frameworkMapReduce: processing language in Hadoop3. Data Query Layer Impala: Cloudera Impala:  modern, open source, distributed SQL query engine for Apache Hadoop. (often compared with hive)Hive: Data Warehouse software for data Query and analysisPresto: Presto is a high performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Apache Kafka, and MongoDB4. Analytical EngineTensorFlow: n source machine learning library for research and production.5. Data storage LayerIgnite: open-source distributed database, caching and processing platform designed to store and compute on large volumes of data across a cluster of nodesPhoenix: hortonworks: Apache Phoenix is an open source, massively parallel, relational database engine supporting OLTP for Hadoop using Apache HBase as its backing storePolyBase: s a new feature in SQL Server 2016. It is used to query relational and non-relational databases (NoSQL). You can use PolyBase to query tables and files in Hadoop or in Azure Blob Storage. You can also import or export data to/from Hadoop.Sqoop: ETL toolBig data in EXCEL: Few people like to process big datasets with current excel capabilities and it's known as Big Data in Excel6. Data Visualization LayerMicrosoft HDInsight: Azure HDInsight is a Hadoop service offering hosted in Azure that enables clusters of managed Hadoop instances. Azure HDInsight deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data with high reliability and availability. Hadoop administration training will give you all the technical understanding required to manage a Hadoop cluster, either in a development or a production environment.Best Practices in Big Data  Every organization, industry, business, may it be small or big wants to get benefit out of “big data” but it's essential to understand that it can prove of maximum potential only if organization adhere to best practices before adapting big data:Answering 5 basic questions help clients know the need for adapting Big Data for organizationTry to answer why Big Data is required for the organization. What problem would it help solve?Ask the right questions.Foster collaboration between business and technology teams.Analyze only what is required to use.Start small and grow incrementally.Big Data industry use-cases We talked about all the things in the Big Data world except real use cases of big data. In the starting, we did discuss few but let me give you insights into the real world and interesting big data use cases and for a few, it’s no longer a secret ☺. In fact, it’s penetrating to the extent you name the industry and plenty of use cases can be told. Let’s begin.1. Streaming PlatformsAs I had given an example of ‘House of Cards’ at the start of the article, it’s not a secret that Netflix uses Big Data analytics. Netflix spent $100mn on 26 episodes of ‘House of Cards’ as they knew the show would appeal to viewers of original British House of Cards and built in director David Fincher and actor Kevin Spacey. Netflix typically collects behavioral data and it then uses this data to create a better experience for the user.But Netflix uses Big Data for more than that, they monitor and analyze traffic details for various devices, spot problem areas and adjust network infrastructure to prepare for future demand. (later is action out of Big Data analytics, how big data analysis is put to use). They also try to get insights into types of content viewers to prefer and help them make informed decisions.   Apart from Netflix, Spotify is also a known great use case.2. Advertising and Media / Campaigning /EntertainmentFor decades marketers were forced to launch campaigns while blindly relying on gut instinct and hoping for the best. That all changed with digitization and big data world. Nowadays, data-driven campaigns and marketing is on the rise and to be successful in this landscape, a modern marketing campaign must integrate a range of intelligent approaches to identify customers, segment, measure results, analyze data and build upon feedback in real time. All needs to be done in real time, along with the customer’s profile and history, based on his purchasing patterns and other relevant information and Big Data solutions are the perfect fit.Event-driven marketing is also could be achieved through big data, which is another way of successful marketing in today’s world. That basically indicates, keeping track of events customer are directly and indirectly involved with and campaign exactly when a customer would need it rather than random campaigns. For. Ex if you have searched for a product on Amazon/Flipkart, you would see related advertisements on other social media apps you casually browse through. Bang on, you would end up purchasing it as you anyway needed options best to choose from.3. Healthcare IndustryHealthcare is one of the classic use case industries for Big Data applications. The industry generates a huge amount of data.Patients medical history, past records, treatments given, available and latest medicines, Medicinal latest available research the list of raw data is endless.All this data can help give insights and Big Data can contribute to the industry in the following ways:Diagnosis time could be reduced, and exact requirement treatment could be started immediately. Most of the illnesses could be treated if a diagnosis is perfect and treatment can be started in time. This can be achieved through evidence-based past medical data available for similar treatments to doctor treating the illness, patients’ available history and feeding symptoms real-time into the system.  Government Health department can monitor if a bunch of people from geography reporting of similar symptoms, predictive measures could be taken in nearby locations to avoid outbreak as a cause for such illness could be the same.   The list is long, above were few representative examples.4. SecurityDue to social media outbreak, today, personal information is at stake. Almost everything is digital, and majority personal information is available in the public domain and hence privacy and security are major concerns with the rise in social media. Following are few such applications for big data.Cyber Crimes are common nowadays and big data can help to detect, predicting crimes.Threat analysis and detection could be done with big data.  5. Travel and TourismFlight booking sites, IRCTC track the clicks and hits along with IP address, login information, and other details and as per demand can do dynamic pricing for the flights/ trains. Big Data helps in dynamic pricing and mind you it’s real time. Am sure each one of us has experienced this. Now you know who is doing it :DTelecommunications, Public sector, Education, Social media and gaming, Energy and utility every industry have implemented are implementing several of these Big Data use cases day in and day out. If you look around am sure you would find them on the rise.Big Data is helping everyone industries, consumers, clients to make informed decisions, whatever it may be and hence wherever there is such a need, Big Data can come handy.Challenges faced by Big Data in the real world for adaptationAlthough the world is going gaga about big data, there are still a few challenges to implement and adopt Big Data and hence service industries are still striving towards resolving those challenges to implement best Big Data solution without flaws.An October 2016 report from Gartner found that organizations were getting stuck at the pilot stage of their big data initiatives. "Only 15 percent of businesses reported deploying their big data project to production, effectively unchanged from last year (14 per cent)," the firm said.Let’s discuss a few of them to understand what are they?1. Understanding Big Data and answering Why for the organization one is working with.As I started the article saying there are many versions of Big Data and understanding real use cases for organization decision makers are working with is still a challenge. Everyone wants to ride on a wave but not knowing the right path is still a struggle. As every organization is unique thus its utmost important to answer ‘why big data’ for each organization. This remains a major challenge for decision makers to adapt to big data.2. Understanding Data sources for the organizationIn today’s world, there are hundreds and thousands of ways information is being generated and being aware of all these sources and ingest all of them into big data platforms to get accurate insight is essential. Identifying sources is a challenge to address.It's no surprise, then, that the IDG report found, "Managing unstructured data is growing as a challenge – rising from 31 per cent in 2015 to 45 per cent in 2016."Different tools and technologies are on the rise to address this challenge.3. Shortage if Big Data Talent and retaining themBig Data is changing technology and there are a whopping number of tools in the Big Data technology landscape. It is demanded out of Big Data professionals to excel in those current tools and keep up self to ever-changing needs. This gets difficult for employees and employers to create and retain talent within the organization.The solution to this would be constant upskilling, re-skilling and cross-skilling and increasing budget of organization for retaining talent and help them train.4. The Veracity VThis V is a challenge as this V means inconsistent, incomplete data processing. To gain insights through big data model, the biggest step is to predict and fill missing information.This is a tricky part as filling missing information can lead to decreasing accuracy of insights/ analytics etc.To address this concern, there is a bunch of tools. Data curation is an important step in big data and should have a proper model. But also, to keep in mind that Big Data is never 100% accurate and one must deal with it.5. SecurityThis aspect is given low priority during the design and build phases of Big Data implementations and security loopholes can cost an organization and hence it’s essential to put security first while designing and developing Big Data solutions. Also, equally important to act responsibly for implementations for regulatory requirements like GDPR.  6. Gaining Valuable InsightsMachine learning data models go through multiple iterations to conclude on insights as they also face issues like missing data and hence the accuracy. To increase accuracy, lots of re-processing is required, which has its own lifecycle. Increasing accuracy of insights is a challenge and which relates to missing data piece. Which most likely can be addressed by addressing missing data challenge.This can also be caused due to unavailability of information from all data sources. Incomplete information would lead to incomplete insights which may not benefit to required potential.Addressing these discussed challenges would help to gain valuable insights through available solutions.With Big Data, the opportunities are endless. Once understood, the world is yours!!!!Also, now that you understand BIG DATA, it's worth understanding the next steps:Gary King, who is a professor at Harvard says “Big data is not about the data. It is about the analytics”You can also take up Big Data and Hadoop training to enhance your skills furthermore.Did the article helps you to understand today’s massive world of big data and getting a sneak peek into it Do let us know through the comment section below?
Rated 4.5/5 based on 2 customer reviews

What is Big Data — An Introductory Guide

27K
What is Big Data — An Introductory Guide

The massive world of Big Data

If one strolls around any IT office premises, over every decade (nowadays time span is even lesser, almost every 3-4 years) one would overhear professionals discussing new jargons from the hottest trends in technology. Around 5 -6 years ago, one such word has started ruling IT services is ‘BIG data’ and still has been interpreted by a layman to tech geeks in various ways.

Although services industries started talking about big data solutions widely from 5-6 years, it is believed that the term was in use since the 1990s by John Mashey from Silicon Graphics, whereas credit for coining the term ‘big data’ aligning to its modern definition goes to Roger Mougalas from O’Reilly Media in 2005.

Let’s first understand why everyone going gaga about ‘BIG data’ and what are the real-world problems it is supposed to solve and then we will try to answer what and how aspects of it.

Why is Big Data essential for today’s digital world?

Pre smart-phones era, internet and web world were around for many years, but smart-phones made it mobile with on-the-go usage. Social Media, mobile apps started generating tons of data. At the same time, smart-bands, wearable devices ( IoT, M2M ), have given newer dimensions for data generation. This newly generated data became a new oil to the world. If this data is stored and analyzed, it has the potential to give tremendous insights which could be put to use in numerous ways.

You will be amazed to see the real-world use cases of BIG data. Every industry has a unique use case and is even unique to every client who is implementing the solutions. Ranging from data-driven personalized campaigning (you do see that item you have browsed on some ‘xyz’ site onto Facebook scrolling, ever wondered how?) to predictive maintenance of huge pipes across countries carrying oils, where manual monitoring is practically impossible. To relate this to our day to day life, every click, every swipe, every share and every like we casually do on social media is helping today’s industries to take future calculated business decisions. How do you think Netflix predicted the success of ‘House of Cards’ and spent $100 million on the same? Big data analytics is the simple answer.

Talking about all this, the biggest challenge in the past was traditional methods used to store, curate and analyze data, which had limitations to process this data generated from newer sources and which were huge in volumes generated from heterogeneous sources and was being generated  really fast(To give you an idea, roughly 2.5 quintillion data is generated per day as on today – Refer infographic released by Domo called “Data Never Sleeps 5.0.” ), Which given rise to term BIG data and related solutions.

Understanding Big Data: Experts’ viewpoint 

BIG data literally means Massive data (loosely > 1TB) but that’s not the only aspect of it. Distributed data or even complex datasets which could not be analyzed through traditional methods can be categorized into ‘Big data’ and hence Big data theoretical definition makes a lot of sense with this background:

“Gartner (2012) defines, Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”

Generic data possessing characteristics of big data are 3Vs namely Variety, Velocity, and Volume

But due to the changing nature of data in today’s world and to gain most insights of it, 3 more Vs are added to the definition of BIG DATA, namely Variability, Veracity and Value.

The diagram below illustrates each V in detail:

 6 V’s of Big Data

Diagram: 6 V’s of Big Data

This 6Vs help understanding the characteristics of “BIG Data” but let’s also understand types of data in BIG Data processing.  
“Variety” of above characteristics caters to different types of data can be processed through big data tools and technologies. Let’s drill down a bit for understanding what those are:

  1. Structured ex. Mainframes, traditional databases like Teradata, Netezza, Oracle, etc.
  2. Unstructured ex. Tweets, Facebook posts, emails, etc.
  3. Semi/Multi structured or Hybrid ex. E-commerce, demographic, weather data, etc.

As the technology is advancing, the variety of data is available and its storage, processing, and analysis are made possible by big data. Traditional data processing techniques were able to process only structured data.

Now, that we understand what big data and limitations of old traditional techniques are of handling such data, we could safely say, we need new technology to handle this data and gain insights out of it. Before going further, do you know, what were the traditional data management techniques?

Traditional Techniques of Data Processing are:

  1. RDBMS (Relational Database Management System)
  2. Data warehousing and DataMart

On a high level, RDBMS catered to OLTP needs and data warehousing/DataMart facilitated OLAP needs. But both the systems work with structured data.

I hope. now one can answer, ‘what is big data?’ conceptually and theoretically both.

So, it’s time that we understand how it is being done in actual implementations.

only storing of “big data” will not help the organizations, what’s important is to turn data into insights and business value and to do so, following are the key infrastructure elements:

  • Data collection
  • Data storage
  • Data analysis and
  • Data visualization/output

All major big data processing framework offerings are based on these building blocks.

Traditional Techniques of Data Processing

And in an alignment of the above building blocks, following are the top 5 big data processing frameworks that are currently being used in the market:

1. Apache Hadoop : Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.First up is the all-time classic, and one of the top frameworks in use today. So prevalent is it, that it has almost become synonymous with Big Data.

2 Apache Spark : unified analytics engine for large-scale data processing.

Apache Spark and Hadoop are often contrasted as an "either/or" choice,  but that isn't really the case.

Above two frameworks are popular but apart from that following 3 are available and are comparable frameworks:

3. Apache Storm : free and open source distributed real-time computation system. You can also take up Apache Storm training to learn more about Apache Storm.

4. Apache Flink : streaming dataflow engine, aiming to provide facilities for distributed computation over streams of data. Treating batch processes as a special case of streaming data, Flink is effectively both batch and real-time processing framework, but one which clearly puts streaming first.

5. Apache Samza : distributed Stream processing framework.

Frameworks help processing data through building blocks and generate required insights. The framework is supported by the whopping number of tools providing the required functionality.

Big Data processing frameworks and technology landscape

Big data tools and technology landscape can be better understood with layered big data architecture. Give a good read to a great article by Navdeep singh Gill on XENONSTACK for understanding the layered architecture of big data.

By taking inspiration from layered architecture, different available tools in the market are mapped to layers to understand big data technology landscape in depth. Note that, layered architecture fits very well with infrastructure elements/building blocks discussed in the above section.

 Framework and technology landscape

Few of the tools are briefed below for further understanding:  

1. Data Collection / Ingestion Layer 

  • Cassandra: is a free and open-source, distributed, wide column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure
  • Kafka: is used for building real-time data pipelines and streaming apps. Event streaming platform
  • Flume: log collector in Hadoop
  • HBase: columnar database in Hadoop

2. Processing Layer 

  • Pig: scripting language in the Hadoop framework
  • MapReduce: processing language in Hadoop

3. Data Query Layer 

  • Impala: Cloudera Impala:  modern, open source, distributed SQL query engine for Apache Hadoop. (often compared with hive)
  • Hive: Data Warehouse software for data Query and analysis
  • Presto: Presto is a high performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Apache Kafka, and MongoDB

4. Analytical Engine

  • TensorFlow: n source machine learning library for research and production.

5. Data storage Layer

  • Ignite: open-source distributed database, caching and processing platform designed to store and compute on large volumes of data across a cluster of nodes
  • Phoenix: hortonworks: Apache Phoenix is an open source, massively parallel, relational database engine supporting OLTP for Hadoop using Apache HBase as its backing store
  • PolyBase: s a new feature in SQL Server 2016. It is used to query relational and non-relational databases (NoSQL). You can use PolyBase to query tables and files in Hadoop or in Azure Blob Storage. You can also import or export data to/from Hadoop.
  • Sqoop: ETL tool
  • Big data in EXCEL: Few people like to process big datasets with current excel capabilities and it's known as Big Data in Excel

6. Data Visualization Layer

  • Microsoft HDInsight: Azure HDInsight is a Hadoop service offering hosted in Azure that enables clusters of managed Hadoop instances. Azure HDInsight deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data with high reliability and availability. Hadoop administration training will give you all the technical understanding required to manage a Hadoop cluster, either in a development or a production environment.

Best Practices in Big Data  

Every organization, industry, business, may it be small or big wants to get benefit out of “big data” but it's essential to understand that it can prove of maximum potential only if organization adhere to best practices before adapting big data:

Answering 5 basic questions help clients know the need for adapting Big Data for organization

  1. Try to answer why Big Data is required for the organization. What problem would it help solve?
  2. Ask the right questions.
  3. Foster collaboration between business and technology teams.
  4. Analyze only what is required to use.
  5. Start small and grow incrementally.

Big Data industry use-cases 

We talked about all the things in the Big Data world except real use cases of big data. In the starting, we did discuss few but let me give you insights into the real world and interesting big data use cases and for a few, it’s no longer a secret ☺. In fact, it’s penetrating to the extent you name the industry and plenty of use cases can be told. Let’s begin.

1. Streaming Platforms

As I had given an example of ‘House of Cards’ at the start of the article, it’s not a secret that Netflix uses Big Data analytics. Netflix spent $100mn on 26 episodes of ‘House of Cards’ as they knew the show would appeal to viewers of original British House of Cards and built in director David Fincher and actor Kevin Spacey. Netflix typically collects behavioral data and it then uses this data to create a better experience for the user.

But Netflix uses Big Data for more than that, they monitor and analyze traffic details for various devices, spot problem areas and adjust network infrastructure to prepare for future demand. (later is action out of Big Data analytics, how big data analysis is put to use). They also try to get insights into types of content viewers to prefer and help them make informed decisions.   

Streaming Platforms

Apart from Netflix, Spotify is also a known great use case.

2. Advertising and Media / Campaigning /Entertainment

For decades marketers were forced to launch campaigns while blindly relying on gut instinct and hoping for the best. That all changed with digitization and big data world. Nowadays, data-driven campaigns and marketing is on the rise and to be successful in this landscape, a modern marketing campaign must integrate a range of intelligent approaches to identify customers, segment, measure results, analyze data and build upon feedback in real time. All needs to be done in real time, along with the customer’s profile and history, based on his purchasing patterns and other relevant information and Big Data solutions are the perfect fit.

Event-driven marketing is also could be achieved through big data, which is another way of successful marketing in today’s world. That basically indicates, keeping track of events customer are directly and indirectly involved with and campaign exactly when a customer would need it rather than random campaigns. For. Ex if you have searched for a product on Amazon/Flipkart, you would see related advertisements on other social media apps you casually browse through. Bang on, you would end up purchasing it as you anyway needed options best to choose from.

Advertising and Media

3. Healthcare Industry

Healthcare is one of the classic use case industries for Big Data applications. The industry generates a huge amount of data.

Patients medical history, past records, treatments given, available and latest medicines, Medicinal latest available research the list of raw data is endless.

All this data can help give insights and Big Data can contribute to the industry in the following ways:

  1. Diagnosis time could be reduced, and exact requirement treatment could be started immediately. Most of the illnesses could be treated if a diagnosis is perfect and treatment can be started in time. This can be achieved through evidence-based past medical data available for similar treatments to doctor treating the illness, patients’ available history and feeding symptoms real-time into the system.  
  2. Government Health department can monitor if a bunch of people from geography reporting of similar symptoms, predictive measures could be taken in nearby locations to avoid outbreak as a cause for such illness could be the same.   

The list is long, above were few representative examples.

4. Security

Due to social media outbreak, today, personal information is at stake. Almost everything is digital, and majority personal information is available in the public domain and hence privacy and security are major concerns with the rise in social media. Following are few such applications for big data.

  1. Cyber Crimes are common nowadays and big data can help to detect, predicting crimes.
  2. Threat analysis and detection could be done with big data.  

5. Travel and Tourism

Flight booking sites, IRCTC track the clicks and hits along with IP address, login information, and other details and as per demand can do dynamic pricing for the flights/ trains. Big Data helps in dynamic pricing and mind you it’s real time. Am sure each one of us has experienced this. Now you know who is doing it :D

Telecommunications, Public sector, Education, Social media and gaming, Energy and utility every industry have implemented are implementing several of these Big Data use cases day in and day out. If you look around am sure you would find them on the rise.

Big Data is helping everyone industries, consumers, clients to make informed decisions, whatever it may be and hence wherever there is such a need, Big Data can come handy.

Challenges faced by Big Data in the real world for adaptation

Challenges faced by Big Data in the real world for adaptation

Although the world is going gaga about big data, there are still a few challenges to implement and adopt Big Data and hence service industries are still striving towards resolving those challenges to implement best Big Data solution without flaws.

An October 2016 report from Gartner found that organizations were getting stuck at the pilot stage of their big data initiatives. "Only 15 percent of businesses reported deploying their big data project to production, effectively unchanged from last year (14 per cent)," the firm said.

Let’s discuss a few of them to understand what are they?

1. Understanding Big Data and answering Why for the organization one is working with.

As I started the article saying there are many versions of Big Data and understanding real use cases for organization decision makers are working with is still a challenge. Everyone wants to ride on a wave but not knowing the right path is still a struggle. As every organization is unique thus its utmost important to answer ‘why big data’ for each organization. This remains a major challenge for decision makers to adapt to big data.

2. Understanding Data sources for the organization

In today’s world, there are hundreds and thousands of ways information is being generated and being aware of all these sources and ingest all of them into big data platforms to get accurate insight is essential. Identifying sources is a challenge to address.

It's no surprise, then, that the IDG report found, "Managing unstructured data is growing as a challenge – rising from 31 per cent in 2015 to 45 per cent in 2016."

Different tools and technologies are on the rise to address this challenge.

3. Shortage if Big Data Talent and retaining them

Big Data is changing technology and there are a whopping number of tools in the Big Data technology landscape. It is demanded out of Big Data professionals to excel in those current tools and keep up self to ever-changing needs. This gets difficult for employees and employers to create and retain talent within the organization.

The solution to this would be constant upskilling, re-skilling and cross-skilling and increasing budget of organization for retaining talent and help them train.

4. The Veracity V

This V is a challenge as this V means inconsistent, incomplete data processing. To gain insights through big data model, the biggest step is to predict and fill missing information.

This is a tricky part as filling missing information can lead to decreasing accuracy of insights/ analytics etc.

To address this concern, there is a bunch of tools. Data curation is an important step in big data and should have a proper model. But also, to keep in mind that Big Data is never 100% accurate and one must deal with it.

5. Security

This aspect is given low priority during the design and build phases of Big Data implementations and security loopholes can cost an organization and hence it’s essential to put security first while designing and developing Big Data solutions. Also, equally important to act responsibly for implementations for regulatory requirements like GDPR.  

6. Gaining Valuable Insights

Machine learning data models go through multiple iterations to conclude on insights as they also face issues like missing data and hence the accuracy. To increase accuracy, lots of re-processing is required, which has its own lifecycle. Increasing accuracy of insights is a challenge and which relates to missing data piece. Which most likely can be addressed by addressing missing data challenge.

This can also be caused due to unavailability of information from all data sources. Incomplete information would lead to incomplete insights which may not benefit to required potential.

Addressing these discussed challenges would help to gain valuable insights through available solutions.

With Big Data, the opportunities are endless. Once understood, the world is yours!!!!

Also, now that you understand BIG DATA, it's worth understanding the next steps:

Gary King, who is a professor at Harvard says “Big data is not about the data. It is about the analytics”

You can also take up Big Data and Hadoop training to enhance your skills furthermore.

Did the article helps you to understand today’s massive world of big data and getting a sneak peek into it Do let us know through the comment section below?

Shruti

Shruti Deshpande

Blog Author

10+ years of data-rich experience in the IT industry. It started with data warehousing technologies into data modelling to BI application Architect and solution architect.


Big Data enthusiast and data analytics is my personal interest. I do believe it has endless opportunities and potential to make the world a sustainable place. Happy to ride on this tide.


*Disclaimer* - Expressed views are the personal views of the author and are not to be mistaken for the employer or any other organization’s views.

Join the Discussion

Your email address will not be published. Required fields are marked *

2 comments

shivkumar 01 Jun 2019 1 likes

Thanks for sharing this amazing blog.It is really an informative post.

Nisha 18 Jun 2019 1 likes

The article looks good and the way of presentation is nice.

Suggested Blogs

Why a Career in Big Data Is the Right Choice for You?

Are you in that job market where the Big Data skills are more appreciated? Confused about whether to make a career shift in Big Data or not? What will be the next career options available for me after Big Data? Just spend some time reading this blog and know the answers to all these questions and the reasons for making Big Data as a career choice.  “Big data is at the foundation of all of the megatrends that are happening today, from social to mobile to the cloud to gaming.” – Chris LynchReasons to Must-Have Big Data in your career1. Increased Job Opportunities for Big Data professionalsWith the technology reaching greater heights, undoubtedly Big Data is becoming a buzz word and a growing need for the organizations in the upcoming years. But, as Jeanne Harris, a senior executive at Accenture Institute said- “Data is useless without the skill to analyze it.”Today, Big Data professionals have a soaring demand across organizations worldwide. Organizations are making huge use of Big Data to stay ahead of the competitive market. The candidates with Big Data skills and expertise are in high demand. According to IBM, the number of jobs for data professionals in the U.S will increase to 2,720,000 by 2020.2. Salary GrowthThe strong demand for Big Data professionals is affecting the wages for qualified professionals. According to Glassdoor, the salary provided by various organizations based on the employees working in these organizations in the US region are as follows:CompanySalaryJ.P. Morgan$93K – $100KCognizant Technology Solutions$92K – $98KCSAA Insurance Group$133K – $144KZipRecruiter$81K – $89KThe salary of Big Data professionals is directly proportional to the factors like the skills earned, education, experience in the domain, knowledge of technology, etc. Also, one needs to understand and solve the real-world Big Data problems and a good grasp of tools and technologies.   3. Massive Big Data adoptionForbes stated that- Big data adoption in enterprises is increased from 17% in 2015 to 59% in 2018, reaching a Compound Annual Growth Rate (CAGR) of 36%. Big Data is steadily spreading its wings across numerous sectors including sales, marketing, research and development, logistics,  strategic management, etc.According to the 'Peer Research – Big Data Analytics' survey by Intel, the decision has incurred that- Big Data is one of the top priorities of the enterprises taking part in the survey as they believe that it improves the performance of their organizations. From the survey, it is found that 45% of the respondents trust that Big Data will offer more business benefits to rank on the top of the Big data market.    “Bigiota Insight out forecasted that the Big Data market is expected to grow to $80 billion from current $40 billion making a revenue of $187 billion.”4. Various options in job titles and responsibilitiesBig Data professionals have an array of job titles open depending on the skills they have achieved so far. The options for the Big Data job aspirants are many where they are free to align their career paths based on their career interests. Some of the job roles Big Data professionals can play are as follows:Data EngineerBusiness Analyst,Visualization SpecialistMachine Learning ExpertAnalytics ConsultantSolution ArchitectBig Data Solution ArchitectBig Data Analyst5. Usage Across numerous firms/industriesToday, Big Data is used almost in every firm. The top 5 industries recruiting Big Data professionals widely are Professional, Scientific and Technical Services (27%), Information Technology (19%), Manufacturing (15%), Finance and Insurance (9%), Retail Trade (9%) and Others 21%.The career path of a Big Data professionalAlthough the term Big Data is used commonly nowadays, there are many career paths available for the Big Data professionals to stand out in the industries that can be explored as per one’s potentiality and interest. The career paths that Big Data professionals can play are:Data ScientistBig Data EngineerBig Data AnalystData Visualization DeveloperMachine Learning EngineerBusiness Intelligence EngineerBusiness Analytics SpecialistMachine Learning ScientistLet us see them in details:1. Data Scientist:This is the most sought-after career path in Big Data careers. The Data Scientists are the individuals who use their technical and analytical skills to extract meaning from data. They are responsible for collecting, cleaning, and manipulating data.2. Big Data Engineer:Big Data Engineer is a well-known and more demanding career option. Data Engineers are the professionals responsible for building the designs created by Solution Architects. They are responsible for developing, testing, managing, and maintaining the big data solutions in the enterprises.3. Big Data Analyst:Being a command on the big data technologies like Hadoop, Hive, Pig, etc. and analytics skills, Data Analyst finds out relevant information from the datasets. This is also most demanding in Big Data career.4. Data Visualization Developer:The data visualization developers have the responsibilities of designing, conceptualizing, developing the graphics or data visualization, and supporting the data visualization activities. They should have strong technical skills for implementing visualization using tools.5. Machine Learning Engineer:Today, Machine Learning has become a crucial part of Big Data. Being an expert in machine learning (Machine Learning Engineer) responsible for building the data analysis software to run the product code without human intervention.6. Business Intelligence Engineer:Business Intelligence Engineer is in more demand today as around 90 percent of IT professionals are planning to increase spending on BI tools, as stated in the Forbes report. BI engineers are responsible for managing the big data warehouses with the help of Big Data tools and solving complex issues related to Big Data.7. Business Analytics Specialist:Business Analytics Specialist is an expert in Business Analytics field who aids in developing the scripts to test scripts and carrying out testing. They are also responsible for taking up business research activities to analyze the issues for developing cost-effective solutions.8. Machine Learning Scientist:Machine Learning Scientist work most probably in the research and development department. They are responsible for developing the algorithms to use in adaptive systems, adding product suggestions, and forecasting the demand for the same.Conclusion:As per Entrepreneur, Businesses that use Big Data saw a profit increase from 8 to10 percent and almost 10% reduction in overall cost. Another survey from Forbes states that IBM predicts demand For Data Scientists will reach 28% by the year 2020. As the data pours in, many high-rated companies like Google, Apple, NetApp, Qualcomm, Intuit, FactSet, The MITRE Corporation, Adobe, Salesforce, and so on are investing in Big Data.   According to the most recent McKinsey report, companies based in the U.S. are seeking for hiring 1.5 million Managers and Data Analysts with the strong knowledge and experience in Big Data. One can attain the most in-demand Big Data skills by taking specialized training in Big Data to go for any of the Big Data careers available in the job market.With the rising demand that industries are witnessing, it is an ideal time to add Big data skills to your curriculum vitae and offer yourself the wings to fly in the job market with the ample of Big Data jobs available today!  
Rated 4.5/5 based on 1 customer reviews
19970
Why a Career in Big Data Is the Right Choice for Y...

Are you in that job market where the Big Data skil... Read More

Apache Spark Pros and Cons

Apache Spark:  The New ‘King’ of Big DataApache Spark is a lightning-fast unified analytics engine for big data and machine learning. It is the largest open-source project in data processing. Since its release, it has met the enterprise’s expectations in a better way in regards to querying, data processing and moreover generating analytics reports in a better and faster way. Internet substations like Yahoo, Netflix, and eBay, etc have used Spark at large scale. Apache Spark is considered as the future of Big Data Platform.Pros and Cons of Apache SparkApache SparkAdvantagesDisadvantagesSpeedNo automatic optimization processEase of UseFile Management SystemAdvanced AnalyticsFewer AlgorithmsDynamic in NatureSmall Files IssueMultilingualWindow CriteriaApache Spark is powerfulDoesn’t suit for a multi-user environmentIncreased access to Big data-Demand for Spark Developers-Apache Spark has transformed the world of Big Data. It is the most active big data tool reshaping the big data market. This open-source distributed computing platform offers more powerful advantages than any other proprietary solutions. The diverse advantages of Apache Spark make it a very attractive big data framework. Apache Spark has huge potential to contribute to the big data-related business in the industry. Let’s now have a look at some of the common benefits of Apache Spark:Benefits of Apache Spark:SpeedEase of UseAdvanced AnalyticsDynamic in NatureMultilingualApache Spark is powerfulIncreased access to Big dataDemand for Spark DevelopersOpen-source community1. Speed:When comes to Big Data, processing speed always matters. Apache Spark is wildly popular with data scientists because of its speed. Spark is 100x faster than Hadoop for large scale data processing. Apache Spark uses in-memory(RAM) computing system whereas Hadoop uses local memory space to store data. Spark can handle multiple petabytes of clustered data of more than 8000 nodes at a time. 2. Ease of Use:Apache Spark carries easy-to-use APIs for operating on large datasets. It offers over 80 high-level operators that make it easy to build parallel apps.The below pictorial representation will help you understand the importance of Apache Spark.3. Advanced Analytics:Spark not only supports ‘MAP’ and ‘reduce’. It also supports Machine learning (ML), Graph algorithms, Streaming data, SQL queries, etc.4. Dynamic in Nature:With Apache Spark, you can easily develop parallel applications. Spark offers you over 80 high-level operators.5. Multilingual:Apache Spark supports many languages for code writing such as Python, Java, Scala, etc.6. Apache Spark is powerful:Apache Spark can handle many analytics challenges because of its low-latency in-memory data processing capability. It has well-built libraries for graph analytics algorithms and machine learning.7. Increased access to Big data:Apache Spark is opening up various opportunities for big data and making As per the recent survey conducted by IBM’s announced that it will educate more than 1 million data engineers and data scientists on Apache Spark. 8. Demand for Spark Developers:Apache Spark not only benefits your organization but you as well. Spark developers are so in-demand that companies offering attractive benefits and providing flexible work timings just to hire experts skilled in Apache Spark. As per PayScale the average salary for  Data Engineer with Apache Spark skills is $100,362. For people who want to make a career in the big data, technology can learn Apache Spark. You will find various ways to bridge the skills gap for getting data-related jobs, but the best way is to take formal training which will provide you hands-on work experience and also learn through hands-on projects.9. Open-source community:The best thing about Apache Spark is, it has a massive Open-source community behind it. Apache Spark is Great, but it’s not perfect - How?Apache Spark is a lightning-fast cluster computer computing technology designed for fast computation and also being widely used by industries. But on the other side, it also has some ugly aspects. Here are some challenges related to Apache Spark that developers face when working on Big data with Apache Spark.Let’s read out the following limitations of Apache Spark in detail so that you can make an informed decision whether this platform will be the right choice for your upcoming big data project.No automatic optimization processFile Management SystemFewer AlgorithmsSmall Files IssueWindow CriteriaDoesn’t suit for a multi-user environment1. No automatic optimization process:In the case of Apache Spark, you need to optimize the code manually since it doesn’t have any automatic code optimization process. This will turn into a disadvantage when all the other technologies and platforms are moving towards automation.2. File Management System:Apache Spark doesn’t come with its own file management system. It depends on some other platforms like Hadoop or other cloud-based platforms.3. Fewer Algorithms:There are fewer algorithms present in the case of Apache Spark Machine Learning Spark MLlib. It lags behind in terms of a number of available algorithms.4. Small Files Issue:One more reason to blame Apache Spark is the issue with small files. Developers come across issues of small files when using Apache Spark along with Hadoop. Hadoop Distributed File System (HDFS) provides a limited number of large files instead of a large number of small files.5. Window Criteria:Data in Apache Spark divides into small batches of a predefined time interval. So Apache won't support record-based window criteria. Rather, it offers time-based window criteria.6. Doesn’t suit for a multi-user environment:Yes, Apache Spark doesn’t fit for a multi-user environment. It is not capable of handling more users concurrency.Conclusion:To sum up, in light of the good, the bad and the ugly, Spark is a conquering tool when we view it from outside. We have seen a drastic change in the performance and decrease in the failures across various projects executed in Spark. Many applications are being moved to Spark for the efficiency it offers to developers. Using Apache Spark can give any business a boost and help foster its growth. It is sure that you will also have a bright future!
Rated 4.5/5 based on 19 customer reviews
8686
Apache Spark Pros and Cons

Apache Spark:  The New ‘King’ of Big DataApac... Read More

Apache Kafka Vs Apache Spark: Know the Differences

A new breed of ‘Fast Data’ architectures has evolved to be stream-oriented, where data is processed as it arrives, providing businesses with a competitive advantage. - Dean Wampler (Renowned author of many big data technology-related books)Dean Wampler makes an important point in one of his webinars. The demand for stream processing is increasing every day in today’s era. The main reason behind it is, processing only volumes of data is not sufficient but processing data at faster rates and making insights out of it in real time is very essential so that organization can react to changing business conditions in real time.And hence, there is a need to understand the concept “stream processing “and technology behind it. So, what is Stream Processing?Think of streaming as an unbounded, continuous real-time flow of records and processing these records in similar timeframe is stream processing. AWS (Amazon Web Services) defines “Streaming Data” is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously, and in small sizes (order of Kilobytes). This data needs to be processed sequentially and incrementally on a record-by-record basis or over sliding time windows and used for a wide variety of analytics including correlations, aggregations, filtering, and sampling.In stream processing method, continuous computation happens as the data flows through the system.Stream processing is highly beneficial if the events you wish to track are happening frequently and close together in time. It is also best to utilize if the event needs to be detected right away and responded to quickly.There is a subtle difference between stream processing, real-time processing (Rear real-time) and complex event processing (CEP). Let’s quickly look at the examples to understand the difference. Stream Processing: Stream processing is useful for tasks like fraud detection and cybersecurity. If transaction data is stream-processed, fraudulent transactions can be identified and stopped before they are even complete.Real-time Processing: If event time is very relevant and latencies in the second's range are completely unacceptable then it’s called Real-time (Rear real-time) processing. For ex. flight control system for space programsComplex Event Processing (CEP): CEP utilizes event-by-event processing and aggregation (for example, on potentially out-of-order events from a variety of sources, often with large numbers of rules or business logic).We have multiple tools available to accomplish above-mentioned Stream, Realtime or Complex event Processing. Spark Streaming, Kafka Stream, Flink, Storm, Akka, Structured streaming are to name a few. We will try to understand Spark streaming and Kafka stream in depth further in this article. As historically, these are occupying significant market share. Apache Kafka Stream: Kafka is actually a message broker with a really good performance so that all your data can flow through it before being redistributed to applications. Kafka works as a data pipeline.Typically, Kafka Stream supports per-second stream processing with millisecond latency.  Kafka Streams is a client library for processing and analyzing data stored in Kafka. Kafka streams can process data in 2 ways. Kafka -> Kafka: When Kafka Streams performs aggregations, filtering etc. and writes back the data to Kafka, it achieves amazing scalability, high availability, high throughput etc.  if configured correctly. It also does not do mini batching, which is “real streaming”.Kafka -> External Systems (‘Kafka -> Database’ or ‘Kafka -> Data science model’): Typically, any streaming library (Spark, Flink, NiFi etc) uses Kafka for a message broker. It would read the messages from Kafka and then break it into mini time windows to process it further. Representative view of Kafka streaming: Note:Sources here could be event logs, webpage events etc. etc. DB/Models would be accessed via any other streaming application, which in turn is using Kafka streams here. Kafka Streams is built upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple (yet efficient) management of application state. It is based on many concepts already contained in Kafka, such as scaling by partitioning.Also, for this reason, it comes as a lightweight library that can be integrated into an application.The application can then be operated as desired, as mentioned below: Standalone, in an application serverAs a Docker container, or Directly, via a resource manager such as Mesos.Why one will love using dedicated Apache Kafka Streams?Elastic, highly scalable, fault-tolerantDeploy to containers, VMs, bare metal, cloudEqually viable for small, medium, & large use casesFully integrated with Kafka securityWrite standard Java and Scala applicationsExactly-once processing semanticsNo separate processing cluster requiredDevelop on Mac, Linux, WindowsApache Spark Streaming:Spark Streaming receives live input data streams, it collects data for some time, builds RDD, divides the data into micro-batches, which are then processed by the Spark engine to generate the final stream of results in micro-batches. Following data flow diagram explains the working of Spark streaming. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs. Think about RDD as the underlying concept for distributing data over a cluster of computers. Why one will love using Apache Spark Streaming?It makes it very easy for developers to use a single framework to satisfy all the processing needs. They can use MLib (Spark's machine learning library) to train models offline and directly use them online for scoring live data in Spark Streaming. In fact, some models perform continuous, online learning, and scoring.Not all real-life use-cases need data to be processed at real real-time, few seconds delay is tolerated over having a unified framework like Spark Streaming and volumes of data processing. It provides a range of capabilities by integrating with other spark tools to do a variety of data processing.  Spark Streaming Vs Kafka StreamNow that we have understood high level what these tools mean, it’s obvious to have curiosity around differences between both the tools. Following table briefly explain you, key differences between the two. Sr.NoSpark streamingKafka Streams1Data received form live input data streams is Divided into Micro-batched for processing.processes per data stream(real real-time)2Separated processing Cluster is requriedNo separated processing cluster is requried.3Needs re-configuration for Scaling Scales easily by just adding java processes, No reconfiguration requried.4At least one semanticsExactly one semantics5Spark streaming is better at processing group of rows(groups,by,ml,window functions etc.)Kafka streams provides true a-record-at-a-time processing capabilities. it's better for functions like rows parsing, data cleansing etc.6Spark streaming is standalone framework.Kafka stream can be used as part of microservice,as it's just a library.Kafka streams Use-cases:Following are a couple of many industry Use cases where Kafka stream is being used: The New York Times: The New York Times uses Apache Kafka and Kafka Streams to store and distribute, in real-time, published content to the various applications and systems that make it available to the readers.Pinterest: Pinterest uses Apache Kafka and the Kafka Streams at large scale to power the real-time, predictive budgeting system of their advertising infrastructure. With Kafka Streams, spend predictions are more accurate than ever.Zalando: As the leading online fashion retailer in Europe, Zalando uses Kafka as an ESB (Enterprise Service Bus), which helps us in transitioning from a monolithic to a micro services architecture. Using Kafka for processing event streams enables our technical team to do near-real time business intelligence.Trivago: Trivago is a global hotel search platform. We are focused on reshaping the way travellers search for and compare hotels while enabling hotel advertisers to grow their businesses by providing access to a broad audience of travellers via our websites and apps. As of 2017, we offer access to approximately 1.8 million hotels and other accommodations in over 190 countries. We use Kafka, Kafka Connect, and Kafka Streams to enable our developers to access data freely in the company. Kafka Streams powers parts of our analytics pipeline and delivers endless options to explore and operate on the data sources we have at hand.Broadly, Kafka is suitable for microservices integration use cases and have wider flexibility.Spark Streaming Use-cases:Following are a couple of the many industries use-cases where spark streaming is being used: Booking.com: We are using Spark Streaming for building online Machine Learning (ML) features that are used in Booking.com for real-time prediction of behaviour and preferences of our users, demand for hotels and improve processes in customer support. Yelp: Yelp’s ad platform handles millions of ad requests every day. To generate ad metrics and analytics in real-time, they built the ad event tracking and analyzing pipeline on top of Spark Streaming. It allows Yelp to manage a large number of active ad campaigns and greatly reduce over-delivery. It also enables them to share ad metrics with advertisers in a timelier fashion.Spark Streaming’s ever-growing user base consists of household names like Uber, Netflix, and Pinterest.Broadly, spark streaming is suitable for requirements with batch processing for massive datasets, for bulk processing and have use-cases more than just data streaming. Dean Wampler explains factors to evaluation for tool basis Use-cases beautifully, as mentioned below: Sr.NoEvaluation CharacteristicResponse Time windowTypical Use Case Requirement1.Latency tolerancePico to Microseconds (Real Real time)Flight control system for space programs etc.Latency tolerance< 100 MicrosecondsRegular stock trading market transactions, Medical diagnostic equipment outputLatency tolerance< 10 millisecondsCredit cards verification window when consumer buy stuff onlineLatency tolerance< 100 millisecondshuman attention required Dashboards, Machine learning modelsLatency tolerance< 1 second to minutesMachine learning model trainingLatency tolerance1 minute and abovePeriodic short jobs(typical ETL applications)2.Evaluation CharacteristicTransaction/events frequencyTypical Use Case RequirementVelocity1M per secondNest Thermostat, Big spikes during specific time period.3Evaluation CharacteristicTypes of data processingNAData Processing Requirement1. SQLNA2. ETL3. Dataflow4. Training and/or Serving Machine learning modelsData Processing Requirement1. Bulk data processingNA2. Individual Events/Transaction processing4.Evaluation CharacteristicUse of toolNAFlexibility of implementation1. Kafka : flexible as provides library.NA2. Spark: Not flexible as it’s part of a distributed frameworkConclusionKafka Streams is still best used in a ‘Kafka -> Kafka’ context, while Spark Streaming could be used for a ‘Kafka -> Database’ or ‘Kafka -> Data science model’ type of context.Although, when these 2 technologies are connected, they bring complete data collection and processing capabilities together and are widely used in commercialized use cases and occupy significant market share. 
Rated 4.5/5 based on 19 customer reviews
9970
Apache Kafka Vs Apache Spark: Know the Differences

A new breed of ‘Fast Data’ architectures has e... Read More

Useful links

20% Discount