What is Big Data — An Introductory Guide

The massive world of Big Data

Stroll around any IT office and, every decade or so (nowadays more like every 3-4 years), you will overhear professionals discussing new jargon from the hottest trends in technology. Around 5-6 years ago, one such term, ‘Big Data’, started ruling IT services, and everyone from laymen to tech geeks still interprets it in different ways.

Although the services industry has been talking widely about big data solutions only for the past 5-6 years, the term is believed to have been in use since the 1990s, when John Mashey of Silicon Graphics used it. Credit for coining ‘big data’ in its modern sense goes to Roger Mougalas of O’Reilly Media in 2005.

Let’s first understand why everyone is going gaga about ‘Big Data’ and which real-world problems it is supposed to solve; then we will turn to the what and the how.

Why is Big Data essential for today’s digital world?

The internet and the web had been around for many years before the smartphone era, but smartphones made them mobile, with on-the-go usage. Social media and mobile apps started generating tons of data. At the same time, smart bands and other wearable devices (IoT, M2M) added new dimensions to data generation. This newly generated data became the new oil of the world: if it is stored and analyzed, it has the potential to yield tremendous insights that can be put to use in numerous ways.

You will be amazed by the real-world use cases of Big Data. Every industry has its own use cases, and they can even be unique to each client implementing a solution. They range from data-driven personalized campaigning (you see an item you browsed on some ‘xyz’ site show up while scrolling Facebook; ever wondered how?) to predictive maintenance of huge oil pipelines running across countries, where manual monitoring is practically impossible. To relate this to day-to-day life: every click, every swipe, every share, and every like we casually make on social media helps today’s industries take calculated business decisions. How do you think Netflix predicted the success of ‘House of Cards’ and spent $100 million on it? Big data analytics is the simple answer.

The biggest challenge in the past was that the traditional methods used to store, curate, and analyze data could not cope with data from these newer sources: data that was huge in volume, came from heterogeneous sources, and was being generated really fast. (To give you an idea, roughly 2.5 quintillion bytes of data are generated per day as of today; refer to the infographic released by Domo called “Data Never Sleeps 5.0.”) This gave rise to the term ‘Big Data’ and the related solutions.

Understanding Big Data: Experts’ viewpoint 

Big Data literally means massive data (loosely, more than 1 TB), but that is not the only aspect of it. Distributed data, or even complex datasets that cannot be analyzed through traditional methods, can also be categorized as ‘Big Data’. With this background, the theoretical definition of Big Data makes a lot of sense.

Gartner (2012) defines it as follows: “Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”

The characteristics that define big data are commonly summarized as the 3 Vs: Variety, Velocity, and Volume.

But due to the changing nature of data in today’s world, and to gain the most insight from it, 3 more Vs have been added to the definition of Big Data: Variability, Veracity, and Value.

The diagram below illustrates each V in detail:

Diagram: 6 V’s of Big Data

These 6 Vs help in understanding the characteristics of Big Data, but let’s also understand the types of data involved in Big Data processing. The “Variety” characteristic above refers to the different types of data that big data tools and technologies can process. Let’s drill down a bit to understand what those are:

  1. Structured, e.g. data from mainframes and traditional databases such as Teradata, Netezza, and Oracle.
  2. Unstructured, e.g. tweets, Facebook posts, emails, etc.
  3. Semi-structured, multi-structured, or hybrid, e.g. e-commerce, demographic, and weather data.
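
To make the three categories concrete, here is a minimal Python sketch that reads one small sample of each kind. The sample records, field names, and the hashtag rule are all made up for illustration; real sources would be far larger and messier.

```python
import csv
import io
import json
import re

# Structured: fixed columns, as in a relational table export (hypothetical rows).
structured_csv = "order_id,customer,amount\n1,alice,30.5\n2,bob,12.0\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))
total_amount = sum(float(row["amount"]) for row in rows)

# Semi-structured: self-describing records whose fields may vary (hypothetical weather data).
semi_structured = (
    '[{"city": "Pune", "temp_c": 31},'
    ' {"city": "Oslo", "temp_c": 4, "wind_kph": 20}]'
)
readings = json.loads(semi_structured)
cities_with_wind = [r["city"] for r in readings if "wind_kph" in r]

# Unstructured: free text; any structure has to be inferred, here with a simple regex.
unstructured = "Loving the new phone! #bigdata #analytics are everywhere"
hashtags = re.findall(r"#(\w+)", unstructured)

print(total_amount, cities_with_wind, hashtags)
```

Notice that the structured sample can be queried by column name straight away, the semi-structured one needs a check for optional fields, and the unstructured one needs a rule to pull out anything usable.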

As technology advances, a wider variety of data becomes available, and big data makes its storage, processing, and analysis possible. Traditional data processing techniques were able to process only structured data.

Now that we understand what big data is and what the limitations of the old traditional techniques are in handling it, we can safely say that we need new technology to handle this data and gain insights from it. Before going further, do you know what the traditional data management techniques were?

The traditional data processing techniques are:

  1. RDBMS (Relational Database Management System)
  2. Data warehousing and data marts

At a high level, the RDBMS catered to OLTP (online transaction processing) needs, while data warehouses and data marts served OLAP (online analytical processing) needs. But both kinds of system work only with structured data.
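
The difference between the two workloads can be sketched with Python's built-in `sqlite3` module. This is only an illustration under simple assumptions: the `sales` table and its rows are hypothetical, and a real deployment would use separate systems for the transactional and analytical sides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP-style work: small individual writes committed together as one transaction.
with conn:
    conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", ("north", 120.0))
    conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", ("south", 80.0))
    conn.execute("INSERT INTO sales (region, amount) VALUES (?, ?)", ("north", 50.0))

# OLAP-style work: a read-only aggregate that scans the whole table.
totals_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region")
)
print(totals_by_region)
```

Both queries rely on a fixed schema, which is exactly the structured-data assumption that big data tools relax.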

I hope you can now answer ‘what is big data?’ both conceptually and theoretically.

So it’s time to understand how it is done in actual implementations.

Merely storing “big data” will not help organizations; what matters is turning data into insights and business value. To do so, the following are the key infrastructure elements:

  • Data collection
  • Data storage
  • Data analysis
  • Data visualization/output

All major big data processing frameworks are based on these building blocks.
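
As a rough sketch of how the four building blocks chain together, the Python example below runs a tiny hard-coded dataset through one function per stage. The function names and the click events are hypothetical, and each stage is a stand-in for what would be a dedicated tool in a real platform.

```python
import json
from collections import Counter


def collect():
    """Data collection: yield raw events (hard-coded, hypothetical click events)."""
    yield from [
        {"user": "u1", "page": "home"},
        {"user": "u2", "page": "pricing"},
        {"user": "u1", "page": "pricing"},
    ]


def store(events):
    """Data storage: serialise events as JSON lines, standing in for a real store."""
    return [json.dumps(event) for event in events]


def analyze(stored_lines):
    """Data analysis: count views per page."""
    return Counter(json.loads(line)["page"] for line in stored_lines)


def visualize(page_counts):
    """Data visualization/output: render a plain-text bar chart, one line per page."""
    return [f"{page:<8} {'#' * count}" for page, count in sorted(page_counts.items())]


report = visualize(analyze(store(collect())))
print("\n".join(report))
```

Keeping each stage behind its own function mirrors how the layers can be swapped independently, for example by replacing the in-memory list with a distributed store.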

In line with the above building blocks, the following are the top 5 big data processing frameworks currently in use in the market:

1. Apache Hadoop: the Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is the all-time classic and one of the top frameworks in use today, so prevalent that it has almost become synonymous with Big Data.

2. Apache Spark: a unified analytics engine for large-scale data processing.

Apache Spark and Hadoop are often contrasted as an "either/or" choice, but that isn't really the case: Spark can run on Hadoop's YARN and read data from HDFS, so the two are often used together.

These two frameworks are the most popular, but the following 3 are also available and comparable:

3. Apache Storm: a free and open source distributed real-time computation system. You can also take up Apache Storm training to learn more about it.

4. Apache Flink: a streaming dataflow engine that aims to provide facilities for distributed computation over streams of data. By treating batch processing as a special case of streaming, Flink is effectively both a batch and a real-time processing framework, but one that clearly puts streaming first.
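
The idea of batch as a special case of streaming can be illustrated with plain Python generators. This sketch does not use Flink's API; `running_total` and `batch_total` are hypothetical names, and the point is only that a batch job is the same operator applied to a stream that happens to end.

```python
from typing import Iterable, Iterator


def running_total(stream: Iterable[int]) -> Iterator[int]:
    """Streaming operator: emit an updated total after every incoming element."""
    total = 0
    for value in stream:
        total += value
        yield total


def batch_total(dataset: Iterable[int]) -> int:
    """Batch job: run the same operator over a bounded stream, keep only the last result."""
    result = 0
    for result in running_total(dataset):
        pass
    return result


# Streaming view: a result is available after each event.
print(list(running_total([1, 2, 3])))
# Batch view: the same logic, but only the final answer is reported.
print(batch_total([1, 2, 3]))
```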

5. Apache Samza: a distributed stream processing framework.

Frameworks help process data through the building blocks and generate the required insights. Each framework is supported by a whopping number of tools that provide the required functionality.

Big Data processing frameworks and technology landscape

The big data tools and technology landscape can be better understood through a layered big data architecture. For an understanding of that layered architecture, read the great article by Navdeep Singh Gill on XENONSTACK.

Taking inspiration from the layered architecture, the tools available in the market are mapped to layers to help understand the big data technology landscape in depth. Note that the layered architecture fits very well with the infrastructure elements (building blocks) discussed in the section above.

Diagram: Framework and technology landscape

A few of the tools are described briefly below:

1. Data Collection / Ingestion Layer 

  • Cassandra: a free and open-source, distributed, wide-column NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure
  • Kafka: an event streaming platform used for building real-time data pipelines and streaming apps
  • Flume: a log collector in the Hadoop ecosystem
  • HBase: a column-oriented database in the Hadoop ecosystem

2. Processing Layer 

  • Pig: a scripting platform in the Hadoop framework (scripts are written in Pig Latin)
  • MapReduce: the programming model and processing engine in Hadoop
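
To give a feel for the MapReduce programming model, here is a minimal single-machine word count in plain Python. It does not use Hadoop's API; the function names are hypothetical, and the `shuffle` step stands in for the grouping that the framework performs between the map and reduce phases across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each map call looks at one document and each reduce call at one key, both phases can be spread over many machines, which is what makes the model suit distributed processing.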

3. Data Query Layer 

  • Impala: Cloudera Impala is a modern, open source, distributed SQL query engine for Apache Hadoop (often compared with Hive)
  • Hive: data warehouse software for data query and analysis
  • Presto: a high-performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Apache Kafka, and MongoDB

4. Analytical Engine

  • TensorFlow: an open source machine learning library for research and production.

5. Data storage Layer

  • Ignite: an open-source distributed database, caching, and processing platform designed to store and compute on large volumes of data across a cluster of nodes
  • Phoenix: Apache Phoenix is an open source, massively parallel, relational database engine supporting OLTP for Hadoop, using Apache HBase as its backing store
  • PolyBase: a feature introduced in SQL Server 2016. It is used to query relational and non-relational (NoSQL) databases. You can use PolyBase to query tables and files in Hadoop or in Azure Blob Storage. You can also import or export data to and from Hadoop.
  • Sqoop: an ETL tool for transferring bulk data between Hadoop and relational databases
  • Big Data in Excel: some people prefer to process large datasets with Excel's current capabilities; this is known as Big Data in Excel

6. Data Visualization Layer

  • Microsoft HDInsight: Azure HDInsight is a Hadoop service hosted in Azure that provides clusters of managed Hadoop instances. It deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze, and report on big data with high reliability and availability. Hadoop administration training will give you the technical understanding required to manage a Hadoop cluster in either a development or a production environment.

Best Practices in Big Data  

Every organization, industry, and business, small or big, wants to benefit from “big data”, but it is essential to understand that it delivers its maximum potential only if the organization adheres to best practices before adopting it:

Working through the following 5 points helps clients understand whether the organization needs to adopt Big Data:

  1. Try to answer why Big Data is required for the organization. What problem would it help solve?
  2. Ask the right questions.
  3. Foster collaboration between business and technology teams.
  4. Analyze only the data you actually need to use.
  5. Start small and grow incrementally.

Big Data industry use-cases 

We have talked about everything in the Big Data world except real use cases. We did discuss a few at the start, but let me now give you some insight into interesting real-world big data use cases; for a few of them, it’s no longer a secret ☺. In fact, Big Data has penetrated so far that you can name any industry and plenty of use cases can be cited. Let’s begin.

1. Streaming Platforms

As in the ‘House of Cards’ example at the start of the article, it’s no secret that Netflix uses Big Data analytics. Netflix spent $100 million on 26 episodes of ‘House of Cards’ because it knew the show would appeal to viewers of the original British House of Cards and to fans of director David Fincher and actor Kevin Spacey. Netflix typically collects behavioral data and then uses it to create a better experience for the user.

But Netflix uses Big Data for more than that: it monitors and analyzes traffic details for various devices, spots problem areas, and adjusts its network infrastructure to prepare for future demand. (The latter is an action taken on the back of Big Data analytics, an example of how the analysis is put to use.) It also tries to gain insight into the types of content viewers prefer, which helps it make informed decisions.

Apart from Netflix, Spotify is another well-known example.

2. Advertising and Media / Campaigning / Entertainment

For decades, marketers had to launch campaigns relying blindly on gut instinct and hoping for the best. That changed with digitization and big data. Data-driven campaigns and marketing are on the rise, and to succeed in this landscape a modern marketing campaign must integrate a range of intelligent approaches to identify and segment customers, measure results, analyze data and build on feedback. All of this needs to happen in real time, drawing on the customer’s profile, history, purchasing patterns and other relevant information, and Big Data solutions are a good fit.

Event-driven marketing, another successful approach today, can also be achieved through big data. It means keeping track of the events a customer is directly or indirectly involved in and campaigning exactly when the customer needs it, rather than running random campaigns. For example, if you have searched for a product on Amazon or Flipkart, you will see related advertisements on the social media apps you casually browse. Bang on: you may well end up purchasing, since you needed good options to choose from anyway.

3. Healthcare Industry

Healthcare is one of the classic industries for Big Data applications, because it generates a huge amount of data.

Patients’ medical histories, past records, treatments given, available and newly released medicines, the latest medical research: the list of raw data is endless.

All this data can help give insights and Big Data can contribute to the industry in the following ways:

  1. Diagnosis time can be reduced so that the right treatment starts immediately. Most illnesses can be treated if the diagnosis is accurate and treatment begins in time. This becomes possible when the treating doctor has evidence-based data from similar past cases and the patient’s history, and symptoms are fed into the system in real time.
  2. Government health departments can monitor whether a group of people in one geography is reporting similar symptoms. Because the cause of the illness is likely to be the same, preventive measures can then be taken in nearby locations to avoid an outbreak.

The list is long; the above are just a few representative examples.
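The symptom-monitoring idea in the second point can be illustrated with a very small Python sketch. It only counts how often each (region, symptom) pair is reported and flags pairs that reach a threshold. The report data, the `flag_clusters` name and the threshold are all hypothetical and chosen for illustration; a real health surveillance system would be far more involved.

```python
from collections import Counter

# Hypothetical symptom reports as (region, symptom) pairs; illustrative data only.
reports = [
    ("north", "fever"), ("north", "fever"), ("north", "fever"),
    ("south", "cough"), ("north", "rash"), ("south", "fever"),
]

def flag_clusters(reports, threshold):
    """Return the (region, symptom) pairs reported at least `threshold` times."""
    counts = Counter(reports)
    return sorted(pair for pair, n in counts.items() if n >= threshold)

# Flag any symptom reported three or more times in the same region.
alerts = flag_clusters(reports, threshold=3)
print(alerts)
```

In practice the threshold would depend on population size and the normal baseline for each region, which this sketch ignores.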

4. Security

With the explosion of social media, personal information is at stake. Almost everything is digital and much personal information is available in the public domain, so privacy and security are major concerns. The following are a few applications of big data in this area.

  1. Cybercrime is common nowadays, and big data can help detect and predict it.
  2. Threat analysis and detection could be done with big data.  

5. Travel and Tourism

Flight booking sites and IRCTC track clicks and hits along with IP address, login information, and other details, and can price flights and trains dynamically according to demand. Big Data makes this dynamic pricing possible and, mind you, it happens in real time. I am sure each of us has experienced this. Now you know who is doing it :D
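To make the idea of demand-based pricing concrete, here is a toy Python sketch. It is not how any booking site actually prices tickets: the `dynamic_price` function, the 70/30 weighting, the search cap of 1000 and the 50% maximum markup are all assumptions made up for illustration.

```python
def dynamic_price(base_price, seats_left, total_seats, recent_searches,
                  max_markup=0.5):
    """Raise the price as seats fill up and search demand grows (toy rule)."""
    if total_seats <= 0:
        raise ValueError("total_seats must be positive")
    occupancy = 1 - seats_left / total_seats       # 0.0 (empty) to 1.0 (full)
    demand = min(recent_searches / 1000, 1.0)      # cap the demand signal at 1.0
    # Assumed weighting: occupancy matters more than search volume.
    markup = max_markup * (0.7 * occupancy + 0.3 * demand)
    return round(base_price * (1 + markup), 2)

# An empty flight with no searches stays at the base price;
# a nearly full, heavily searched flight is priced higher.
print(dynamic_price(100, seats_left=100, total_seats=100, recent_searches=0))
print(dynamic_price(100, seats_left=5, total_seats=100, recent_searches=800))
```

The point of the sketch is only that the inputs (clicks, searches, remaining inventory) are collected continuously, so the price can be recomputed on every request.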

Telecommunications, the public sector, education, social media and gaming, energy and utilities: every industry has implemented, or is implementing, several of these Big Data use cases. If you look around, I am sure you will find them on the rise.

Big Data is helping everyone (industries, consumers, clients) make informed decisions, so wherever there is such a need, Big Data can come in handy.

Challenges faced by Big Data in the real world for adaptation

Although the world is going gaga over big data, implementing and adopting it still poses a few challenges, and the services industry is still working to resolve them so that it can deliver Big Data solutions without flaws.

An October 2016 report from Gartner found that organizations were getting stuck at the pilot stage of their big data initiatives. "Only 15 percent of businesses reported deploying their big data project to production, effectively unchanged from last year (14 per cent)," the firm said.

Let’s discuss a few of these challenges.

1. Understanding Big Data and answering Why for the organization one is working with.

As I said at the start of the article, Big Data is interpreted in many ways, and decision makers still find it hard to understand the real use cases for their own organization. Everyone wants to ride the wave, but not knowing the right path is still a struggle. Because every organization is unique, it is of utmost importance to answer ‘why big data’ for each one. This remains a major challenge for decision makers adopting big data.

2. Understanding Data sources for the organization

Today information is generated in hundreds or thousands of ways. To get accurate insights, an organization must be aware of all these sources and ingest them into its big data platform. Identifying the sources is itself a challenge.

It's no surprise, then, that the IDG report found, "Managing unstructured data is growing as a challenge – rising from 31 per cent in 2015 to 45 per cent in 2016."

Different tools and technologies are on the rise to address this challenge.

3. Shortage of Big Data talent, and retaining it

Big Data is a fast-changing field with a whopping number of tools in its technology landscape. Big Data professionals are expected to excel in the current tools and keep up with ever-changing needs. This makes it difficult for employers to build and retain talent within the organization, and for employees to stay current.

The solution is constant upskilling, re-skilling and cross-skilling, along with a larger organizational budget for retaining people and helping them train.

4. The Veracity V

Veracity is a challenge because it concerns processing data that is inconsistent or incomplete. To gain insights from a big data model, a major step is to predict and fill in missing information.

This is the tricky part, as filling in missing information can reduce the accuracy of the resulting insights and analytics.

A number of tools address this concern. Data curation is an important step in big data and should follow a proper model. Keep in mind, though, that Big Data is never 100% accurate, and one must deal with that.
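As a minimal sketch of what filling in missing information can look like, the Python snippet below replaces missing values with the mean of the observed ones and keeps a flag for every imputed entry. Mean imputation is only one simple technique, picked here for illustration; the `impute_mean` name and the sample readings are hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Fill None entries with the mean of the observed values.

    Returns the filled list plus a parallel list of flags marking which
    entries were imputed, so later analysis can treat them with care.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    filled = [fill if v is None else v for v in values]
    imputed = [v is None for v in values]
    return filled, imputed

# Hypothetical sensor readings with two gaps.
readings = [10.0, None, 14.0, None, 12.0]
filled, imputed = impute_mean(readings)
print(filled)
print(imputed)
print(f"{sum(imputed) / len(imputed):.0%} of values were imputed")
```

Keeping the flags is a deliberate choice: the share of imputed values gives a rough sense of how much the filled-in data can be trusted, which is exactly the accuracy concern raised above.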

5. Security

Security is often given low priority during the design and build phases of Big Data implementations, yet security loopholes can cost an organization dearly. It is therefore essential to put security first while designing and developing Big Data solutions. It is equally important to implement regulatory requirements such as GDPR responsibly.

6. Gaining Valuable Insights

Machine learning models go through multiple iterations before arriving at insights, because they too face issues such as missing data, which hurts accuracy. Increasing accuracy requires a lot of re-processing, which has its own lifecycle. Improving the accuracy of insights is therefore a challenge that is closely tied to the missing-data problem, and addressing that problem is the most likely way to address this one.

Inaccurate insights can also result when information is not available from all data sources. Incomplete information leads to incomplete insights, which may not deliver the expected benefit.

Addressing the challenges discussed above helps in gaining valuable insights from the available solutions.

With Big Data, the opportunities are endless. Once understood, the world is yours!!!!

Also, now that you understand BIG DATA, it's worth understanding the next steps:

Gary King, a professor at Harvard, says: “Big data is not about the data. It is about the analytics.”

You can also take up Big Data and Hadoop training to enhance your skills furthermore.

Did this article help you understand today’s massive world of big data and give you a sneak peek into it? Do let us know in the comment section below.

Shruti Deshpande

Blog Author

10+ years of data-rich experience in the IT industry, starting with data warehousing technologies and moving through data modelling to BI application architect and solution architect roles.


I am a Big Data enthusiast, and data analytics is my personal interest. I believe it has endless opportunities and the potential to make the world a sustainable place. Happy to ride this tide.


*Disclaimer* - Expressed views are the personal views of the author and are not to be mistaken for the employer or any other organization’s views.


Suggested Blogs

Why a Career in Big Data Is the Right Choice for You?

Are you in that job market where the Big Data skills are more appreciated? Confused about whether to make a career shift in Big Data or not? What will be the next career options available for me after Big Data? Just spend some time reading this blog and know the answers to all these questions and the reasons for making Big Data as a career choice.  “Big data is at the foundation of all of the megatrends that are happening today, from social to mobile to the cloud to gaming.” – Chris LynchReasons to Must-Have Big Data in your career1.Increased Job Opportunities for Big Data professionalsWith the technology reaching greater heights, undoubtedly Big Data is becoming a buzz word and a growing need for the organizations in the upcoming years. But, as Jeanne Harris, a senior executive at Accenture Institute said- “Data is useless without the skill to analyze it.”Today, Big Data professionals have a soaring demand across organizations worldwide. Organizations are making huge use of Big Data to stay ahead of the competitive market. The candidates with Big Data skills and expertise are in high demand. According to IBM, the number of jobs for data professionals in the U.S will increase to 2,720,000 by 2020.2. Salary GrowthThe strong demand for Big Data professionals is affecting the wages for qualified professionals. According to Glassdoor, the salary provided by various organizations based on the employees working in these organizations in the US region are as follows:CompanySalaryJ.P. Morgan$93K – $100KCognizant Technology Solutions$92K – $98KCSAA Insurance Group$133K – $144KZipRecruiter$81K – $89KThe salary of Big Data professionals is directly proportional to the factors like the skills earned, education, experience in the domain, knowledge of technology, etc. Also, one needs to understand and solve the real-world Big Data problems and a good grasp of tools and technologies.   3. 
Massive Big Data adoptionForbes stated that- Big data adoption in enterprises is increased from 17% in 2015 to 59% in 2018, reaching a Compound Annual Growth Rate (CAGR) of 36%. Big Data is steadily spreading its wings across numerous sectors including sales, marketing, research and development, logistics,  strategic management, etc.According to the 'Peer Research – Big Data Analytics' survey by Intel, the decision has incurred that- Big Data is one of the top priorities of the enterprises taking part in the survey as they believe that it improves the performance of their organizations. From the survey, it is found that 45% of the respondents trust that Big Data will offer more business benefits to rank on the top of the Big data market.    “Bigiota Insight out forecasted that the Big Data market is expected to grow to $80 billion from current $40 billion making a revenue of $187 billion.”4. Various options in job titles and responsibilitiesBig Data professionals have an array of job titles open depending on the skills they have achieved so far. The options for the Big Data job aspirants are many where they are free to align their career paths based on their career interests. Some of the job roles Big Data professionals can play are as follows:Data EngineerBusiness Analyst,Visualization SpecialistMachine Learning ExpertAnalytics ConsultantSolution ArchitectBig Data Solution ArchitectBig Data Analyst5. Usage Across numerous firms/industriesToday, Big Data is used almost in every firm. The top 5 industries recruiting Big Data professionals widely are Professional, Scientific and Technical Services (27%), Information Technology (19%), Manufacturing (15%), Finance and Insurance (9%), Retail Trade (9%) and Others 21%.The career path of a Big Data professionalAlthough the term Big Data is used commonly nowadays, there are many career paths available for the Big Data professionals to stand out in the industries that can be explored as per one’s potentiality and interest. 
The career paths that Big Data professionals can play are:Data ScientistBig Data EngineerBig Data AnalystData Visualization DeveloperMachine Learning EngineerBusiness Intelligence EngineerBusiness Analytics SpecialistMachine Learning ScientistLet us see them in details:Data Scientist:This is the most sought-after career path in Big Data careers. The Data Scientists are the individuals who use their technical and analytical skills to extract meaning from data. They are responsible for collecting, cleaning, and manipulating data.Big Data Engineer:Big Data Engineer is a well-known and more demanding career option. Data Engineers are the professionals responsible for building the designs created by Solution Architects. They are responsible for developing, testing, managing, and maintaining the big data solutions in the enterprises.Big Data Analyst:Being a command on the big data technologies like Hadoop, Hive, Pig, etc. and analytics skills, Data Analyst finds out relevant information from the datasets. This is also most demanding in Big Data career.Data Visualization Developer:The data visualization developers have the responsibilities of designing, conceptualizing, developing the graphics or data visualization, and supporting the data visualization activities. They should have strong technical skills for implementing visualization using tools.Machine Learning Engineer:Today, Machine Learning has become a crucial part of Big Data. Being an expert in machine learning (Machine Learning Engineer) responsible for building the data analysis software to run the product code without human intervention.Business Intelligence Engineer:Business Intelligence Engineer is in more demand today as around 90 percent of IT professionals are planning to increase spending on BI tools, as stated in the Forbes report. 
BI engineers are responsible for managing the big data warehouses with the help of Big Data tools and solving complex issues related to Big Data.Business Analytics Specialist:Business Analytics Specialist is an expert in Business Analytics field who aids in developing the scripts to test scripts and carrying out testing. They are also responsible for taking up business research activities to analyze the issues for developing cost-effective solutions.Machine Learning Scientist:Machine Learning Scientist work most probably in the research and development department. They are responsible for developing the algorithms to use in adaptive systems, adding product suggestions, and forecasting the demand for the same.Conclusion:As per Entrepreneur, Businesses that use Big Data saw a profit increase from 8 to10 percent and almost 10% reduction in overall cost. Another survey from Forbes states that IBM predicts demand For Data Scientists will reach 28% by the year 2020. As the data pours in, many high-rated companies like Google, Apple, NetApp, Qualcomm, Intuit, FactSet, The MITRE Corporation, Adobe, Salesforce, and so on are investing in Big Data.   According to the most recent McKinsey report, companies based in the U.S. are seeking for hiring 1.5 million Managers and Data Analysts with the strong knowledge and experience in Big Data. One can attain the most in-demand Big Data skills by taking specialized training in Big Data to go for any of the Big Data careers available in the job market.With the rising demand that industries are witnessing, it is an ideal time to add Big data skills to your curriculum vitae and offer yourself the wings to fly in the job market with the ample of Big Data jobs available today!  
Rated 4.5/5 based on 11 customer reviews
9881
Why a Career in Big Data Is the Right Choice for Y...

Are you in that job market where the Big Data skil... Read More

Apache Spark Use Cases & Applications

Apache Spark was developed by a team at UC Berkeley in 2009. Since then, Apache Spark has seen a very high adoption rate from top-notch technology companies like Google, Facebook, Apple, Netflix etc. The demand has been ever increasing day by day. According to marketanalysis.com survey, the Apache Spark market worldwide will grow at a CAGR of 67% between 2019 and 2022. The Spark market revenue is zooming fast and may grow up $4.2 billion by 2022, with a cumulative market valued at $9.2 billion (2019 - 2022).As per Apache, “Apache Spark is a unified analytics engine for large-scale data processing”.Spark is a cluster computing framework, somewhat similar to MapReduce but has a lot more capabilities, features, speed and provides APIs for developers in many languages like Scala, Python, Java and R. It is also friendly for database developers as it provides Spark SQL which supports most of the ANSI SQL functionality. Spark also has out of the box support for Machine learning and Graph processing using components called MLlib and GraphX respectively. Spark also has support for streaming data using Spark Streaming.Spark is developed in Scala programming language. Though the majority of use cases of Spark uses HDFS as the underlying data file storage layer, it is not mandatory to use HDFS. It does work with a variety of other Data sources like Cassandra, MySQL, AWS S3 etc. Apache Spark also comes with its default resource manager which might be good enough for the development environment and small size cluster, but it also integrates very well with YARN and Mesos. Most of the production-grade and large clusters use YARN and Mesos as the resource manager.Features of SparkSpeed: According to Apache, Spark can run applications on Hadoop cluster up to 100 times faster in memory and up to 10 times faster on disk. Spark is able to achieve such a speed by overcoming the drawback of MapReduce which always writes to disk for all intermediate results. 
Spark does not need to write intermediate results to disk and can work in memory using DAG, lazy evaluation, RDDs and caching. Spark has a highly optimized execution engine which makes it so fast. Fault Tolerance: Spark’s optimized execution engine not only makes it fast but is also fault tolerant. It achieves this using abstraction layer called RDD (Resilient Distributed Datasets) in combination with DAG, which is built to handle failures of tasks or even node failures. Lazy Evaluation: Spark works on lazy evaluation technique. This means that the processing(transformations) on Spark RDD/Datasets are evaluated in a lazy manner, i.e. the output RDDs/datasets are not available after transformation will be available only when needed i.e. when any action is performed. The transformations are just part of the DAG which gets executed when action is called.Multiple Language Support: Spark provides support for multiple programming languages like Scala, Java, Python, R and also Spark SQL which is very similar to SQL.Reusability: Spark code once written for batch processing jobs can also be utilized for writing processed on Stream processing and it can be used to join historical batch data and stream data on the fly.Machine Learning: MLlib is a Machine Learning library of Spark. which is available out of the box for creating ML pipelines for data analysis and predictive analytics alsoGraph Processing: Apache Spark also has Graph processing logic. Using GraphX APIs which is again provided out of the box one can write graph processing and do graph-parallel computation.Stream Processing and Structured Streaming: Spark can be used for batch processing and also has the capability to cater to stream processing use case with micro batches. Spark Streaming comes with Spark and one does not need to use any other streaming tools or APIs. Spark streaming also supports Structure Streaming. 
Spark streaming also has in-built connectors for Apache Kafka which comes very handy while developing Streaming applications.Spark SQL: Spark has an amazing SQL support and has an in-built SQL optimizer. Spark SQL features are used heavily in warehouses to build ETL pipelines.Spark is being used in more than 1000 organizations who have built huge clusters for batch processing, stream processing, building warehouses, building data analytics engine and also predictive analytics platforms using many of the above features of Spark. Let’s look at some of the use cases in a few of these organizations.What are the different Apache Spark applications?Streaming Data: Streaming is basically unstructured data produced by different types of data sources. The data sources could be anything like log files generated while customers using mobile apps or web applications, social media contents like tweets, facebook posts, telemetry from connected devices or instrumentation in data centres. The streaming data is usually unbounded and is being processed as received from the data source.Then there is Structured streaming which works on the principle of polling data in intervals and then this interval data is processed and appended or updated to the unbounded result table.Apache Spark has a framework for both i.e. Spark Streaming to handle Streaming using micro batches and DStreams and Structured Streaming using Datasets and Data frames.Let us try to understand Spark Streaming from an example.Suppose a big retail chain company wants to get a real-time dashboard to keep a close eye on its inventory and operations. Using this dashboard the management should be able to track how many products are being purchased, shipped and delivered to customers.Spark Streaming can be an ideal fit here.The order management system pushes the order status to the queue(could be Kafka) from where Streaming process reads every minute and picks all the orders with their status. 
Then Spark engine processes these and emits the output status count. Spark streaming process runs like a daemon until it is killed or error is encountered.Machine learning:As defined by Arthur Samuel in 1959, “Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed”. In 1997, Tom Mitchell gave a definition which is more specifically from an engineering perspective, “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”. ML solves complex problems that could not be solved with just mathematical numerical methods or means. ML is not supposed to make perfect guesses. In ML’s domain, there is no such thing. Its goal is to make a prediction or make guesses which are good enough to be useful.MLlib is the Apache Spark’s scalable machine learning library. MLlib has multiple algorithms for Supervised and Unsupervised ML which can scale out on a cluster for classification, regression, clustering, collaborative filtering. MLlib interoperates with Python’s math/numerical analysis library NumPy and also with R’s libraries. Some of these algorithms are also applicable to streaming data. MLlib helps Spark provide sentiment analysis, customer segmentation and predictive intelligence.A very common use case of ML is text classification, say for categorising emails. An ML pipeline can be trained to classify emails by reading an Inbox. A typical ML pipeline looks like this. ML is a subject in itself so it is not possible to deep dive here.Fog computing: Fog Computing is another use case of Apache Spark. To understand Fog computing we need to understand IoT first. IoT basically connects all our devices so that they can communicate with each other and provide solutions to the users of those devices. 
This would mean huge amounts of data and current cloud computing may not be sufficient to cater to so much data transfer, data processing and online demand of customer’s request.Fog computing can be ideal here as it takes the work of processing to the devices on the edge of the network. This would need very low latency, parallel processing of ML and complex graph analytical algorithms, all of which are readily available in Apache spark out of the box and can be pick and choose as per the requirements of the processing. So it is expected that as IoT gains momentum Apache spark will be the leader in Fog computing.Event Detection:Apache Spark is increasingly used in event detection like credit card fraud detection, money laundering activities etc. Apache spark streaming along with MLlib and Apache Kafka forms the backbone of a fraud financial transaction detection.Credit card transactions of a cardholder can be captured over a period of time to categorize user’s spending habits. Models can be developed and trained to predict any anomaly in the card transaction and along with Spark streaming and Kafka in real time.Interactive Analysis:Spark’s one of the most popular features is its ability to provide users with interactive analytics. MapReduce does provide tools like Pig and Hive for interactive analysis, but they are too slow in most of the cases. But Spark is very fast and swift and that’s why it has gained so much ground in the interactive analysis.Spark interfaces with programming languages like R, Python, SQL and Scala which caters to a bigger set of developers and users for interactive analysis.Spark also came up with Structured Streaming in version 2.0 which can be used for interactive analysis with live data as well as join the live data with batch data output to get more insight into the data. Structured streaming in future has the potential to boost Web Analytics by allowing users to query user’s live web session. 
Even machine learning can be applied to live session data for more insights.Data Warehousing: Data warehousing is another function where Apache Spark has is getting tremendous traction. Due to an increasing volume of data day by day, the tradition ETL tools like Informatica along with RDBMS are not able to meet the SLAs as they are not able to scale horizontally. Spark along with Spark SQL is being used by many companies to migrate to Big Data based Warehouse which can scale horizontally as the load increases.With Spark, even the processing can be scaled horizontally by adding machines to the Spark engine cluster.These migrated applications embed the Spark engine and offer a web UI to allow users to create, run, test and deploy jobs interactively. Jobs are primarily written in native Spark SQL or other flavours of SQL. These Spark clusters have been able to scale to process many terabytes of data every day and the clusters can be hundreds to thousands of nodes.Companies using Apache SparkApache Spark at Alibaba:Alibaba is the world’s one of the biggest e-commerce players. Alibaba’s online shopping platform generates Petabytes of data as it has millions of users every day doing searches, shopping and placing orders. These user interactions are represented as complex graphs. The processing of these data points is done using Spark’s Machine learning component MLlib and then used to provide better user shopping experience by suggesting products based on choice, trending products, reviews etc.Apache Spark at MyFitnessPal:MyFitnessPal is one of the largest health and fitness lifestyle portals. It has over 80 million active users. The portal helps its users follow and achieve a healthy lifestyle by following a proper diet and fitness regime. The portal uses the data added by users about their food, exercise and lifestyles to identify the best quality food and effective exercise. 
Using Spark the portal is able to scan through the huge amount of structured and unstructured data and pull out best suggestions for its users.Apache Spark at TripAdvisor:TripAdvisor has a huge user base and generates a mammoth amount of data every day. It is one of the biggest names in the Travel and Tourism industry. It helps users plan their personal and official trips around the world. It uses Apache Spark to process petabytes of data from user interactions and destination details and gives recommendations on planning a perfect trip based on users choice and preferences. They help users identify best airlines, best prices on hotels and airlines, best places to eat, basically everything needed to plan any trip. It also ranks these places, hotels, airlines, restaurants based on user feedback and reviews. All this processing is done using Apache SparkApache Spark at Yahoo:Yahoo is known to have one of the biggest Hadoop Cluster and everyone is aware of Yahoo’s contribution to the development of Big Data system. Yahoo is also heavily using Apache Spark Machine learning capabilities to identify topics and news which users are interested in. This is similar to trending tweets or hashtags on Twitter or Facebook. Earlier these Machine Learning algo were developed in C/C++ with thousands of lines of code. While today with Spark and Scala/Pythons these algorithms can be implemented in few hundreds of lines of code. This is a big leap in turnover time as well as code understanding and maintenance. This has been made possible due to Spark to a great extent.Apache Spark Use casesFinance: Spark is used in Finance industry across different functional and technology domains.A typical use case is building a Data Warehouse for batch processing and daily reporting. 
The Spark data frames abstraction has been used as a generic ingestion platform capable of ingesting data from multiple sources of different formats.Financial services companies also use Apache Spark MLlib to create and train models for fraud detection. Some of the banks have started using Spark as a tool for classifying text in money transfers.Some of the companies use Apache spark as log collection, an analysis engine and detection engine.Let’s look at Spain's 2nd biggest bank BBVA use case where every money transfer a customer makes goes through an engine that infers a category from its textual description. This engine has been developed in Spark, mixes MLLib and own implementations, and is currently into production serving more than 5M customers daily.The challenges that the BBVA technology team faced while building this ML were many:They did not know the data source in advanceThey did not have a labelled setA fraction of texts is useless (detection rather than classification)Distribution of categories is imbalancedPrefer false negatives over false positivesVery short text, language not even syntactically correctThe engineers solved these problems using the Spark MLlib pipeline using some other NLP tools like word2vec.TF-IDF features + linear classifier (98% precision, 21% recallFurther tests with word2vec + Vector of Locally Aggregated Descriptors (VLAD)Implemented in Spark/Scala, using MLlib classesOwn classes implemented for Multi-class Logistic Regression, VLADScala dependency injection useful to quickly setup variants of the above stepsHealthCare:Healthcare industry is the newest in adopting advanced technologies like big data and machine learning to provide hi-tech facilities to their patients. Apache Spark is penetrating fast and is becoming the heartbeat in the latest Healthcare applications. 
Hospitals use these Spark enabled healthcare applications to analyze patients medical history to identify possible health issues based on history and learning.Also, healthcare produces massive amounts of data and to process so much of the data in quick time and provide insights based on that itself was a challenge which Spark solves with ease.Another very interesting problem in hospitals is when working with Operating Room(OR) scheduling within a hospital setting is that it is difficult to schedule and predict available OR block times. This leads to empty and unused operating rooms leading to longer waiting times for patients for their procedures.Let’s see a use case. For a basic surgical procedure, it costs around $15-20 per minute. So, OR is a scarce and valuable resource and it needs to be utilized carefully and optimally. OR efficiency differs depending on the OR staffing and allocation, not the workload. So the loss of efficiency means a loss for the patient. So time and management are the utmost importance here.Spark and MLlib solve the problem by developing a predictive model that would identify available OR time 2 weeks in advance, allows hospitals to confirm waitlist cases two weeks in advance instead of when blocks normally release 4 days out. 
This OR scheduling can be done by taking the historical data and running a linear regression model with multiple variables on it.

This model works because:
- Waitlist scheduling logistics can be coordinated with physicians and patients within 2 weeks of surgery.
- Staff scheduling and resources can be planned so there are fewer last-minute staffing issues for nursing and anaesthesia.
- Utilization metrics show where the elective surgical schedule and level demand can be maximized.

Retail:
Big retail chains commonly face the problem of optimising their supply chain to minimize cost and wastage, improve customer service and gain insights into customers’ shopping behaviour, so that they can serve customers better and, in the process, optimize their profit.

To achieve these goals, retail companies face many challenges, such as keeping the inventory up to date based on sales and predicting sales and inventory during promotional events and sale seasons. They also need to track the transit and delivery of customers’ orders. All of these pose huge technical challenges. Apache Spark and MLlib are used by many of these companies to capture real-time sales and invoice data, ingest it and then work out the inventory. The technology can also be used to identify an order’s transit and delivery status in real time. Spark MLlib analytics and predictive models are used to predict sales during promotions and sale seasons, so that the inventory can be matched and made ready for the event. Historical data on customers’ buying behaviour is also used to provide personalized suggestions and improve customer satisfaction. Many stores have started using sensors to collect data on a customer’s location within the store, preferences and shopping behaviour, in order to provide on-the-spot suggestions and help the customer find and buy a product by sending messages, using displays and so on.

Travel:
Airline customer segmentation is a challenging field to understand due to customers’ complex behaviour.
Amadeus is one of the main IT solution providers in the airline industry. It has the resources and infrastructure to manage all the ticketing and booking data, and it understands airline needs and market particularities. By combining data sources produced by different airline systems, Amadeus has applied unsupervised machine learning techniques to improve the understanding of customer behaviour.

The challenges in the airline industry centre on understanding the health of the business:
- Are any segments growing or shrinking?
- How is the yield developing?
- Tune marketing to specific interests within segments
- Optimize product offers using fare structures and media offers

Traditional approaches to segmentation were based on business intuition and manually crafted rule sets. These approaches have limitations and prejudices which can sometimes hurt the business. In contrast, the data-driven approach is resilient against turnover, prejudices and market change.

With a data-driven approach using Spark and MLlib, the model is able to extract actionable insights on typical customer behaviour and intentions. Supervised and unsupervised learning techniques from Spark MLlib are used at scale to train models for prediction. These are then used to assist the customer in deploying the newfound insights into day-to-day operations.

Media:
Media companies such as Netflix and Hotstar use Apache Spark at the heart of their technology engine to drive their business. When a user turns on Netflix, they see their favourite content playing automatically. This is achieved through recommendation engines built on machine learning algorithms and Spark MLlib.
Netflix uses historical data from users’ content selections to train its ML algorithms, tests them offline, then deploys them live and checks that they work in production as well.

Netflix has built an engine called Time Travel using Apache Spark and other big data technologies to snapshot online services and use the snapshot data offline to generate features, and to share facts and features between experiments without calling live systems.

Energy:
Apache Spark is spreading its roots everywhere. Someone outside the software industry may not realise it, but there are applications extracting data from the home environment and processing it in Spark to make life better and easier. The example we will discuss here is British Gas.

British Gas is a 200-year-old company. Connected Homes is BG’s IoT “startup” and a leader in the UK’s connected home market. Connected Homes is trying to predict the consumption patterns of electricity and gas in homes and to provide consumers with insights, so that they can use their devices smartly, reduce energy consumption and save money. Connected Homes uses Apache Spark at the core of its data engineering and ML engine.

The challenge is that there are millions of electricity and gas meters, and the meters are read every 30 minutes. The data includes:
- Gas and electricity meter readings
- Thermostat temperature data
- Connected boiler data
- Real-time energy consumption data
- Newly introduced motion sensors, window and door sensors, etc.

Apache Spark MLlib is used to apply machine learning to this data for disaggregation, similar-home comparison, and indirect algorithms that use smart meter data for non-smart customers.

The analytics engine is used to show customers how they have spent energy, what their top 3 spends are, and how they can reduce their energy consumption, by showing patterns from smart consumers and smart meters.
This gives customers a lot of insight and educates them on using energy optimally at home.

Gaming:
The online gaming industry is another beneficiary of Apache Spark. Riot Games uses Spark to combat abusive language in chat in team games. The challenges in online gaming are:
- 1% of all players are consistently unsportsmanlike
- 2% of all games are infected by serious toxicity
- 95% of all serious in-game toxicity comes from players who are otherwise sportsmanlike

To solve this, the game developers tried to predict the words used by gamers in the context of the game or the scenario. They used Word2Vec, a neural model, to learn 256-dimensional embeddings from months of chat logs. Each chat message is treated as a document, split on spaces and lowercased. The model had to cope with acronyms, short forms and colloquial words, where the variation can be huge. The team built a model trained to predict bad or toxic language. The gaming company has 100+ million users every month, so the data is huge. They used Spark MLlib to train their models with different algorithms, including Logistic Regression, Random Forest and Gradient Boosted Trees. The results were impressive for them as they tuned their models for better precision.

Benefits of having Apache Spark for individual companies
Many companies across industries have been benefiting from Apache Spark. The reasons differ:
- Speed of execution
- Multi-language support
- Machine learning library
- Graph processing library
- Batch processing as well as stream and structured stream processing

Apache Spark is beneficial for small as well as large enterprises. Spark offers a complete solution to many common problems such as ETL and warehousing, stream data processing, and the common supervised and unsupervised learning use cases for data analytics and predictive modelling. So with Apache Spark, the technology team does not need to look for different technology stacks and multiple vendors for a solution.
This reduces the learning curve for additional development and maintenance. Also, since Spark supports multiple languages (Scala, Java, Python and R), it is easy to find developers.

Limitations:
Although Apache Spark has many benefits, as we have seen above, it also has a few limitations. We should be aware of them before we decide to adopt the technology.
- Apache Spark does not come with a built-in file system, and in most use cases it depends on HDFS. Otherwise it has to be used with some cloud-based data platform.
- Even though Spark has a stream processing feature, it is not exactly real-time processing. It processes data in small batches, called micro-batches.
- Apache Spark is expensive, as it caches a lot of data in memory and memory is not cheap.
- Spark runs into issues when working with HDFS data that is spread over a very large number of small files.
- Although Spark MLlib provides machine learning capabilities, it does not come with an exhaustive list of algorithms. It can solve a lot of ML problems, but not all.
- Spark does not optimize all code automatically; RDD-based code in particular has to be tuned by hand.

Reasons why you should learn Apache Spark
We have seen the wide impact and use cases of Apache Spark, and we know that Spark has become a buzzword these days. We should now also understand why we should learn Spark.

Spark offers a complete package for developers and can act as a unified analytics engine, so it increases developer productivity. Its ROI for a firm is therefore high, and most technology-dependent companies are aware of this and are willing to put their money into Spark.

Learning Spark can help you explore the world of big data and data science. Both of these technology fields are the future and are bringing transformational changes to almost all industries.
So getting exposed to Spark is becoming a necessity for all firms.

With fast-paced Spark adoption by organizations, new business prospects are opening up, and many applications have shown that, instead of business driving technology, technology is now driving business.

According to a survey, there is a huge demand for Spark engineers. Today, there are well over 1,000 contributors to the Apache Spark project across 250+ companies worldwide. Recently, Indeed.com listed over 2,400 full-time open positions for Apache Spark professionals across various industries including enterprise technology, e-commerce/retail, healthcare and life sciences, oil and gas, manufacturing, and more.

Apache Spark developers are reported to earn among the highest average salaries of all programmers, which is another major incentive to learn Spark and build expertise in it.

An older survey by Databricks shows the importance and impact of Apache Spark as of 2016. By 2019 these numbers are likely to have grown much bigger.

Conclusion
Apache Spark can process huge amounts of data very efficiently and with high throughput. It can solve problems related to batch processing and near real-time processing, and it can be used to implement a lambda architecture and for Structured Streaming. It can also solve many complex data analytics and predictive analytics problems with the help of the MLlib component, which comes out of the box. Apache Spark has been making a big impact on the whole data engineering and data science gamut at scale.
How to Install Spark on Ubuntu

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

In this article, we will cover the installation procedure of Apache Spark on the Ubuntu operating system.

Prerequisites
This guide assumes that you are using Ubuntu and that Hadoop 2.7 is installed on your system.
- Java 8 should be installed on your machine.
- Hadoop should be installed on your machine.

System requirements
- Ubuntu OS installed.
- Minimum of 8 GB RAM.
- At least 20 GB free space.

Installation Procedure

Making the system ready:
Before installing Spark, ensure that Java 8 is installed on your Ubuntu machine. If it is not, follow the process below.

a. Install Java 8 using the command below.

    sudo apt-get install oracle-java8-installer

This command creates the java-8-oracle directory under /usr/lib/jvm/ on your machine.

Now we need to configure the JAVA_HOME path in the .bashrc file, which is executed whenever we open a terminal.

b. Configure JAVA_HOME and PATH in the .bashrc file and save. To edit the .bashrc file, use the command below.

    vi ~/.bashrc

Press i (for insert), then enter the lines below at the bottom of the file. There must be no space after the = sign.

    export JAVA_HOME=/usr/lib/jvm/java-8-oracle
    export PATH=$PATH:$JAVA_HOME/bin

Then press Esc, type :wq! to save the changes, and press Enter.

c. Now test whether Java is installed properly by checking its version. The command below should show the Java version.

    java -version

Installing Spark on the system:
Go to the official download page of Apache Spark and choose the latest release.
For the package type, choose ‘Pre-built for Apache Hadoop’.

    https://spark.apache.org/downloads.html

Or you can use a direct link to download:

    https://www.apache.org/dyn/closer.lua/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

Creating the Spark directory
Create a directory called spark under the /usr/ directory using the command below.

    sudo mkdir /usr/spark

This command asks for your password in order to create the directory under /usr. Then check that the spark directory has been created using the command below.

    ll /usr/

The listing should include the ‘spark’ directory.

Go to the /usr/spark directory using the command below.

    cd /usr/spark

Download Spark
Download Spark 2.4.0 into the spark directory using the command below. The directory is owned by root, so the command needs sudo.

    sudo wget https://www.apache.org/dyn/closer.lua/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

If the downloaded file turns out to be a mirror-selection page rather than the archive, download the .tgz from the mirror suggested on that page instead. If you use the ll or ls command, you can see spark-2.4.0-bin-hadoop2.7.tgz in the spark directory.

Extract the Spark file
Extract spark-2.4.0-bin-hadoop2.7.tgz using the command below.

    sudo tar xvzf spark-2.4.0-bin-hadoop2.7.tgz

The archive is now extracted as the directory spark-2.4.0-bin-hadoop2.7. Check that it was extracted using the ll command.

Configuration
Configure the SPARK_HOME path in the .bashrc file by following the steps below.

Go to the home directory using the command below.

    cd ~

Open the .bashrc file using the command below.

    vi .bashrc

Now configure SPARK_HOME and PATH. Press i for insert, then enter SPARK_HOME and PATH as below.

    export SPARK_HOME=/usr/spark/spark-2.4.0-bin-hadoop2.7
    export PATH=$PATH:$SPARK_HOME/bin

Then save and exit: press Esc, type :wq! and press Enter.

Test Installation:
Now we can verify whether Spark is successfully installed on our Ubuntu machine. Open a new terminal so that the updated .bashrc is loaded, then run the command below.

    spark-shell

The command should start the Spark shell and show a scala> prompt.

We have now successfully installed Spark on the Ubuntu system.
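The .bashrc configuration step above can also be scripted instead of being edited by hand in vi. The following is a minimal sketch, not part of the original procedure: the function name add_spark_env is hypothetical, and it assumes the install path used in this guide. It appends the two export lines only if SPARK_HOME has not already been added, so running it twice does not duplicate the entries.

```shell
#!/usr/bin/env bash
# Sketch: append the Spark environment settings to a shell startup file, once.
# add_spark_env is a hypothetical helper name, not something Spark provides.

add_spark_env() {
    local spark_dir="$1"   # e.g. /usr/spark/spark-2.4.0-bin-hadoop2.7
    local rc_file="$2"     # e.g. ~/.bashrc

    # Skip if an earlier run already added the setting.
    if grep -q '^export SPARK_HOME=' "$rc_file" 2>/dev/null; then
        echo "SPARK_HOME already set in $rc_file"
        return 0
    fi

    # Single quotes keep $PATH and $SPARK_HOME unexpanded, so they are
    # evaluated each time the startup file is read, not now.
    {
        echo "export SPARK_HOME=$spark_dir"
        echo 'export PATH=$PATH:$SPARK_HOME/bin'
    } >> "$rc_file"
    echo "Added SPARK_HOME to $rc_file"
}
```

A typical call would be add_spark_env /usr/spark/spark-2.4.0-bin-hadoop2.7 ~/.bashrc, followed by opening a new terminal (or running source ~/.bashrc) so that the change takes effect.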
Let’s create an RDD and a DataFrame, and then we will wrap up.

a. An RDD can be created in 3 ways; we will use one of them. Define any list and then parallelize it, which creates the RDD. The code is below. Copy and paste it line by line into the Spark shell.

    val nums = Array(1,2,3,5,6)
    val rdd = sc.parallelize(nums)

This creates the RDD.

b. Now we will create a DataFrame from the RDD. Follow the steps below.

    import spark.implicits._
    val df = rdd.toDF("num")

This creates a DataFrame with num as its column. To display the data in the DataFrame, use the command below.

    df.show()

How to uninstall Spark from the Ubuntu system:
You can follow the steps below to uninstall Spark on Ubuntu.

Remove SPARK_HOME from the .bashrc file. First go to the home directory using the command below.

    cd ~

Open the .bashrc file using the command below.

    vi .bashrc

Press i to edit, then find and delete the SPARK_HOME line and the PATH line that were added during configuration. Then press Esc, type :wq! and press Enter to save.

We will also delete the downloaded and extracted Spark files from the system. Since they were placed under /usr/spark, use the command below.

    sudo rm -r /usr/spark

This command deletes the spark directory from the system.

Open a new terminal, type spark-shell and press Enter. We now get an error, which confirms that Spark has been successfully uninstalled from the Ubuntu system.
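The .bashrc clean-up in the uninstall steps above can be sketched as a small script as well. This is an illustration under assumptions rather than part of the original procedure: remove_spark_env is a hypothetical name, and the patterns assume the SPARK_HOME and PATH lines were written the way this guide shows them. It keeps a backup copy of the file before rewriting it.

```shell
#!/usr/bin/env bash
# Sketch: remove the Spark settings from a shell startup file.
# remove_spark_env is a hypothetical helper name, not something Spark provides.

remove_spark_env() {
    local rc_file="$1"     # e.g. ~/.bashrc

    # Keep a backup so the original file can be restored if needed.
    cp "$rc_file" "$rc_file.bak"

    # Copy back every line except the SPARK_HOME assignment (with or
    # without "export") and the PATH line that references $SPARK_HOME/bin.
    # grep exits non-zero when nothing is left to print, hence "|| true".
    grep -v -e '^export SPARK_HOME=' \
            -e '^SPARK_HOME=' \
            -e 'SPARK_HOME/bin' \
            "$rc_file.bak" > "$rc_file" || true
}
```

A typical call would be remove_spark_env ~/.bashrc, followed by sudo rm -r /usr/spark to delete the installation directory itself.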