Apache Spark Tutorial

Introduction to Big Data

Before we delve into Apache Spark, let us first get some context about what Big Data is and what it entails. It is fair to say that we live in a world of data deluge: an enormous amount of data is generated every day. This data has to be stored and processed because much of it, if not all, is useful and meaningful to us. Traditional computing and processing systems cannot process this volume of data within a reasonable timeframe.

The next question is who or what is generating this volume of data. Just take a look around and you can see the sources of Big Data: media data such as print, video, and audio content; social media data from websites like Facebook, Twitter, LinkedIn, and Snapchat; e-commerce data about products, features, and reviews; financial data; genomics data from the healthcare industry; demographics data; and so on. Big Data is not a new phenomenon. Various forms of it have been generated for decades; for example, aircraft have long produced enormous amounts of flight data that is recorded, tracked, and monitored. What has changed is the rate of growth: data now accumulates far faster than it did 10-15 years ago.

Big data is characterized by at least three distinct dimensions: Volume, Velocity, and Variety. Many data engineers add two more: Veracity and Value.

  • Volume: Volume signifies the amount of data. For data to count as big data, its size should be huge relative to the context in which it is used.
  • Velocity: Velocity refers to the high speed at which data accumulates from many sources such as mobile phones, computers, and IoT devices.
  • Variety: Variety refers to the nature of the data, which may be structured, semi-structured, or unstructured and may come from heterogeneous sources.
  • Veracity: Veracity refers to the uncertainty and inconsistency in the incoming data. Data that is messy or inconsistent makes processing more complex.
  • Value: The last of the V's, but a very important one. Whatever the amount and variety of data we accumulate, we need to extract value from it that is useful to the organization; otherwise the data, and the whole exercise, serves no purpose.

Because of these dimensions, it is almost impossible to store and process big data with traditional processing systems. There was therefore a need for systems that could store and process big data and make it useful. This gave rise to distributed file systems and distributed processing systems such as Hadoop, with HDFS for storage and MapReduce for processing. However, MapReduce has its own limitations. This led to the development of Spark, which builds on the ideas behind MapReduce, extends them in several directions, and can improve performance manifold when properly implemented. Check out the differences between MapReduce and Spark.
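
To make the MapReduce model mentioned above a little more concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop or Spark code, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical names chosen for illustration. It only shows the three logical steps of a word count, which is the classic example used to explain the model.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped counts to get one total per word."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Illustrative sample input, not real data.
    sample = ["big data needs big systems", "spark processes big data"]
    print(word_count(sample))
```

On a real cluster, each phase runs in parallel across many machines, and MapReduce writes intermediate results to disk between stages. Spark expresses the same kind of computation but keeps intermediate data in memory where it can, which is one source of the performance difference.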
