Before we delve into Apache Spark, let us first get some context about what Big Data is and what it entails. It is fair to say that we live in a world of data deluge: an enormous amount of data is generated every day. This data has to be stored and processed, because much of it, if not all, is useful and meaningful. Traditional computing and processing systems cannot process this volume of data within a reasonable timeframe.
The next question is who or what is generating this volume of data. Just take a look around you and you can see the sources of Big Data! There is media data such as print, video, and audio content; social media data from sites like Facebook, Twitter, LinkedIn, and Snapchat; e-commerce data about products, features, and reviews; financial data; genomics data from the healthcare industry; demographics data; and so on. Big Data is not new. Various forms of it have been generated for decades; for example, aircraft have long produced enormous amounts of flight data that is recorded, tracked, and monitored. What has changed is the rate of growth: data is now being generated far faster than it was 10-15 years ago.
Big data is characterized by at least three distinct dimensions: Volume (how much data there is), Velocity (how fast it arrives), and Variety (how many different forms it takes). Many data engineers add two more: Veracity (how trustworthy the data is) and Value (how useful it is).
Because of these dimensions, it is almost impossible to store and process big data with traditional systems. A system was needed that could store and process big data and make it useful. This gave rise to distributed file systems and distributed processing frameworks such as Hadoop, with HDFS for storage and MapReduce for processing. However, MapReduce has its own limitations. This led to the evolution of Spark, which builds on the ideas behind MapReduce, extends the model with many more capabilities, and can improve performance manifold when properly implemented. Check out the differences between MapReduce and Spark.