
Scala Vs Python Vs R Vs Java - Which language is better for Spark & Why?

One of the most important decisions for big data learners and beginners is choosing the best programming language for big data manipulation and analysis. Understanding the business problem and choosing the right model is not enough; implementing it well is equally important, and choosing the right language (or languages) for solving the problem goes a long way. If you search for the top and most effective programming languages for Big Data on Google, you will find the following top 4:

- Java
- Scala
- Python
- R

Java
Java is the oldest of the 4 programming languages listed here. Traditional big data frameworks such as Apache Hadoop and all the tools within its ecosystem are Java-based, and hence using Java opens up the possibility of utilizing the large ecosystem of tools in the big data world.

Scala
Scala is a beautiful crossover between an object-oriented and a functional programming language, and it is a highly scalable language (the name itself comes from "scalable language"). Scala was created by the German computer scientist Martin Odersky, and the first version was launched in 2003.

Python
Python was originally conceptualized by Guido van Rossum in the late 1980s. It was initially designed as a response to the ABC programming language and later gained popularity as a general-purpose language in the big data world. Python was named one of the fastest-growing programming languages in the 2018 Stack Overflow Developer Survey. Many data analysis, manipulation, machine learning and deep learning libraries are written in Python, and hence it has gained popularity in the big data ecosystem. It is a very user-friendly language, and that is its biggest advantage.

Fun fact: Python is not named after the snake. It's named after the British TV show Monty Python.

R
R is the language of statistics: a language and environment for statistical computing and graphics. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. R is named partly after the first names of its first two authors and partly as a play on the name of S*. The project was conceived in 1992, with an initial version released in 1995 and a stable beta version in 2000.

*S is a statistical programming language developed primarily by John Chambers; R is an implementation of S combined with lexical scoping semantics inspired by Scheme.

Every framework is implemented in an underlying programming language: Zend uses PHP, the pandas library uses Python, and similarly the Hadoop framework uses Java and Spark uses Scala. However, Spark officially supports all 4 languages: Java, Scala, Python and R. If you browse through Apache Spark's official documentation, you will find several other languages used by the open-source community for Spark implementations.

When any developer wants to start learning Spark, the first question they stumble upon is: out of this pool of languages, which one to use and which one to master? Solution architects have a tough time choosing the right language for the Spark framework, and organizations keep wondering which skill sets are relevant to their problem, if they don't have the right knowledge of these languages in the context of Spark. This article will try to answer all these queries, so let's start.

Java
Java is the oldest, most popular and most widely adopted programming language of the four.
There are a number of features and advantages that make Java a favourite of big data developers and tool creators:

- Java is a platform-agnostic language and can run on almost any system. It is portable thanks to the Java Virtual Machine (JVM). The JVM is the foundation of Hadoop ecosystem tools like MapReduce, Storm and Spark; these tools are written in Java and run on the JVM.
- Java has strong community support through platforms such as GitHub and Stack Overflow.
- Java is a scalable, backward-compatible, stable and production-ready language, and it supports a large variety of tried and tested libraries.
- It is a statically typed language (we will compare this with the other languages in later sections).

Java is the choice for most big data projects, but for the Spark framework one has to ponder whether Java would be the best fit:

- One major drawback of Java is its verbosity: you have to write many lines of code to achieve simple functionality.
- Java has traditionally lacked a Read-Evaluate-Print Loop (REPL), which is a major deal-breaker when choosing a programming language for interactive big data processing.

Scala
Scala is comparatively new to the programming scene but has become popular very quickly. In the Spark context, many experts prefer Scala over other programming languages because Spark is written in Scala. Scala is the native language of Spark: any new API is always available in Scala first.

Scala is a hybrid language because it has the features of both object-oriented and functional programming. As an object-oriented language, it treats every value as an object and all OOP concepts apply. As a functional language, it defines and supports functions; all operations are expressed as functions, and no variable stands by itself. Scala is a compiled language: it compiles to JVM bytecode.

Scala and Java are both popular programming languages that run on the JVM, which makes them framework-friendly. One could say Scala is an advanced take on Java.

Features/Advantages of Scala:
- It's a general-purpose object-oriented language with functional language properties too, and it is less verbose than Java.
- It runs on the JVM and hence is portable.
- It can use Java APIs comfortably.
- It is fast and robust in the Spark context, as it is Spark's native language.
- It is a statically typed language.
- Scala supports a Read-Evaluate-Print Loop (REPL).

Drawbacks/Downsides of Scala:
- Scala is complex to learn due to the functional nature of the language, with a steep learning curve.
- It lacks mature machine learning libraries.

Python
Python is one of the de-facto languages of data science. It is a simple, open-source, general-purpose language and is very easy to learn. It has a rich set of libraries, utilities and ready-to-use features, and it supports a number of mature machine learning, big data processing and visualization libraries.

Advantages of Python:
- It is an interpreted language with REPL support (Read, Evaluate, Print, Loop): if you type a command into the interpreter, it responds immediately. Java lacks this feature. (A short sketch follows below.)
- It is easy to learn and debug, with fewer lines of code.
- It is dynamically typed: variable types are determined at runtime rather than declared up front.
- Python is platform-agnostic and scalable.
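To make the REPL point concrete, here is a minimal, purely illustrative sketch of an interactive PySpark session. The file path and column names are invented placeholders; the `pyspark` shell itself creates the `spark` session object for you.

```python
# Started from the `pyspark` shell, which provides a ready-made SparkSession named `spark`.
# "/data/sales.csv", "region" and "amount" are hypothetical placeholders.
df = spark.read.csv("/data/sales.csv", header=True, inferSchema=True)
df.printSchema()                               # the shell answers immediately
df.groupBy("region").sum("amount").show()      # results print straight back at the prompt
```

Each line is evaluated the moment it is typed, which is exactly the interactive feedback loop that compiled Java code, without a REPL, does not give you.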
Drawbacks/Disadvantages of Python:
- Python is slow. Big data professionals find that projects built in Java or Scala are faster and more robust than the ones built with Python.
- When using user-defined functions or third-party libraries in Python with Spark, processing is slower, because extra work is involved: Python does not have an equivalent native Java/Scala API for these functionalities.
- Python supports heavyweight process forking (for example via uWSGI), but it does not support true multithreading.

R Language
R is the favourite language of statisticians; it is fondly called the language of statisticians. It is popular for research, plotting and data analysis, and together with RStudio it makes a killer combination for statistics, plotting and data analytics. R is mainly used for building data models for data analysis.

Advantages/Features of R:
- Strong statistical modeling and visualization capabilities.
- Support for 'data science' related work.
- It can be integrated with Apache Hadoop and Spark easily.

Drawbacks/Disadvantages of R:
- R is not a general-purpose language.
- Code written in R cannot usually be deployed into production directly; it needs conversion into Java or Python.
- It is not as fast as Java or Scala.

Comparison of the four languages for Apache Spark
With the introduction of these 4 languages done, let's now compare them for the Spark framework. Broadly, these languages can be categorized into 2 buckets on the basis of high-level Spark architecture support:

- JVM languages: Java and Scala
- Non-JVM languages: Python and R

Because of this categorization, performance may vary. Let's look at the architecture in a little more depth to understand the performance implications of using these languages; this will also help answer the question of when to use which language.

Spark framework high-level architecture
An application written in any one of the languages is submitted on the driver node, and the driver node then distributes the workload by dividing the execution across multiple worker nodes.

JVM-compatible application execution flow
Consider applications that are JVM-compatible (Java/Scala). Spark itself is written in Scala, a JVM language, so no explicit conversion is required at any point to execute JVM-compatible applications on Spark. This also makes native-language applications faster on the Spark framework.

For applications written in Python or R there are multiple scenarios. The Python/R driver talks to the JVM driver through a socket-based API; on the driver node, both driver processes are invoked when the application language is a non-JVM language.

Scenario 1: An equivalent Java/Scala driver API exists. This executes the same way as a JVM-compatible application, by invoking the Java API on the driver node itself. The cost of inter-process communication through sockets is negligible, and hence performance is comparable. This assumes that data processed on the worker nodes is not sent back to the driver.

Scenario 1(b): If that assumption does not hold, i.e. processed data on the worker nodes has to be sent back to the driver, then significant serialization overhead is involved. This adds to processing time, and performance deteriorates.

Scenario 2: No equivalent Java/Scala driver API exists, e.g. UDFs (user-defined functions) or third-party Python libraries. In such cases, additional Python worker processes are started on the worker nodes, and the Python code is serialized to the workers and executed there. These Python worker processes run in addition to the JVM, and the coordination between them is overhead; the processes also compete for resources, which adds to memory contention. In addition, if the data has to be sent back to the driver node, processing takes much longer, and the problem scales up as data volume increases, so performance suffers even more. A short illustrative sketch of these scenarios follows.
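To make the scenarios concrete, here is a small, hedged PySpark sketch; the column names and sample data are invented for illustration. The built-in function stays on the JVM execution path (scenario 1), the plain Python UDF forces rows to be serialized to Python worker processes on the executors (scenario 2), and collect() ships results back to the driver (scenario 1(b)).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("scenario-demo").getOrCreate()
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])

# Scenario 1: built-in functions have equivalent JVM APIs, so the work stays on the JVM.
jvm_only = df.withColumn("name_upper", F.upper(F.col("name")))

# Scenario 2: a plain Python UDF has no JVM equivalent; rows are serialized out to extra
# Python worker processes on each executor and back, adding overhead and memory contention.
to_upper = F.udf(lambda s: s.upper(), StringType())
with_udf = df.withColumn("name_upper_udf", to_upper(F.col("name")))

# Scenario 1(b): collect() serializes the processed rows back to the driver process;
# cheap for two rows, but increasingly costly as data volume grows.
rows = with_udf.collect()
print(rows)
```

Keeping the work in built-in functions and avoiding unnecessary collect() calls is the practical way to keep a Python application close to JVM-language performance.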
Having seen the performance implications, let's look at a point-by-point comparison of these languages.

- Performance: Java - Faster; Scala - Faster (about 10x faster than Python); Python - Slower; R - Slower.
- Learning curve: Java - Easier than Scala, tougher than Python; Scala - Steeper learning curve than Java and Python; Python - Easiest; R - Moderate.
- User groups: Java - Web/Hadoop programmers; Scala - Big data programmers; Python - Beginners and data engineers; R - Data scientists/statisticians.
- Usage: Java - Web development and Hadoop-native; Scala - Spark-native; Python - Data engineering/machine learning/data visualization; R - Visualization/data analysis/statistics use cases.
- Type of language: Java - Object-oriented, general purpose; Scala - Object-oriented and functional, general purpose; Python - General purpose; R - Specifically for data scientists, needs conversion into Scala/Python before productizing.
- Concurrency: Java - Supports concurrency; Scala - Supports concurrency; Python - Does not support concurrency; R - NA.
- Ease of use: Java - Verbose; Scala - Less verbose than Java; Python - Least verbose; R - NA.
- Type safety: Java - Statically typed; Scala - Statically typed (except for Spark 2.0 DataFrames); Python - Dynamically typed; R - Dynamically typed.
- Interpreted language (REPL): Java - No; Scala - No (compiled, though it provides a REPL); Python - Yes; R - Yes.
- Mature machine learning library availability/support: Java - Limited; Scala - Limited; Python - Excellent; R - Excellent.
- Visualization libraries: Java - Limited; Scala - Limited; Python - Excellent; R - Excellent.
- Web notebook support: Java - IJava kernel in Jupyter Notebook; Scala - Apache Zeppelin Notebook; Python - Jupyter Notebook; R - R Notebook.

Which language is better for Spark, and why?
With the information we have gathered about the languages, let's move to the main question: which language to choose for Spark? My answer is not a straightforward single language. I will state my point of view for two situations: (I) you are a beginner and want to choose a language from a learning-Spark perspective, and (II) you are an organization (or self-employed) choosing a language for a project solution.

I. If you are a beginner:
- If you are a beginner with no prior programming education, Python is the language for you: it is easy to pick up, simple to understand and very user-friendly. It is a good starting point for building Spark knowledge further. Also, if you are looking to get into roles like data engineering, knowledge of Python along with its supporting libraries will go a long way.
- If you are a beginner but have some education in programming languages, you may find Java very familiar and easy to build upon prior knowledge; after all, it is the forerunner of many of these languages.
- If you are a hardcore big data programmer and love exploring complexity, Scala is the choice for you. It is complex, but experts say that once you come to love Scala, you will prefer it over other languages any time.
- If you are a data scientist or statistician looking to work with Spark, R is the language for you; R is more science-oriented than Python.
II. If you are an organization, or are looking to choose a language for an implementation:
You need to answer the following important questions before choosing the language:

1. Skills and proficiency: which skill sets, and what level of proficiency in each language, do you already have in your team?
2. Design goals and availability of features / capability of the language: which libraries give you better support for the type of problem(s) you are trying to solve?
3. Performance implications.

Details of these are explained below.

1. Skill set: This is straightforward. Whatever skill set is available within the team, go with that to solve your problem, after evaluating the answers to the other two questions. If you are self-employed, the language you are proficient in is most likely the suitable choice.

2. Library support: The following gives the high-level capabilities of the languages:
- R: good for research, plotting, and data analysis.
- Python: good for small- or medium-scale projects to build models and analyse data, especially for fast start-ups or small teams.
- Scala/Java: good for robust programming with many developers and teams; they have fewer machine learning utilities than Python and R, but make up for it with easier code maintenance.

In my opinion, Scala/Java can be used for larger, robust projects to ease maintenance; also, if you want the application to scale quickly and need it to be robust, Scala is the choice. Python is a more universal language than R, but R is more science-oriented. Broadly, one can say Python suits data engineering use cases and R suits data-science-oriented use cases. If you discover that these two languages have about the same library support you need, then pick the one whose syntax you prefer; you may find that you need both, depending on the situation.

3. Performance: As seen earlier in the article, Scala/Java is roughly 10x faster than Python/R, as they are JVM languages. However, if you write Python/R applications wisely (for example, without UDFs and without sending data back to the driver), they can perform comparably well.

Conclusion
For learning, depending on your prior knowledge, Python is the easiest of all to pick up. For implementations, the choice of language is in your hands, but let me tell you one secret, or a tip: you don't have to stick to one language until you finish your project. You can divide your problem into small buckets and use the best language to solve each of them. This way, you can balance optimum performance, availability, proficiency in a skill, and the sub-problem at hand. Do let us know how your experience was in learning about these language comparisons, which language you think is better for Spark, and which one is "the one for you", through the comments below.

Shruti Deshpande

Blog Author

10+ years of data-rich experience in the IT industry. My journey started with data warehousing technologies, moved into data modelling, and then on to BI application architecture and solution architecture.


I am a Big Data enthusiast, and data analytics is a personal interest. I believe it has endless opportunities and the potential to make the world a sustainable place. Happy to ride this tide.


*Disclaimer* - Expressed views are the personal views of the author and are not to be mistaken for the employer or any other organization’s views.

Posts by Shruti Deshpande

How Big is ‘Big Data’, Anyway?

When I was introduced to the data world through my first corporate induction training, about 10 years ago, I was still processing the difference between data and information. The following helped me understand it:

- Data: raw, unprocessed facts and figures without any context, e.g. the number 20.
- Information: data structured and grouped together so that it can be interpreted, e.g. $20 for a toy.
- Knowledge: the combination of information, experience and insight that may benefit the individual or the organisation, e.g. $20 for a toy in a Black Friday sale at a mall.
- Wisdom: knowledge becomes wisdom when one can assimilate and apply it to make the right decisions, e.g. someone who wants to buy a toy will wait for the Black Friday sale to get it at a cheaper price.

By the time I started understanding the above differences, the term 'Big Data' was already making it big, and the obvious question in my mind was: when does 'data' become 'Big Data'? I then made an attempt to understand how big data has to be to be called big data, and here I have a big revelation to make for all of you reading this article: 'Big Data' is actually a misleading term. It has little to do with the "bigness" of the data; it is a term that needs to be understood in perspective.

The simplest relevant definition I could find is: big data is data that cannot be stored with traditional storage, cannot be processed with traditional methods, and cannot be processed within a short period of time (and these references remain valid as time advances). But this is not the textbook, or the only, definition of big data. Interestingly, one team's big data can be another team's traditional data, so it cannot truly be bounded in words, but it can loosely be described through numerous examples. I am sure that by the end of the article you will be able to answer the question for yourself. Let's start.

Do you know? NASA researchers Michael Cox and David Ellsworth used the term "big data" for the first time to describe a familiar challenge in the 1990s: supercomputers generating massive amounts of information (in Cox and Ellsworth's case, simulations of airflow around aircraft) that could not be processed and visualized.

If you go through a brief history of big data, you will find that data which did not fit into memory or disk was called a 'big data problem' back in 1997. As the years passed, innovations kept coming and disruptions were made, so the data universe keeps growing all the time. Let's look at a few widely available statistics for 'big data' (collected around 2017 or before):

- On average, people send about 500 million tweets per day.
- Snapchat users share 527,760 photos in a minute.
- Instagram users post 46,740 photos in a minute.
- More than 120 professionals join LinkedIn in a minute.
- Users watch 4,146,600 YouTube videos in a minute.
- The average U.S. customer uses 1.8 gigabytes of data per month on his or her cell phone plan.
- Amazon sells 600 items per second.
- On average, each person who uses email receives 88 emails per day and sends 34.
That adds up to more than 200 billion emails each day.
- MasterCard processes 74 billion transactions per year.
- Commercial airlines make about 5,800 flights per day.

You might be interested in reading Domo's "Data Never Sleeps 5.0" report for the numbers generated every minute of the day. Keeping in mind that the above stats are probably 1.5-2 years old and that data is ever-growing, they help establish the fact that 'big data' is a moving target. In short: "Today's big data is tomorrow's small data."

Now that we have some idea of the transactions/tweets/snaps in a day, let's also understand how much data all of these "one-minute quickies" are generating; after all, volume is one of the characteristics of big data (though mind you, not the only one). It is believed that in a single day the world produces 2.5 quintillion bytes (2.3 trillion gigabytes) of data; in layman's terms, this is the equivalent of everyone in the world downloading 60 episodes of Breaking Bad, in HD, 20 times! [Source: VCloud 2012] According to estimates, the volume of data worldwide doubles every 1.2 years. IDC predicts that the collective sum of the world's data will grow from 33 zettabytes this year to 175 ZB by 2025, a compound annual growth rate of 61 per cent; the 175 ZB figure represents a 9 per cent increase over last year's prediction of data growth by 2025 (as per the report published in December 2018).

But do you know how much one zettabyte of data actually is? One zettabyte is equal to one sextillion bytes, i.e. 10^21 (1,000,000,000,000,000,000,000) bytes; one zettabyte is roughly equal to a trillion gigabytes.

Fun fact: there is a legitimate term coined for today's era: the Zettabyte Era. The Zettabyte Era can be understood as an age of growth of all forms of digital data that exist in the world, which includes the public Internet but also all other forms of digital data, such as stored footage from security cameras or voice data from cell-phone calls. You should check out the infographic by EconomyWatch (taken from SearchEngineJournal) to understand how much data a zettabyte holds, putting it into context with current data storage capabilities and usage.

Today's 'big data' is generated mainly from 3 sources:
- People-generated: social media uploads, mails, etc.
- Machine-generated: M2M (machine-to-machine) interactions, IoT devices, etc.
- Business-generated: data generated and stored in today's OLTP systems, OLAP systems, data warehouses, data marts, reports and operational data throughout the enterprise/organization.

The various analytics tools available in the market today help in solving big data challenges by providing ways to store this data, process it and extract valuable insights from it. As we discussed, big data is a moving target as time advances. It is also interesting to note that even today, data which is not of huge size but is difficult to process, or of relatively smaller volume, can still be categorized as big data: for example, unstructured data in emails or from social media platforms, or data which needs to be processed in real time or near real time. All the examples we have seen so far are big data. But it would be a mistake to assume that big data is only data that is analyzed using Hadoop, Spark or another complex analytics platform.
Because big data is a moving, ever-growing target, and because disruptive new sources of data are introduced every day, newer tools will keep being invented to process this data; big data therefore cannot simply remain a function of the tools used to analyze it. To conclude, as discussed at the start of the article, it is still appropriate and reasonable to say that this moving target called big data will always be challenged on storage, on processing methods, and on processing it within a short period of time. So big data is a function of volume and/or time and/or storage and/or variety.

It was fun and exciting to discover the different aspects hidden in the words 'BIG DATA', and I thoroughly enjoyed solving this mystery. Did you enjoy solving it too? Do let us know how your experience was through the comments below. Happy learning!

Fundamentals of Apache Spark

Introduction
Before getting into the fundamentals of Apache Spark, let's understand what 'Apache Spark' really is. The authentic one-liner definition is: Apache Spark is a fast, general-purpose cluster computing system. You will find multiple definitions when you search for the term; all of them give a similar gist in different words. Let's understand the special keywords that describe Apache Spark.

Fast: Spark uses in-memory computing, so it is fast; it can run queries up to 100x faster. We will get to the details of the architecture that enables this a little later in the article. You will find the keywords 'fast' and/or 'in-memory' in all the definitions.

General purpose: Apache Spark is a unified framework. It provides one execution model for all tasks, which makes it easy for developers to learn, and they can work with multiple APIs easily. Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python, R and SQL shells. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming; you can combine these libraries seamlessly in the same application.

Cluster computing: efficient processing of data on a set of computers (read: commodity hardware) or distributed systems; a few definitions also call Spark a parallel data processing engine. Spark is used for big data analytics and related processing.

One more important keyword associated with Spark is open source: it was open-sourced in 2010 under a BSD license. Spark (and its RDD abstraction), in the form we see it today, was developed in 2012 in response to limitations of the MapReduce cluster computing paradigm; Spark is commonly seen as an in-memory replacement for MapReduce. Since its release, Apache Spark has seen rapid adoption due to the characteristics briefly discussed above.

Who should go for Apache Spark
Before trying to find out whether Apache Spark is for you, or whether you have the right skill set, it is important to look at the 'generality' characteristic in more depth. Apache Spark consists of Spark Core and a set of libraries. The core is the distributed execution engine, and the Java, Scala and Python APIs offer a platform for distributed ETL application development. Additional libraries built atop the core allow diverse workloads for streaming, SQL and machine learning. As Spark provides these multiple components, it is evident that Spark is developed and widely used for big data and analytics.

Professionals who should learn Apache Spark
If you aspire to one of the following professions, or simply have an interest in data and insights, knowledge of Spark will prove useful:
- Data Scientists
- Data Engineers

Prerequisites for learning Apache Spark
For most students looking for big data training, Apache Spark is the number one framework in big data, so it is important to note that there are a few prerequisites to learning it. Before getting into big data, you should have minimum knowledge of:
- One of the programming languages: core Python or Scala.
- Spark can be installed on any platform, but its framework is similar to Hadoop, so knowledge of HDFS and YARN is highly recommended. Knowledge of Hive is an added advantage but is not mandatory.
- Basic knowledge of SQL; in particular, SELECT ... FROM, joins and GROUP BY are highly recommended (see the short sketch after this list).
- Optionally, knowledge of a cloud technology like AWS, recommended for those who want to work with production-like environments.
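As a small, hedged illustration of both the SQL prerequisite and the "combine libraries seamlessly" point above, here is a minimal PySpark sketch in which the same select / group by logic is expressed once through SQL and once through the DataFrame API; the table name and columns are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-api-demo").getOrCreate()

# A tiny in-memory DataFrame standing in for a real data source.
orders = spark.createDataFrame(
    [(1, "books", 20.0), (2, "toys", 35.0), (3, "books", 15.0)],
    ["order_id", "category", "amount"],
)
orders.createOrReplaceTempView("orders")

# The same aggregation expressed through SQL...
spark.sql("SELECT category, SUM(amount) AS total FROM orders GROUP BY category").show()

# ...and through the DataFrame API; both run on the same engine.
orders.groupBy("category").sum("amount").show()
```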
System requirements of Apache Spark
The official Apache Spark site gives the following recommendations (see the site for further details).

Storage system: there are a few ways to set this up:
- Spark can run on the same nodes as HDFS. A Spark standalone cluster can be installed on the same nodes, with Spark and Hadoop memory and CPU usage configured to avoid interference; or
- Hadoop and Spark can execute under a common resource manager (e.g. YARN); or
- Spark can execute in the same local area network as HDFS but on separate nodes; or
- if the requirement is quick response and low latency from data stores, execute the compute jobs on nodes separate from the storage nodes.

Local disks: typically 4-8 disks per node, configured without RAID. If the underlying OS is Linux, mount the disks with the noatime option, and in the Spark environment configure the spark.local.dir variable to be a comma-separated list of local disks. Note: for HDFS nodes, these can be the same disks as HDFS.

Memory: from a minimum of 8 GB up to hundreds of GB of memory per machine; a common recommendation is to allocate at most 75% of the memory to Spark.

Network: a 10 Gb or faster network.

CPU cores: 8-16 cores per machine.

However, for training and learning purposes, and just to get a taste of Spark, there are two readily available options: run it locally, or use AWS EMR (or any cloud computing service). For learning purposes, a system with a minimum of 4 GB RAM and 30 GB of disk should prove enough. A minimal configuration sketch follows below.
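As a rough sketch of how a couple of those recommendations translate into configuration, the snippet below sets spark.local.dir and the driver/executor memory on a SparkSession. The directories and sizes are placeholders, not prescriptions, and note that under YARN the local directories are typically taken from the cluster manager's own settings instead.

```python
from pyspark.sql import SparkSession

# Placeholder values: pick disks and sizes to match your own hardware.
spark = (
    SparkSession.builder
    .appName("config-sketch")
    .config("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")  # comma-separated local disks
    .config("spark.executor.memory", "6g")   # e.g. roughly 75% of an 8 GB worker left for Spark
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)
```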
History of Apache Spark
Spark was primarily developed to overcome the limitations of MapReduce.

Versioning: Spark's initial version was version 0; version 1.6 is considered a stable version and is used in multiple commercial corporate projects, and version 2.3 is the latest available version at the time of writing.

MapReduce is a cluster computing paradigm that forces a particular linear data flow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results back on disk. Due to the multiple copies of data and multiple I/O passes described above, MapReduce takes a lot of time to process large volumes of data. MapReduce can do only batch processing and is unsuitable for real-time data processing; it is also unsuitable for trivial join-like transformations, unfit for large data on a network and for OLTP data, and not suitable for graph and interactive workloads. Spark overcomes all these limitations and is able to process data faster, even on local disk.

Why Apache Spark?
Numerous advantages of Spark have made it a market favourite. Let's discuss them one by one:
- Speed: it extends the MapReduce model to support computations like stream processing and interactive queries.
- A single combination of processes and multiple tools: it covers multiple workloads which, in a traditional system, used to require different distributed systems; this makes it easy to combine different processing types and to manage the tools.
- Unification: developers have to learn only one platform, unlike the multiple languages and tools of a traditional system.
- Support for different resource managers: Spark supports the Hadoop HDFS storage system and YARN for resource management, but YARN is not the only resource manager it supports; it also works on Mesos and with its own standalone scheduler.
- Support for cutting-edge innovation: Spark provides capabilities and support for an array of new-age technologies, ranging from built-in machine learning libraries and visualization tools to support for near-real-time processing (in a way the biggest challenge of the pre-Spark era), and it integrates seamlessly with deep learning frameworks such as TensorFlow. This enables Spark to provide innovative solutions for new-age use cases.

Spark can access diverse data sources and make sense of them all, and hence it is trending in the market over other cluster computing software.

Who uses Apache Spark
A few use cases of Apache Spark:

1. Analytics: Spark can be very useful when building real-time analytics from a stream of incoming data.

2. E-commerce: information about real-time transactions can be passed to streaming clustering algorithms like alternating least squares or k-means clustering. The results can be combined with data from other sources, such as social media profiles, product reviews on forums and customer comments, to enhance recommendations to customers based on new trends.
- Shopify: "At Shopify, we underwrite credit card transactions, exposing us to the risk of losing money. We need to respond to risky events as they happen, and a traditional ETL pipeline just isn't fast enough. Spark Streaming is an incredibly powerful real-time data processing framework based on Apache Spark. It allows you to process real-time streams like Apache Kafka using Python with incredible simplicity."
- Alibaba: "Alibaba Taobao operates one of the world's largest e-commerce platforms. We collect hundreds of petabytes of data on this platform and use Apache Spark to analyze these enormous amounts of data."

3. Healthcare industry: healthcare has multiple use cases for unstructured data that must be processed in real time, ranging from image formats such as scans to specific medical industry standards and wearable tracking devices. Many healthcare providers are keen to use Spark on patient records to build a 360-degree view of the patient and make accurate diagnoses.
- MyFitnessPal: MyFitnessPal needed to deliver a new feature called "Verified Foods". The feature demanded a faster pipeline to execute a number of highly sophisticated algorithms; their legacy non-distributed Java-based data pipeline was slow, did not scale, and lacked flexibility.

Here are a few other examples from industry leaders:
- Regeneron: the future of drug discovery with genomics at scale, powered by Spark.
- Zeiss: using Spark Structured Streaming for predictive maintenance.
- Devon Energy: scaling geographic analytics with Spark GraphX.

You can also learn more about use cases of Apache Spark online.

Career benefits
Career benefits of Spark for you as an individual: Apache Spark developers earn among the highest average salaries of all programmers. In its 2015 Data Science Salary Survey, O'Reilly found strong correlations between professionals who used Apache Spark and those who were paid more; in one of its models, using Spark added more than $11,000 to the median salary. If you are considering switching to this extremely in-demand career, then taking up Apache Spark training will be an added advantage: learning Spark will give you a steep competitive edge and can land you some of the market's best-paying jobs with top companies.
Spark has gained enough adherents over the years to place it high on the list of fastest-growing skills; data scientists and sysadmins have evaluated the technology and clearly liked what they saw. April's Dice Report explored the fastest-growing technology skills, based on an analysis of job postings and data from Dice's annual salary survey, with growth measured as year-over-year increases in job postings.

Benefits of implementing Spark in your organization: Apache Spark is now a decade old but still going strong. Due to its lightning-fast processing and the numerous other advantages discussed so far, Spark is still the first choice of many organizations. Spark is considered to be the most popular open-source project on the planet, with more than 1,000 contributors from 250-plus organizations, according to Databricks.

Conclusion
To sum up, Spark helps to simplify the computationally intensive task of processing high volumes of real-time or batch data. It can seamlessly integrate with complex capabilities such as machine learning and graph algorithms. In short, Spark brings Big Data processing, which was earlier exclusive to giant companies like Google, to the masses. Do let us know how your learning experience was through the comments below. Happy learning!

Apache Kafka Vs Apache Spark: Know the Differences

"A new breed of 'Fast Data' architectures has evolved to be stream-oriented, where data is processed as it arrives, providing businesses with a competitive advantage." - Dean Wampler (renowned author of many big data technology-related books)

Dean Wampler makes an important point in one of his webinars: the demand for stream processing is increasing every day. The main reason is that processing mere volumes of data is not sufficient; processing data at faster rates and drawing insights from it in real time is essential, so that an organization can react to changing business conditions in real time. Hence there is a need to understand the concept of "stream processing" and the technology behind it.

So, what is stream processing? Think of streaming as an unbounded, continuous, real-time flow of records; processing these records within a similar timeframe is stream processing. AWS (Amazon Web Services) defines streaming data as data that is generated continuously by thousands of data sources, which typically send the data records simultaneously and in small sizes (of the order of kilobytes). This data needs to be processed sequentially and incrementally, on a record-by-record basis or over sliding time windows, and used for a wide variety of analytics including correlations, aggregations, filtering and sampling. In the stream processing method, continuous computation happens as the data flows through the system. Stream processing is highly beneficial if the events you wish to track happen frequently and close together in time; it is also the best fit if the event needs to be detected right away and responded to quickly.

There is a subtle difference between stream processing, real-time processing (near real-time) and complex event processing (CEP). Let's quickly look at examples to understand the differences:
- Stream processing: useful for tasks like fraud detection and cybersecurity. If transaction data is stream-processed, fraudulent transactions can be identified and stopped before they are even complete.
- Real-time processing: if event time is very relevant and latencies in the range of seconds are completely unacceptable, it is called real-time (near real-time) processing, e.g. a flight control system for space programs.
- Complex event processing (CEP): CEP utilizes event-by-event processing and aggregation (for example, on potentially out-of-order events from a variety of sources, often with large numbers of rules or business logic).

We have multiple tools available to accomplish the above-mentioned stream, real-time or complex event processing: Spark Streaming, Kafka Streams, Flink, Storm, Akka and Structured Streaming, to name a few. We will try to understand Spark Streaming and Kafka Streams in depth further in this article, as historically these have occupied significant market share.

Apache Kafka Streams
Kafka is actually a message broker with really good performance, so that all your data can flow through it before being redistributed to applications; Kafka works as a data pipeline. Typically, Kafka Streams supports per-second stream processing with millisecond latency. Kafka Streams is a client library for processing and analyzing data stored in Kafka. Kafka Streams can process data in two ways:

- Kafka -> Kafka: when Kafka Streams performs aggregations, filtering etc. and writes the data back to Kafka, it achieves amazing scalability, high availability and high throughput, if configured correctly. It also does not do micro-batching; this is "real streaming".
- Kafka -> external systems ("Kafka -> database" or "Kafka -> data science model"): typically, any streaming library (Spark, Flink, NiFi etc.) uses Kafka as the message broker. It reads the messages from Kafka and then breaks them into mini time windows to process them further.

Representative view of Kafka streaming: sources here could be event logs, webpage events and so on; the database or models would be accessed via some other streaming application, which in turn uses Kafka Streams.

Kafka Streams is built upon important stream processing concepts such as properly distinguishing between event time and processing time, windowing support, and simple (yet efficient) management of application state. It is based on many concepts already contained in Kafka, such as scaling by partitioning. For this reason it comes as a lightweight library that can be integrated into an application, and the application can then be operated as desired:
- standalone, in an application server;
- as a Docker container; or
- directly, via a resource manager such as Mesos.

Why one will love using the dedicated Apache Kafka Streams library:
- Elastic, highly scalable and fault-tolerant
- Deploys to containers, VMs, bare metal or the cloud
- Equally viable for small, medium and large use cases
- Fully integrated with Kafka security
- Lets you write standard Java and Scala applications
- Exactly-once processing semantics
- No separate processing cluster required
- Develop on Mac, Linux or Windows

Apache Spark Streaming
Spark Streaming receives live input data streams, collects data for some time, builds an RDD, and divides the data into micro-batches, which are then processed by the Spark engine to generate the final stream of results, also in micro-batches. Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs; think of an RDD as the underlying concept for distributing data over a cluster of computers.

Why one will love using Apache Spark Streaming:
- It makes it very easy for developers to use a single framework to satisfy all their processing needs. They can use MLlib (Spark's machine learning library) to train models offline and use them directly online for scoring live data in Spark Streaming; in fact, some models perform continuous, online learning and scoring.
- Not all real-life use cases need data to be processed in true real time; a delay of a few seconds is tolerated in exchange for a unified framework like Spark Streaming and high-volume data processing.
- It provides a range of capabilities by integrating with other Spark tools to do a variety of data processing.

A short illustrative sketch of the Kafka-to-Spark flow follows.
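The article above describes the classic DStream API; as an alternative illustration of the same "Kafka -> Spark" flow, here is a minimal sketch using the Structured Streaming API (which the article also lists among the streaming tools). It assumes the spark-sql-kafka connector package is available on the classpath and that a broker at localhost:9092 has a topic named "events"; all of these are placeholder assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-spark-demo").getOrCreate()

# Read a stream of records from a (hypothetical) Kafka topic.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

# Count events per one-minute window; Spark processes these in micro-batches.
counts = (
    events.select(F.col("timestamp"), F.col("value").cast("string").alias("value"))
    .groupBy(F.window("timestamp", "1 minute"))
    .count()
)

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```

The grouping over a one-minute window reflects the batched, window-oriented style of processing the article attributes to Spark, in contrast to Kafka Streams' record-at-a-time model.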
Spark Streaming vs Kafka Streams
Now that we have understood at a high level what these tools are, it is natural to wonder about the differences between the two. The following comparison briefly explains the key differences:
1. Processing model: Spark Streaming divides the data received from live input streams into micro-batches for processing; Kafka Streams processes each record of the stream individually (true real time).
2. Cluster: Spark Streaming requires a separate processing cluster; Kafka Streams does not.
3. Scaling: Spark Streaming needs reconfiguration to scale, whereas Kafka Streams scales easily by just adding Java processes, with no reconfiguration required.
4. Delivery semantics: Spark Streaming provides at-least-once semantics; Kafka Streams provides exactly-once semantics.
5. Strengths: Spark Streaming is better at processing groups of rows (group-by, ML, window functions, etc.), while Kafka Streams provides true record-at-a-time processing and is better suited to functions such as row parsing and data cleansing.
6. Packaging: Spark Streaming is a standalone framework, whereas Kafka Streams can be used as part of a microservice, as it is just a library.

Kafka Streams use cases
The following are a couple of the many industry use cases where Kafka Streams is being used:
- The New York Times: The New York Times uses Apache Kafka and Kafka Streams to store and distribute, in real time, published content to the various applications and systems that make it available to readers.
- Pinterest: Pinterest uses Apache Kafka and Kafka Streams at large scale to power the real-time, predictive budgeting system of its advertising infrastructure. With Kafka Streams, spend predictions are more accurate than ever.
- Zalando: As the leading online fashion retailer in Europe, Zalando uses Kafka as an ESB (Enterprise Service Bus), which helps it transition from a monolithic to a microservices architecture. Using Kafka for processing event streams enables its technical team to do near-real-time business intelligence.
- Trivago: Trivago is a global hotel search platform focused on reshaping the way travellers search for and compare hotels, while enabling hotel advertisers to grow their businesses by providing access to a broad audience of travellers via its websites and apps. As of 2017 it offered access to approximately 1.8 million hotels and other accommodations in over 190 countries. Trivago uses Kafka, Kafka Connect and Kafka Streams to let its developers access data freely within the company; Kafka Streams powers parts of its analytics pipeline and delivers endless options to explore and operate on the data sources at hand.

Broadly, Kafka Streams is suitable for microservices-integration use cases and offers wider flexibility.

Spark Streaming use cases
The following are a couple of the many industry use cases where Spark Streaming is being used:
- Booking.com: Booking.com uses Spark Streaming to build online machine learning (ML) features for real-time prediction of user behaviour and preferences and of hotel demand, and to improve customer-support processes.
- Yelp: Yelp's ad platform handles millions of ad requests every day. To generate ad metrics and analytics in real time, it built its ad event tracking and analysis pipeline on top of Spark Streaming. This allows Yelp to manage a large number of active ad campaigns, greatly reduce over-delivery and share ad metrics with advertisers in a more timely fashion.

Spark Streaming's ever-growing user base includes household names like Uber, Netflix and Pinterest. Broadly, Spark Streaming is suitable for requirements that involve batch processing of massive datasets and bulk processing, and for use cases that go beyond pure data streaming.
Dean Wampler beautifully explains the factors to evaluate when choosing a tool for a use case, as summarised below:

1. Latency tolerance (response-time window and typical use case requirement):
- Picoseconds to microseconds (true real time): flight control systems for space programs, etc.
- Under 100 microseconds: regular stock-market trading transactions, medical diagnostic equipment output
- Under 10 milliseconds: the credit-card verification window when a consumer buys something online
- Under 100 milliseconds: dashboards requiring human attention, machine learning models
- Under 1 second to minutes: machine learning model training
- 1 minute and above: periodic short jobs (typical ETL applications)

2. Velocity (transaction/event frequency): for example, around one million events per second from devices such as the Nest thermostat, with big spikes during specific time periods.

3. Types of data processing: the processing requirement may be SQL, ETL, dataflow, or training and/or serving machine learning models, and it may involve either bulk data processing or individual event/transaction processing.

4. Use of the tool (flexibility of implementation): Kafka is flexible, as it is provided as a library, whereas Spark is less flexible because it is part of a distributed framework.

Conclusion
Kafka Streams is still best used in a 'Kafka -> Kafka' context, while Spark Streaming can be used for a 'Kafka -> Database' or 'Kafka -> Data science model' type of context. When the two technologies are connected, however, they bring together complete data collection and processing capabilities; such combined pipelines are widely used in commercial systems and occupy a significant share of the market. A rough sketch of such a connected pipeline follows.
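As an illustration of a connected 'Kafka -> Spark -> downstream' pipeline, here is a hedged Scala sketch that consumes a Kafka topic with Spark Streaming. It assumes the spark-streaming-kafka-0-10 integration and a broker on localhost; the topic name, group id and aggregation are invented for the example:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaToSparkPipeline {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-spark").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Consumer settings for the Kafka source; broker address and group id are assumptions.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "spark-pipeline",
      "auto.offset.reset"  -> "latest"
    )

    // Subscribe to a (hypothetical) topic, e.g. one that a Kafka Streams job writes to.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("large-payments"), kafkaParams)
    )

    // Aggregate each micro-batch by key; in a real pipeline this result would be
    // written to a database or fed to a data science model rather than printed.
    stream
      .map(record => (record.key, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```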
What is Big Data — An Introductory Guide

The massive world of Big Data
If one strolls around any IT office, every decade or so (nowadays the time span is even shorter, almost every 3-4 years) one overhears professionals discussing new jargon from the hottest trends in technology. Around 5-6 years ago, one such term that started ruling IT services was 'big data', and it is still interpreted in various ways by everyone from laymen to tech geeks. Although the services industry started talking widely about big data solutions only 5-6 years ago, the term is believed to have been in use since the 1990s by John Mashey of Silicon Graphics, while credit for coining 'big data' in its modern sense goes to Roger Mougalas of O'Reilly Media in 2005.

Let's first understand why everyone is going gaga about 'big data' and what real-world problems it is supposed to solve, and then we will try to answer the what and how of it.

Why is Big Data essential for today's digital world?
In the pre-smartphone era, the internet and the web had been around for many years, but smartphones made them mobile, with on-the-go usage. Social media and mobile apps started generating tons of data. At the same time, smart bands and wearable devices (IoT, M2M) gave data generation newer dimensions. This newly generated data became the new oil: if it is stored and analyzed, it has the potential to give tremendous insights that can be put to use in numerous ways.

The real-world use cases of big data are remarkable. Every industry has unique use cases, often unique even to each client implementing a solution, ranging from data-driven personalized campaigning (you do see that item you browsed on some 'xyz' site while scrolling Facebook; ever wondered how?) to predictive maintenance of huge oil pipelines crossing countries, where manual monitoring is practically impossible. To relate this to day-to-day life, every click, swipe, share and like we casually make on social media helps today's industries take calculated business decisions about the future. How do you think Netflix predicted the success of 'House of Cards' and spent $100 million on it? Big data analytics is the simple answer.

The biggest challenge in the past was that the traditional methods used to store, curate and analyze data had limitations: they could not handle data that came from newer, heterogeneous sources, was huge in volume and was being generated really fast (to give you an idea, roughly 2.5 quintillion bytes of data are generated per day today; refer to the infographic released by Domo called "Data Never Sleeps 5.0"). This gave rise to the term big data and the solutions around it.

Understanding Big Data: the experts' viewpoint
Big data literally means massive data (loosely, more than 1 TB), but that is not its only aspect.
Distributed data, or even complex datasets that cannot be analyzed through traditional methods, can be categorized as big data, and with this background the theoretical definition makes a lot of sense. Gartner (2012) defines it as follows: "Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation."

Data possessing the characteristics of big data exhibits the 3 Vs: Variety, Velocity and Volume. Due to the changing nature of data in today's world, and to gain the most insight from it, three more Vs have been added to the definition of big data: Variability, Veracity and Value. (Diagram: the 6 Vs of Big Data.)

These 6 Vs help in understanding the characteristics of big data, but let's also understand the types of data involved in big data processing. The "Variety" characteristic covers the different types of data that can be processed with big data tools and technologies. Let's drill down a bit to understand what those are:
- Structured, e.g. mainframes and traditional databases like Teradata, Netezza, Oracle, etc.
- Unstructured, e.g. tweets, Facebook posts, emails, etc.
- Semi-structured / multi-structured or hybrid, e.g. e-commerce, demographic and weather data, etc.

As technology advances, a wider variety of data becomes available, and its storage, processing and analysis are made possible by big data; traditional data processing techniques were able to process only structured data.

Now that we understand what big data is and the limitations of the old, traditional techniques in handling such data, we can safely say that we need new technology to handle this data and gain insights from it. But before going further, do you know what the traditional data management techniques were?

Traditional techniques of data processing:
- RDBMS (Relational Database Management System)
- Data warehousing and data marts

At a high level, RDBMS catered to OLTP needs while data warehousing and data marts facilitated OLAP needs, but both systems work only with structured data.

I hope one can now answer 'what is big data?' both conceptually and theoretically. So it is time to understand how it is done in actual implementations. Merely storing big data will not help organizations; what is important is to turn data into insights and business value, and to do so the following are the key infrastructure elements:
- Data collection
- Data storage
- Data analysis, and
- Data visualization/output

All major big data processing framework offerings are based on these building blocks. In alignment with them, the following are the top five big data processing frameworks currently being used in the market:

1. Apache Hadoop: the Hadoop software library is a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models. First up is the all-time classic, and one of the top frameworks in use today; so prevalent is it that it has almost become synonymous with big data.
2. Apache Spark: a unified analytics engine for large-scale data processing. Apache Spark and Hadoop are often contrasted as an "either/or" choice, but that isn't really the case.

The above two frameworks are the most popular, but the following three are comparable and also available:

3. Apache Storm: a free and open-source distributed real-time computation system. You can also take up Apache Storm training to learn more about it.
4. Apache Flink: a streaming dataflow engine that aims to provide facilities for distributed computation over streams of data. By treating batch processing as a special case of streaming, Flink is effectively both a batch and a real-time processing framework, but one that clearly puts streaming first.
5. Apache Samza: a distributed stream processing framework.

Frameworks process data through these building blocks and generate the required insights, and each framework is supported by a whopping number of tools that provide the required functionality.

Big Data processing frameworks and the technology landscape
The big data tools and technology landscape can be better understood through a layered big data architecture; the article by Navdeep Singh Gill on XenonStack is a good read on the layered architecture of big data. Taking inspiration from that layered architecture, the tools available in the market can be mapped to layers to understand the landscape in depth. Note that the layered architecture fits very well with the infrastructure elements/building blocks discussed in the previous section. A few of the tools are briefly described below:

1. Data collection / ingestion layer
- Cassandra: a free and open-source, distributed, wide-column-store NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure
- Kafka: an event streaming platform used for building real-time data pipelines and streaming apps
- Flume: a log collector in Hadoop
- HBase: a columnar database in Hadoop

2. Processing layer
- Pig: a scripting language in the Hadoop framework
- MapReduce: the processing model in Hadoop

3. Data query layer (a small usage sketch follows this list)
- Impala: Cloudera Impala is a modern, open-source, distributed SQL query engine for Apache Hadoop (often compared with Hive)
- Hive: data warehouse software for data query and analysis
- Presto: a high-performance, distributed SQL query engine for big data whose architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Apache Kafka and MongoDB

4. Analytical engine
- TensorFlow: an open-source machine learning library for research and production

5. Data storage layer
- Ignite: an open-source distributed database, caching and processing platform designed to store and compute on large volumes of data across a cluster of nodes
- Phoenix: Apache Phoenix is an open-source, massively parallel, relational database engine supporting OLTP for Hadoop, using Apache HBase as its backing store
- PolyBase: a feature introduced in SQL Server 2016 that is used to query relational and non-relational (NoSQL) databases; you can use PolyBase to query tables and files in Hadoop or in Azure Blob Storage, and also to import or export data to/from Hadoop
- Sqoop: an ETL tool
- Big data in Excel: a few people like to process big datasets with current Excel capabilities, which is known as Big Data in Excel

6. Data visualization layer
- Microsoft HDInsight: Azure HDInsight is a Hadoop service offering hosted in Azure that enables clusters of managed Hadoop instances. It deploys and provisions Apache Hadoop clusters in the cloud, providing a software framework designed to manage, analyze and report on big data with high reliability and availability.
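To give a feel for how the data query layer mentioned above is used in practice, here is a small, hypothetical Spark SQL sketch in Scala; the file name and columns are made up for the example, and engines such as Hive, Impala and Presto expose the same SQL-over-distributed-data idea through their own interfaces:

```scala
import org.apache.spark.sql.SparkSession

object QueryLayerDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("query-layer-demo")
      .master("local[*]")          // assumed local run for the sketch
      .getOrCreate()

    // Load a (hypothetical) clickstream file and expose it as a SQL table,
    // which is the same idea query-layer engines implement at cluster scale.
    val clicks = spark.read.option("header", "true").csv("clicks.csv")
    clicks.createOrReplaceTempView("clicks")

    spark.sql(
      """SELECT page, COUNT(*) AS visits
        |FROM clicks
        |GROUP BY page
        |ORDER BY visits DESC""".stripMargin
    ).show()

    spark.stop()
  }
}
```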
Hadoop administration training will give you all the technical understanding required to manage a Hadoop cluster, whether in a development or a production environment.

Best practices in Big Data
Every organization, industry and business, small or big, wants to benefit from big data, but it is essential to understand that big data can reach its maximum potential only if the organization adheres to best practices before adopting it. Answering five basic questions helps clients know whether their organization really needs big data:
- Try to answer why big data is required for the organization. What problem would it help solve?
- Ask the right questions.
- Foster collaboration between business and technology teams.
- Analyze only what is required.
- Start small and grow incrementally.

Big Data industry use cases
We have talked about many things in the big data world except its real use cases. We discussed a few at the start, but let me give you insights into real-world, interesting big data use cases; for a few of them, it is no longer a secret ☺. In fact, big data has penetrated to the extent that you can name any industry and plenty of use cases can be listed. Let's begin.

1. Streaming platforms
As the 'House of Cards' example at the start of the article suggests, it is no secret that Netflix uses big data analytics. Netflix spent $100 million on 26 episodes of 'House of Cards' because it knew the show would appeal to viewers of the original British House of Cards and built in director David Fincher and actor Kevin Spacey. Netflix routinely collects behavioural data and uses it to create a better experience for the user. But Netflix uses big data for more than that: it monitors and analyzes traffic details for various devices, spots problem areas and adjusts network infrastructure to prepare for future demand (the latter being the action taken out of big data analytics, i.e. how the analysis is put to use). It also tries to gain insights into the types of content viewers prefer, which helps it make informed decisions. Apart from Netflix, Spotify is another well-known use case.

2. Advertising, media, campaigning and entertainment
For decades, marketers were forced to launch campaigns while blindly relying on gut instinct and hoping for the best. That all changed with digitization and the big data world. Nowadays, data-driven campaigns and marketing are on the rise, and to succeed in this landscape a modern marketing campaign must integrate a range of intelligent approaches to identify customers, segment them, measure results, analyze data and build upon feedback in real time. All of this must be done in real time, taking into account the customer's profile, history, purchasing patterns and other relevant information, and big data solutions are the perfect fit. Event-driven marketing, another route to successful marketing in today's world, can also be achieved through big data: it basically means keeping track of the events a customer is directly or indirectly involved in and campaigning exactly when the customer would need it, rather than running random campaigns. For example, if you have searched for a product on Amazon or Flipkart, you will see related advertisements on other social media apps you casually browse; bang on, you end up purchasing it because you needed the best options to choose from anyway.

3. Healthcare industry
Healthcare is one of the classic use-case industries for big data applications.
The industry generates a huge amount of data: patients' medical histories, past records, treatments given, the latest available medicines, the latest medical research; the list of raw data is endless. All this data can yield insights, and big data can contribute to the industry in the following ways:
- Diagnosis time can be reduced and exactly the required treatment started immediately. Most illnesses can be treated if the diagnosis is accurate and treatment starts in time. This can be achieved by giving the treating doctor evidence-based past medical data for similar treatments and the patient's available history, and by feeding symptoms into the system in real time.
- A government health department can monitor whether a group of people in one geography is reporting similar symptoms; since the cause of such illness could be the same, predictive measures can be taken in nearby locations to avoid an outbreak.
The list is long; the above are just a few representative examples.

4. Security
Due to the explosion of social media, personal information is at stake today. Almost everything is digital and the majority of personal information is available in the public domain, so privacy and security are major concerns. The following are a few big data applications in this area:
- Cybercrimes are common nowadays, and big data can help detect and predict crimes.
- Threat analysis and detection can be done with big data.

5. Travel and tourism
Flight booking sites and IRCTC track clicks and hits along with IP addresses, login information and other details, and based on demand can apply dynamic pricing to flights and trains. Big data enables this dynamic pricing and, mind you, it happens in real time. I am sure each one of us has experienced this; now you know who is doing it :D

Telecommunications, the public sector, education, social media and gaming, energy and utilities: every industry has implemented, or is implementing, several of these big data use cases day in and day out. If you look around, I am sure you will find them on the rise. Big data is helping industries, consumers and clients alike to make informed decisions, and wherever there is such a need, big data can come in handy.

Challenges faced by Big Data in real-world adoption
Although the world is going gaga about big data, there are still a few challenges to implementing and adopting it, and service industries are still striving to resolve these challenges so that big data solutions can be implemented without flaws. An October 2016 report from Gartner found that organizations were getting stuck at the pilot stage of their big data initiatives: "Only 15 percent of businesses reported deploying their big data project to production, effectively unchanged from last year (14 per cent)," the firm said. Let's discuss a few of these challenges to understand what they are.

1. Understanding big data and answering 'why' for the organization one is working with
As stated at the start of the article, there are many interpretations of big data, and understanding the real use cases for the organization that decision makers are working with is still a challenge. Everyone wants to ride the wave, but not knowing the right path remains a struggle. Every organization is unique, so it is of the utmost importance to answer 'why big data' for each organization; this remains a major hurdle for decision makers adopting big data.
2. Understanding the organization's data sources
In today's world there are hundreds and thousands of ways in which information is generated, and being aware of all these sources and ingesting them into big data platforms is essential for accurate insights. Identifying sources is therefore a challenge to address. It is no surprise, then, that the IDG report found: "Managing unstructured data is growing as a challenge – rising from 31 per cent in 2015 to 45 per cent in 2016." Different tools and technologies are emerging to address this challenge.

3. Shortage of big data talent and retaining it
Big data is a fast-changing technology area with a whopping number of tools in its landscape. Big data professionals are expected to excel in the current tools and keep themselves up to date with ever-changing needs, which makes it difficult for employees and employers alike to create and retain talent within the organization. The solution is constant upskilling, re-skilling and cross-skilling, along with increasing the organization's budget for retaining talent and helping people train.

4. The Veracity V
This V is a challenge because it refers to inconsistent and incomplete data. To gain insights through a big data model, a key step is to predict and fill in missing information. This is the tricky part, as filling in missing information can reduce the accuracy of the resulting insights and analytics. A number of tools exist to address this concern, and data curation is an important step that should follow a proper model; but keep in mind that big data is never 100% accurate, and one must deal with that.

5. Security
This aspect is often given low priority during the design and build phases of big data implementations, and security loopholes can cost an organization dearly, so it is essential to put security first when designing and developing big data solutions. It is equally important to implement them responsibly with respect to regulatory requirements such as GDPR.

6. Gaining valuable insights
Machine learning models go through multiple iterations to arrive at insights, because they too face issues such as missing data, which affects accuracy. Increasing accuracy requires a lot of re-processing, which has its own lifecycle. Improving the accuracy of insights is therefore a challenge, and it is closely related to the missing-data problem, so addressing that challenge largely addresses this one. Inaccuracy can also be caused by the unavailability of information from some data sources; incomplete information leads to incomplete insights, which may not deliver the expected value. Addressing the challenges discussed here helps in gaining valuable insights from the available solutions.

With big data, the opportunities are endless. Once understood, the world is yours!

Also, now that you understand big data, it is worth understanding the next steps. Gary King, a professor at Harvard, says: "Big data is not about the data. It is about the analytics." You can also take up Big Data and Hadoop training to enhance your skills further.

Did this article help you understand today's massive world of big data and get a sneak peek into it? Do let us know through the comments section below.