One of the most important decisions for big data learners and beginners is choosing the best programming language for big data manipulation and analysis. Understanding the business problem and choosing the right model is not enough; implementing it well is equally important, and choosing the right language (or languages) for the problem goes a long way.
If you search Google for the most effective programming languages for Big Data, you will find the following four at the top:
Java is the oldest of the four languages listed here. Traditional big data frameworks such as Apache Hadoop, and most of the tools within its ecosystem, are Java-based, so using Java opens up the possibility of utilizing the large ecosystem of tools in the big data world.
Scala is a beautiful crossover between an object-oriented and a functional programming language; its name is short for "scalable language". Scala was created by the German computer scientist Martin Odersky, and its first version was released in 2003.
Python was originally conceived by Guido van Rossum in the late 1980s. It was initially designed as a successor to the ABC programming language and later gained popularity as a general-purpose language in the big data world. Python was declared one of the fastest-growing programming languages in the 2018 Stack Overflow Developer Survey. Many data analysis, manipulation, machine learning, and deep learning libraries are available in Python, which has driven its popularity in the big data ecosystem. It is a very user-friendly language, and that is its biggest advantage.
Python is not named after the snake. It is named after the British comedy series Monty Python's Flying Circus.
R is the language of statistics: a language and environment for statistical computing and graphics. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. R is named partly after the first names of its first two authors and partly as a play on the name of S. The project was conceived in 1992, with an initial version released in 1995 and a stable beta version in 2000.
S is a statistical programming language developed primarily by John Chambers, and R is an implementation of S combined with lexical scoping semantics inspired by Scheme.
Every framework is implemented in some underlying programming language: Zend uses PHP and pandas uses Python; similarly, the Hadoop framework uses Java and Spark uses Scala.
However, Spark officially supports all four languages: Java, Scala, Python, and R. If you browse Apache Spark's official documentation, you will find other languages used by the open-source community for Spark implementations as well.
When a developer starts learning Spark, the first question they stumble upon is: out of this pool of languages, which one should they use and master? Solution architects have a tough time choosing the right language for the Spark framework, and organizations are left wondering which skill sets are relevant to their problem when they lack the right knowledge of these languages in the context of Spark.
This article will try to answer all these queries. So let's start.
Java is the oldest and most widely adopted of the four. There are a number of features and advantages that make Java a favourite of big data developers and tool creators:
Java is the default choice for most big data projects, but for the Spark framework one has to ponder whether Java is the best fit.
One major drawback of Java is its verbosity: one has to write many lines of code to achieve even simple functionality.
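To make the verbosity point concrete, here is word count, the canonical big data example, in a few lines of Python; the classic Hadoop WordCount in Java needs a `Mapper` class, a `Reducer` class, and a `main` method, running to dozens of lines of boilerplate. (This snippet is an illustration of language conciseness, not Spark or Hadoop code.)

```python
# Word count in a few lines of Python. The equivalent classic Java
# MapReduce WordCount requires a Mapper class, a Reducer class and a
# main() driver, typically ~60 lines of ceremony.
from collections import Counter

def word_count(text: str) -> dict:
    """Return a word -> frequency mapping for the given text."""
    return dict(Counter(text.lower().split()))

print(word_count("to be or not to be"))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```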
Java has also historically lacked a Read-Evaluate-Print Loop (REPL); JShell only arrived with Java 9. The lack of an interactive shell is a major deal-breaker when choosing a programming language for big data processing.
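For readers new to the term, a REPL reads an expression, evaluates it, prints the result, and loops. Below is a minimal, purely illustrative sketch of that cycle in Python; real REPLs such as `python`, `spark-shell`, and the R console add history, completion, and error recovery on top.

```python
# A minimal read-evaluate-print loop (REPL) sketch.
# For the demo we drive the loop from a list instead of stdin.

def evaluate(expression: str) -> str:
    """Evaluate one expression and return its printed form."""
    # eval() is fine for a self-contained demo; never use it on untrusted input.
    return repr(eval(expression, {}))

def repl(lines):
    """Read each line, evaluate it, print the result, and loop."""
    outputs = []
    for line in lines:           # Read
        result = evaluate(line)  # Evaluate
        print(result)            # Print
        outputs.append(result)   # ...and Loop
    return outputs

print(repl(["1 + 1", "[x * x for x in range(4)]"]))
```

This interactive feedback cycle is exactly what makes exploratory data work so much faster in Python, R, and Scala's shell than in pre-JShell Java.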
Scala is comparatively new to the programming scene but has become popular very quickly. In the Spark context, many experts prefer Scala over other programming languages because Spark itself is written in Scala: it is Spark's native language, which means any new API is typically available in Scala first.
Scala is a hybrid language: it combines the features of object-oriented and functional programming. As an object-oriented language, it treats every value as an object, and all OOP concepts apply. As a functional language, it treats functions as first-class values, encourages expressing operations as functions, and favours immutability. Scala code is compiled to JVM bytecode.
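The hybrid idea can be illustrated in Python, which also mixes the two paradigms: the sketch below bundles data and behaviour in a class (the object-oriented side) while passing the method around as a first-class function to `map` and folding with `reduce` (the functional side). The `Measurement` class is purely illustrative, not from any library.

```python
# Python analogue of the object-oriented + functional hybrid style.
from functools import reduce

class Measurement:
    """Object-oriented side: data and behaviour bundled together."""
    def __init__(self, celsius: float):
        self.celsius = celsius

    def to_fahrenheit(self) -> float:
        return self.celsius * 9 / 5 + 32

# Functional side: operations expressed as functions over values.
readings = [Measurement(0.0), Measurement(100.0)]
fahrenheit = list(map(Measurement.to_fahrenheit, readings))  # map a method as a function
total = reduce(lambda acc, x: acc + x, fahrenheit, 0.0)      # fold the results

print(fahrenheit)  # [32.0, 212.0]
print(total)       # 244.0
```

In Scala the same style is idiomatic and pervasive, which is one reason Spark's transformation API (`map`, `filter`, `reduce`) feels so natural there.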
Scala and Java are both popular programming languages that run on the JVM, which makes them framework-friendly. Scala can loosely be seen as a more concise, feature-rich evolution of Java.
Python is one of the de facto languages of data science. It is a simple, open-source, general-purpose language that is very easy to learn, with a rich set of libraries, utilities, and ready-to-use features, and support for a number of mature machine learning, big data processing, and visualization libraries.
When using user-defined functions (UDFs) or third-party libraries in Python with Spark, processing is slower, because extra serialization and inter-process work is involved: these functionalities have no equivalent native Java/Scala API.
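To illustrate that overhead conceptually, and without requiring a Spark installation, the pure-Python sketch below uses `pickle` to stand in for the serialization boundary between the JVM and a separate Python worker process. The row format and the `python_udf` function are illustrative assumptions, not Spark's actual internals.

```python
# Conceptual sketch: when a Python UDF runs under Spark, rows are
# serialized out of the JVM, processed in a separate Python worker,
# and serialized back. pickle stands in for that boundary here.
import pickle

rows = [{"id": i, "value": i * 1.5} for i in range(5)]

def python_udf(row):
    # The user-defined logic that has no JVM-native equivalent.
    return {**row, "value": row["value"] * 2}

wire_in = pickle.dumps(rows)                              # JVM -> Python worker
processed = [python_udf(r) for r in pickle.loads(wire_in)]
wire_out = pickle.dumps(processed)                        # Python worker -> JVM

result = pickle.loads(wire_out)
print(result[1]["value"])  # 3.0
```

Every round trip through this boundary costs CPU time and memory; a built-in Spark SQL function, by contrast, would run entirely inside the JVM with no such crossing.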
R is the favourite language of statisticians; it is fondly called the language of statistics. It is popular for research, plotting, and data analysis, and together with RStudio it makes a killer statistics, plotting, and data analytics application.
R is primarily used for building data models for data analysis.
With these four languages introduced, let's now compare them in the context of the Spark framework:
These languages can be broadly categorized into two buckets based on high-level Spark architecture support: JVM languages (Java and Scala) and non-JVM languages (Python and R). Because of this split, performance can vary. Let's look at the architecture in a little more depth to understand the performance implications of each language; this will also help answer the question of when to use which.
An application written in any of these languages is submitted to the driver node, and the driver distributes the workload by dividing the execution across multiple worker nodes.
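As a rough, hypothetical model of that split (not Spark's real scheduler), the sketch below has a "driver" partition the data and "workers" process partitions in parallel before the partial results are combined; all names here are illustrative.

```python
# Toy model of the driver/worker split: the "driver" divides the job
# into partitions, "workers" process them in parallel, and the driver
# combines the partial results.
from concurrent.futures import ThreadPoolExecutor

def worker_task(partition):
    """Runs on a worker node: process one slice of the data."""
    return sum(x * x for x in partition)

def driver(data, num_workers=4):
    """Runs on the driver node: partition, distribute, combine."""
    size = max(1, len(data) // num_workers)
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(worker_task, partitions))
    return sum(partial_results)  # combine on the driver

print(driver(list(range(10))))  # sum of squares 0..9 = 285
```

The key point for what follows is the boundary: whether the code running on each side of that driver/worker split is JVM-native or not determines how much conversion and serialization happens in between.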
Consider applications that are JVM-compatible (Java/Scala). Since Spark itself is written in Scala, a JVM language, no explicit conversion is required at any point to execute JVM-compatible applications on Spark. This is what makes native-language applications faster on the Spark framework.
There are multiple scenarios for applications written in Python or R:
The Python/R driver talks to the JVM driver through a socket-based API. On the driver node, both driver processes are invoked when the application is written in a non-JVM language.
Scenario 1: Applications for which an equivalent Java/Scala driver API exists. These execute the same way as JVM-compatible applications, by invoking the Java API on the driver node itself. The cost of inter-process communication through sockets is negligible, so performance is comparable. This assumes that data processed on the worker nodes is not sent back to the driver.
Scenario 1(b): If the assumption in Scenario 1 does not hold, i.e. data processed on the worker nodes must be sent back to the driver, then significant serialization overhead is incurred. This adds processing time, and performance deteriorates.
Scenario 2: Applications for which no equivalent Java/Scala driver API exists, e.g. UDFs (user-defined functions) or third-party Python libraries. In such cases, additional executor processes are started on the worker nodes: the Python code is serialized to the worker and executed there. These Python worker processes run in addition to the JVM executors, and the coordination between them is overhead. The processes also compete for resources, which adds memory contention.
In addition, if the data must be sent back to the driver node, processing takes considerably longer, and the problem grows as data volume increases, making performance an even bigger concern.
Having covered performance, let's look at a tabular comparison of these languages.
| Attribute | Java | Scala | Python | R |
|---|---|---|---|---|
| Performance | Faster | Faster (about 10x faster than Python) | Slower | Slower |
| Learning curve | Easier than Scala, tougher than Python | Steeper learning curve than Java and Python | Easiest | Moderate |
| User groups | Web/Hadoop programmers | Big data programmers | Beginners and data engineers | Data scientists/statisticians |
| Usage | Web development and Hadoop native | Spark native | Data engineering, machine learning, data visualization | Visualization, data analysis, statistics use cases |
| Type of language | Object-oriented, general purpose | Object-oriented and functional, general purpose | General purpose | Specifically for data scientists; needs conversion into Scala/Python before productionizing |
| Concurrency | Supported | Supported | Limited (no thread-level parallelism due to the GIL) | NA |
| Ease of use | Verbose | Less verbose than Java | Least verbose | NA |
| Type safety | Statically typed | Statically typed (except Spark 2.0 DataFrames) | Dynamically typed | Dynamically typed |
| Interpreted language (REPL) | JShell only, since Java 9 | Yes (spark-shell) | Yes | Yes |
| Mature machine learning library availability/support | Limited | Limited | Excellent | Excellent |
| Web notebook support | IJava kernel in Jupyter Notebook | Apache Zeppelin Notebook | Jupyter Notebook | R Notebook |
With the information we have gathered about these languages, let's move to the main question: which language should you choose for Spark?
My answer is not a single, straightforward language. Instead, I will lay out a point of view for choosing the proper one.
You need to answer the following important questions before choosing the language:
Details of these are explained below:
1. Skillset: This is very straightforward. Go with whichever skill set is available within your team, after evaluating the answers to the other two questions.
If you are self-employed, the language you are most proficient in is most likely the suitable choice.
2. Library Support: The following gives the high-level library capabilities of each language:
3. Performance: As seen earlier in the article, Scala and Java are roughly 10x faster than Python and R because they are JVM languages. However, if you write Python/R applications wisely (for example, avoiding UDFs and not sending data back to the driver), they can perform comparably well.
For learning, depending upon your prior knowledge, Python is the easiest of all to pick up.
For implementation, the choice of language is in your hands, but let me share one tip: you don't have to stick to one language for the whole project. You can divide your problem into small buckets and use the best language for each one. This way, you can strike a balance between optimum performance, skill availability, and the sub-problem at hand.
Do let us know in the comments below how you found this language comparison, which language you think is better for Spark, and which one is "the one for you".