Now that we have seen how most of the concepts and internals of Apache Spark work, we will take a look at how to install Apache Spark on our local machines (desktops/laptops).
Apache Spark is easy to install on Unix/Linux/Mac operating systems. It can be installed on a standalone machine, and the steps are mostly the same across operating systems. Let us look at the steps to install Apache Spark on a Mac machine, since I am currently on a Mac laptop.
Verifying Java Installation: The first step is to verify the Java installation. Since Apache Spark is developed in Scala, which runs on the JVM, we definitely need a Java installation before anything else. On a Mac, please verify that Java is installed and available on your PATH.
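A quick check from the terminal might look like the following (the exact output format varies by JDK vendor, and many JDKs print the version to stderr):

```shell
# Check whether Java is installed and on the PATH
if command -v java >/dev/null 2>&1; then
  java -version    # prints the installed JDK version (often to stderr)
else
  echo "Java not found - install Oracle JDK or OpenJDK 8+"
fi
```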
If Java is not installed, please go ahead and install either Oracle Java or OpenJDK, version 8 or above.
Verifying Scala Installation: The next step is to check whether Scala is already installed on your machine.
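A check similar to the Java one can be used (note that the command name is lowercase `scala`; some versions print the version string to stderr):

```shell
# Check whether Scala is installed and on the PATH
if command -v scala >/dev/null 2>&1; then
  scala -version    # prints the installed Scala version
else
  echo "Scala not found"
fi
```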
If you don’t have Scala installed on your machine, you need to install Scala first before proceeding with the Spark installation.
Downloading Scala: Scala can be downloaded from the official Scala website: https://www.scala-lang.org/download/
Please install the latest version. The version shown on the download page may differ from the one used in this tutorial; that should not matter. Download the Scala binaries for your operating system.
Installing Scala: After the binaries are downloaded, please install Scala from the downloaded binary. On MacOS, Scala can also be installed using Homebrew:
> brew install scala
After the installation is complete, please run the "scala -version" command again to confirm that the installation completed properly.
Downloading Apache Spark: Now we are ready to install Apache Spark. Apache Spark can be downloaded from the Apache Spark website https://spark.apache.org/downloads.html
Please select the latest stable release of Spark; you can also choose the build for a corresponding Hadoop version. The Hadoop version is important if you have an HDFS setup installed locally. Please note that we do not need HDFS installed locally for Spark to work on our local machine to get started.
Installing Spark: After the download is complete, please install Spark from the binary. On MacOS, Spark can also be installed using Homebrew:
> brew install apache-spark
Verifying the Spark Installation: After all the above steps are done, please verify the Spark installation on your machine.
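One way to verify is to ask the Spark shell for its version (this assumes spark-shell was put on your PATH by the install; the exact banner it prints will vary with your installed version):

```shell
# Print the installed Spark version (assumes spark-shell is on the PATH)
if command -v spark-shell >/dev/null 2>&1; then
  spark-shell --version
else
  echo "spark-shell not found - check your Spark installation and PATH"
fi
```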
The version installed on my Mac is 2.4.3, which was the latest version of Spark at the time of writing this tutorial. This is all it takes to install Spark and start getting your hands dirty. Running the spark-shell command starts a Scala interactive shell that can be used to run Spark commands interactively; it can also be used to write small programs in Scala and run examples of Spark code.
If you want to work in Python and use pyspark, you can install Python and then install pyspark using
> pip install pyspark
You can then verify the pyspark installation. If your installation of Python and pyspark is correct, you will be able to import pyspark and print its version.
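A simple check imports the package from the Python interpreter (this assumes `python` on your PATH is the same interpreter that pip installed pyspark into):

```shell
# Confirm pyspark is importable and print its version
if command -v python >/dev/null 2>&1 && python -c "import pyspark" >/dev/null 2>&1; then
  python -c "import pyspark; print(pyspark.__version__)"
else
  echo "pyspark not found - check your Python and pip installation"
fi
```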
You can use the Scala shell or the pyspark shell to start learning Spark in the language of your choice. For Java there is no Spark shell available, so you need to start working in IntelliJ or Eclipse with the Scala compiler added to your IDE. Both Eclipse and IntelliJ have very good support for Scala. For Python programming, you can use PyCharm or any IDE of your choice.
While programming for Spark in an IDE, you might need to download the Spark artifacts. They are hosted on Maven Central. You can add the Maven dependency as below:
groupId: org.apache.spark
artifactId: spark-core_2.11
version: 2.4.3
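In a Maven pom.xml, these coordinates correspond to the following dependency block (the `_2.11` suffix in the artifactId is the Scala version the build targets; match it to your Scala installation, and adjust the version to the Spark release you installed):

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.4.3</version>
</dependency>
```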
This is all about the Spark installation.
This module showed us how to install Spark. I have walked through a Mac installation, but the process is very similar and just as easy on other operating systems such as Windows, Linux, and Unix.