How to Install Spark on Ubuntu

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

In this article, we will walk through the procedure for installing Apache Spark on the Ubuntu operating system.

Prerequisites

This guide assumes that you are using Ubuntu and that Hadoop 2.7 is installed on your system.

  1. Java 8 should be installed on your machine.
  2. Hadoop 2.7 should be installed on your machine.

System requirements

  • Ubuntu OS installed.
  • A minimum of 8 GB of RAM.
  • At least 20 GB of free disk space.

Installation Procedure

Making the system ready

Before installing Spark, ensure that Java 8 is installed on your Ubuntu machine. If it is not, follow the process below to install it.

a. Install Java 8 using the command below.

sudo apt-get install oracle-java8-installer

The above command creates a java-8-oracle directory under /usr/lib/jvm/ on your machine.
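
Note: on newer Ubuntu releases, the oracle-java8-installer package is no longer available from its PPA. If the command above fails, OpenJDK 8 is an equivalent alternative:

sudo apt-get install openjdk-8-jdk

With OpenJDK, the JDK directory is typically /usr/lib/jvm/java-8-openjdk-amd64, so point JAVA_HOME at that path in the next step.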

Now we need to configure the JAVA_HOME path in the .bashrc file, which runs every time a new terminal is opened.

b. Configure JAVA_HOME and PATH in the .bashrc file and save. To edit the .bashrc file, use the command below.

vi ~/.bashrc

Then press i (for insert mode) and add the lines below at the bottom of the file.

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$PATH:$JAVA_HOME/bin

Then press Esc -> type :wq! (to save the changes) -> press Enter.
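
To apply the change in the current terminal session (instead of opening a new one), reload the file and confirm the variable is set:

source ~/.bashrc
echo $JAVA_HOME

The echo command should print the JDK path you configured.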

c. Now test whether Java is installed properly by checking the Java version. The command below should show the version.

java -version
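
The exact output depends on the build you installed, but it should look something like this (version and build numbers will vary):

java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)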

Installing Spark on the System

Go to the official Apache Spark download page linked below and choose the latest release. For the package type, choose 'Pre-built for Apache Hadoop'.

https://spark.apache.org/downloads.html

Or you can download the release directly from the Apache archive:

https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

Creating Spark directory

Create a directory called spark under the /usr directory, using the command below.

sudo mkdir /usr/spark

The above command asks for your password to create the spark directory under /usr; enter it. Then check whether the spark directory was created in /usr using the command below (ll is an Ubuntu alias for ls -alF).

ll /usr/

It should list the new spark directory among the contents of /usr.
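
For example, the listing should contain a line similar to this (permissions, owner, and timestamp will vary):

drwxr-xr-x  2 root root 4096 Jun  7 10:05 spark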

Go to the /usr/spark directory using the command below.

cd /usr/spark

Downloading Spark

Download Spark 2.4.0 into the spark directory using the command below.

sudo wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

If you run the ll or ls command, you will see spark-2.4.0-bin-hadoop2.7.tgz in the spark directory.
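
Optionally, verify the integrity of the download. Apache publishes a SHA-512 checksum alongside each release:

sudo wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz.sha512
sha512sum spark-2.4.0-bin-hadoop2.7.tgz

The digest printed by sha512sum should match the one in the downloaded .sha512 file.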

Extracting the Spark file

Then extract spark-2.4.0-bin-hadoop2.7.tgz using the command below.

sudo tar xvzf spark-2.4.0-bin-hadoop2.7.tgz

The spark-2.4.0-bin-hadoop2.7.tgz file is now extracted into a spark-2.4.0-bin-hadoop2.7 directory.

Check whether it extracted correctly using the ll command; you should see the spark-2.4.0-bin-hadoop2.7 directory alongside the archive.
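
Once extracted, the archive is no longer needed; you can optionally remove it to save space:

sudo rm spark-2.4.0-bin-hadoop2.7.tgz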

Configuration

Configure the SPARK_HOME path in the .bashrc file by following the steps below.

Go to your home directory using the command below.

cd ~

Open the .bashrc file using the command below.

vi .bashrc

Now we will configure SPARK_HOME and PATH.

Press i (for insert mode), then enter SPARK_HOME and PATH at the bottom of the file like below:

export SPARK_HOME=/usr/spark/spark-2.4.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin

Then save and exit by entering the commands below.

Press Esc -> type :wq! -> press Enter
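
Reload .bashrc so the new variables take effect in the current terminal:

source ~/.bashrc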

Testing the Installation

Now we can verify whether Spark is installed successfully on our Ubuntu machine. To verify, run the command below.

spark-shell 

The above command should start the Spark shell.
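
The startup banner looks something like this (exact Scala and Java versions depend on your environment):

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information.

scala>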

We have now successfully installed Spark on the Ubuntu system. To finish, let's create an RDD and a DataFrame.

a. An RDD can be created in three ways; here we will use one of them: define a list, then parallelize it. Copy and paste the lines below into the shell one by one.

val nums = Array(1,2,3,5,6)
val rdd = sc.parallelize(nums)

The above creates an RDD.
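
For reference, the other two common ways (not used here) are loading an external dataset and transforming an existing RDD; for example, with a file path of your choosing:

val lines = sc.textFile("/path/to/file.txt")
val doubled = rdd.map(_ * 2)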

b. Now we will create a DataFrame from the RDD. Follow the steps below.

import spark.implicits._
val df = rdd.toDF("num")

The above code creates a DataFrame with num as its only column.

To display the data in the DataFrame, use the command below.

df.show()
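
This prints the DataFrame's contents as a small table:

+---+
|num|
+---+
|  1|
|  2|
|  3|
|  5|
|  6|
+---+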

How to Uninstall Spark from the Ubuntu System

You can follow the steps below to uninstall Spark from your Ubuntu system.

  1. Remove SPARK_HOME from the .bashrc file.

To remove the SPARK_HOME variable from .bashrc, follow the steps below.

Go to your home directory using the command below.

cd ~

Open the .bashrc file using the command below.

vi .bashrc

Press i to edit the file, then find and delete the export SPARK_HOME=/usr/spark/spark-2.4.0-bin-hadoop2.7 and export PATH=$PATH:$SPARK_HOME/bin lines, and save the file:

Press Esc -> type :wq! -> press Enter

We will also delete the downloaded archive and the extracted Spark files from the system. Since they live under /usr/spark, remove that directory with the command below.

sudo rm -r /usr/spark

The above command deletes the spark directory, including the archive and the extracted files, from the system.

Open a new terminal, type spark-shell, and press Enter; because SPARK_HOME and the PATH entry are gone, the command will no longer be found.

This confirms that Spark has been successfully uninstalled from the Ubuntu system. You can also learn more about Apache Spark and Scala here.
