How to Install Spark on Ubuntu

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

In this article, we will cover the installation procedure of Apache Spark on the Ubuntu operating system.

Prerequisites

This guide assumes that you are running Ubuntu and that the following are already installed on your system:

  1. Java 8
  2. Hadoop 2.7

System Requirements

  • Ubuntu OS installed.
  • A minimum of 8 GB of RAM.
  • At least 20 GB of free disk space.

Installation Procedure

Making the system ready

Before installing Spark, ensure that Java 8 is installed on your Ubuntu machine. If it is not, follow the process below to install it.

a. Install Java 8 using the below command.

sudo apt-get install oracle-java8-installer

The above command creates a java-8-oracle directory under /usr/lib/jvm/ on your machine.
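Note: the oracle-java8-installer package has since been discontinued, so the command above may fail on current Ubuntu releases. If it does, the OpenJDK 8 build from the standard Ubuntu repositories is a working alternative (in that case, the JAVA_HOME path used below would be /usr/lib/jvm/java-8-openjdk-amd64 on 64-bit systems):

sudo apt-get install openjdk-8-jdk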

Now we need to configure the JAVA_HOME path in the .bashrc file.

The .bashrc file is executed every time you open a new terminal, so variables set there are available in every session.

b. Configure JAVA_HOME and PATH in the .bashrc file and save it. To edit the .bashrc file, use the below command.

vi .bashrc 

Then press i (to enter insert mode) and add the below lines at the bottom of the file. Note that there must be no space after the = sign.

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$PATH:$JAVA_HOME/bin


Then press Esc, type :wq! (to save the changes), and press Enter.
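The change applies to new terminal sessions automatically; to apply it to the current one, reload the file and verify the variable:

source ~/.bashrc
echo $JAVA_HOME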

c. Now test whether Java was installed properly by checking its version. The below command should show the Java version.

java -version

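If the installation succeeded, the output will look something like the following (the exact version and build numbers will vary):

java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)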

Installing Spark on the System

Go to the official Apache Spark download page below and choose the latest release. For the package type, choose ‘Pre-built for Apache Hadoop 2.7 and later’.

https://spark.apache.org/downloads.html


Alternatively, you can start the download through Apache's mirror-selection link:

https://www.apache.org/dyn/closer.lua/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

Creating Spark directory

Create a directory called spark under the /usr directory, using the below command:

sudo mkdir /usr/spark

The above command will prompt for your password, since creating a directory under /usr requires root privileges. Then check whether the spark directory was created under /usr using the below command:

ll /usr/

It should print a listing that includes the new ‘spark’ directory.
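For example, one line of the listing might look like this (the date and surrounding entries will differ on your system):

drwxr-xr-x  2 root root 4096 Mar 10 10:15 spark/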

Go to the /usr/spark directory using the below command:

cd /usr/spark

Download Spark version

Download Spark 2.4.0 into the spark directory using the below command. (The mirror-selection link above does not work with wget, so the command uses Apache's release archive instead, which hosts all Spark versions.)

wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

If you run the ll or ls command, you should see spark-2.4.0-bin-hadoop2.7.tgz in the spark directory.
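Optionally, you can verify the integrity of the download against the checksum Apache publishes alongside each release; the hash printed by the second command should match the contents of the downloaded .sha512 file:

wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz.sha512
sha512sum spark-2.4.0-bin-hadoop2.7.tgz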

Extract Spark file

Then extract spark-2.4.0-bin-hadoop2.7.tgz using the below command.

sudo tar xvzf spark-2.4.0-bin-hadoop2.7.tgz

Now the spark-2.4.0-bin-hadoop2.7.tgz file has been extracted into a directory named spark-2.4.0-bin-hadoop2.7.
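Because the archive was extracted with sudo, the extracted files are owned by root. Spark will run fine this way, but if you would rather own the directory yourself (for example, to simplify later upgrades or cleanup), you can optionally change the ownership:

sudo chown -R $USER:$USER /usr/spark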

Check whether it was extracted using the ll command. The listing should now show both the .tgz archive and the extracted directory.

Configuration

Configure the SPARK_HOME path in the .bashrc file by following the below steps.

Go to the home directory using the below command:

cd ~

Open the .bashrc file using the below command:

vi .bashrc

Now we will configure SPARK_HOME and PATH. Press i to enter insert mode, then add SPARK_HOME and PATH at the bottom of the file like below:

export SPARK_HOME=/usr/spark/spark-2.4.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin


Then save and exit: press Esc, type :wq!, and press Enter.
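As before, reload the file so the change takes effect in the current terminal, and verify it:

source ~/.bashrc
echo $SPARK_HOME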

Testing the Installation

Now we can verify whether Spark was installed successfully on our Ubuntu machine. To verify, run the below command:

spark-shell 

The above command should start the Spark shell.
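The startup output will look something like the following (informational log lines are trimmed here; the scala> prompt at the end is what matters):

Spark context available as 'sc' (master = local[*]).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.11.12
scala>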

Now we have successfully installed Spark on the Ubuntu system. Let’s create an RDD and a DataFrame, and then we will wrap up.

a. An RDD can be created in three ways: by parallelizing an existing collection, by loading an external dataset (for example, with sc.textFile), or by transforming an existing RDD. We will use the first way here: define a list, then parallelize it. Below is the code; paste it line by line at the spark-shell prompt.

val nums = Array(1, 2, 3, 5, 6)  // a local Scala collection
val rdd = sc.parallelize(nums)   // distribute it as an RDD; sc is the SparkContext spark-shell provides

The above will create the RDD.

b. Now we will create a DataFrame from the RDD. Follow the below steps.

import spark.implicits._  // brings the toDF method into scope; spark is the SparkSession created by spark-shell
val df = rdd.toDF("num")  // convert the RDD into a DataFrame with a single column named num

The above code creates a DataFrame with num as its only column.

To display the data in the DataFrame, use the below command:

df.show()

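The call prints the DataFrame’s contents as a small table:

+---+
|num|
+---+
|  1|
|  2|
|  3|
|  5|
|  6|
+---+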

How to uninstall Spark from the Ubuntu System

You can follow the below steps to uninstall Spark from your Ubuntu system.

  1. Remove SPARK_HOME from the .bashrc file.

To remove the SPARK_HOME variable from .bashrc, follow the below steps.

Go to the home directory using the below command:

cd ~

Open the .bashrc file using the below command:

vi .bashrc

Press i to edit the file. Find and delete the export SPARK_HOME=/usr/spark/spark-2.4.0-bin-hadoop2.7 line (and the matching export PATH=$PATH:$SPARK_HOME/bin line) from the .bashrc file, then save: press Esc, type :wq!, and press Enter.
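Open a new terminal, or reload the file so the removal takes effect in the current session:

source ~/.bashrc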

  2. We will also delete the downloaded archive and the extracted Spark files from the system. Since we installed Spark under /usr/spark, that is the directory to remove, and deleting it requires root privileges:

sudo rm -r /usr/spark

The above command deletes the spark directory, including the .tgz archive inside it, from the system.

Open a terminal, type spark-shell, and press Enter; you should now get an error.
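The shell should report that the command no longer exists, for example:

bash: spark-shell: command not found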

Now we can confirm that Spark has been successfully uninstalled from the Ubuntu system. You can also learn more about Apache Spark and Scala here.

Ravichandra Reddy Maramreddy

Blog Author

Ravichandra is a developer specializing in the Spark and Hadoop ecosystems, including HDFS and MapReduce, with experience spanning estimation, requirement analysis, design, development, coordination, and validation. He has extensive hands-on experience with Spark, Spark Streaming, PySpark, Scala, shell scripting, Oozie, Hive, HBase, Hue, Java, Spark SQL, Kafka, and WSO2, as well as with data structures and algorithms.

