How to install Apache Spark on Windows?

Published
05th Sep, 2023

    Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

    In this document, we will cover the installation procedure of Apache Spark on the Windows 10 operating system.

    Prerequisites

    This guide assumes that you are using Windows 10 and have administrator permissions.

    System requirements:

    • Windows 10 OS
    • At least 4 GB RAM
    • Free space of at least 20 GB

    Installation Procedure

    Step 1: Go to Apache Spark's official download page and choose the latest release. For the package type, choose ‘Pre-built for Apache Hadoop 2.7’, since this guide uses the winutils.exe build for Hadoop 2.7 (see Step 6).

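    The downloaded archive will have a name like the one below; the exact version numbers depend on the release you pick:

    spark-2.4.5-bin-hadoop2.7.tgz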

    Step 2: Once the download is complete, unzip the file using WinZip, WinRAR, or 7-Zip.

    Step 3: Create a folder called Spark under your user directory, as below, and copy the contents of the unzipped file into it.

    C:\Users\<USER>\Spark


    Step 4: Go to the conf folder and open the file called log4j.properties.template. Change INFO to WARN (it can be ERROR, to reduce the log further). This and the next step are optional.

    Rename the file to remove the .template extension, so that it becomes log4j.properties and Spark can read it.
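    For reference, the logging level is controlled by a single line in this file; in recent Spark 2.x builds it looks like the below, and the change is just INFO to WARN (the exact contents may vary slightly between versions):

    # before the change
    log4j.rootCategory=INFO, console
    # after the change
    log4j.rootCategory=WARN, console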

    Step 5: Now, we need to configure the path.

    Go to Control Panel -> System and Security -> System -> Advanced Settings -> Environment Variables.

    Add a new user variable (or system variable) named SPARK_HOME whose value is the Spark folder you created in Step 3, C:\Users\<USER>\Spark. (To add a new user variable, click the New button under 'User variables for <USER>'.)

    Click OK.

    Add %SPARK_HOME%\bin to the Path variable.

    Click OK.
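    If you prefer the command line, the same variable can also be set from a Command Prompt with the built-in setx command (a sketch; replace <USER> with your Windows user name, and note that only newly opened windows will see the change):

    setx SPARK_HOME "C:\Users\<USER>\Spark"

    Editing the Path variable itself is safer through the Environment Variables dialog, since setx can truncate long Path values.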

    Step 6: Spark needs a small piece of Hadoop to run on Windows. For Hadoop 2.7, you need winutils.exe.

    You can find winutils.exe on this page and download it from there.

    Step 7: Create a folder called winutils in the C drive and create a folder called bin inside it. Then move the downloaded winutils.exe file into the bin folder:

    C:\winutils\bin


    Add a user (or system) variable named HADOOP_HOME with the value C:\winutils, just as you did for SPARK_HOME, and add %HADOOP_HOME%\bin to the Path variable.

    Click OK.
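    The same folder setup can be sketched from a Command Prompt as below, assuming winutils.exe was saved to your Downloads folder:

    mkdir C:\winutils\bin
    move "%USERPROFILE%\Downloads\winutils.exe" C:\winutils\bin
    setx HADOOP_HOME "C:\winutils"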

    Step 8: To install Apache Spark, Java should be installed on your computer. If you don't have Java installed on your system, please follow the process below.

    Java Installation Steps

    1. Go to the official Java download site and accept the License Agreement for Java SE Development Kit 8u201.

    2. Download the jdk-8u201-windows-x64.exe file.

    3. Double-click the downloaded .exe file; the installer window will open.

    4. Click Next.

    5. The installation options window will be displayed.

    6. Click Next.

    7. After some processing, the completion window will be displayed.

    8. Click Close.

    Test Java Installation

    Open a command line and type java -version; it should display the installed version of Java.

    You should also check that JAVA_HOME is set and that %JAVA_HOME%\bin is included in the Path under user variables (or system variables).
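    A quick check from a new Command Prompt looks like the below; the expected outputs are shown as rem comments, and the exact build number and install directory will depend on your installation:

    java -version
    rem java version "1.8.0_201" (plus runtime and VM details)
    echo %JAVA_HOME%
    rem e.g. C:\Program Files\Java\jdk1.8.0_201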

    1. In the end, the environment variables include three new entries: SPARK_HOME, HADOOP_HOME, and JAVA_HOME (the last only if you needed to add the Java path yourself).

    2. Create the c:\tmp\hive directory. This step is not necessary for later versions of Spark, which create the folder automatically on first start, but it is best practice to create it yourself.

    C:\tmp\hive
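    From a Command Prompt, the folder can be created with:

    mkdir C:\tmp\hive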

    Test Installation

    Open the command line and type spark-shell; you should see the Spark welcome banner, followed by a message that the Spark context is available as sc and a scala> prompt.

    We have completed the Spark installation on the Windows system. To finish, let's create an RDD and a Dataframe.

    1. We can create an RDD in 3 ways: by parallelizing an existing collection, by referencing an external dataset, or by transforming an existing RDD. We will use the first way here.

    Define a list, then parallelize it; that creates the RDD. Copy and paste the code below into the command line, one line at a time.

    // Define a local collection, then distribute it as an RDD.
    // sc is the SparkContext that spark-shell creates for you.
    val list = Array(1, 2, 3, 4, 5)
    val rdd = sc.parallelize(list)

    The above code creates the RDD.
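    To verify, you can bring the RDD's elements back to the driver with collect (fine for a small dataset like this one):

    rdd.collect()   // returns Array(1, 2, 3, 4, 5)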

    2. Now, we will create a Dataframe from the RDD. Follow the steps below.

    // toDF comes from spark.implicits; "id" names the single column.
    import spark.implicits._
    val df = rdd.toDF("id")

    The above code creates a Dataframe with id as its only column.
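    To confirm the column and its type, you can also print the schema; for an Int column built this way, Spark typically reports it as a non-nullable integer (output shown as a comment):

    df.printSchema()
    // root
    //  |-- id: integer (nullable = false)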

    To display the data in the Dataframe, use the command below (note that Scala is case-sensitive, so it is df, not Df):

    df.show()

    It will display the output below:

    +---+
    | id|
    +---+
    |  1|
    |  2|
    |  3|
    |  4|
    |  5|
    +---+

    How to uninstall Spark from Windows 10 System

    Please follow the steps below to uninstall Spark on Windows 10.

    1. Remove the below System/User variables from the system:
    • SPARK_HOME
    • HADOOP_HOME

    To remove System/User variables, please follow the steps below:

    Go to Control Panel -> System and Security -> System -> Advanced Settings -> Environment Variables, find SPARK_HOME and HADOOP_HOME, select them, and press the DELETE button.

    Find the Path variable -> Edit -> select %SPARK_HOME%\bin -> press the DELETE button.

    Select %HADOOP_HOME%\bin -> press the DELETE button -> OK button.

    Open Command Prompt, type spark-shell, and press Enter; you will now get an error, which confirms that Spark has been successfully uninstalled from the system.
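    In a new Command Prompt window, the failed lookup will look something like the below:

    C:\>spark-shell
    'spark-shell' is not recognized as an internal or external command,
    operable program or batch file.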

    Unleash your data superpowers with our advanced data science courses. Dive deep into the world of analytics and gain the skills to unlock valuable insights. Join us today and become a data wizard!

    Conclusion 

    Java 8 or a more recent version is required to install Apache Spark on Windows, so obtain and install it by visiting Oracle. You may download OpenJDK from this page if you'd like. 

    Double-click the downloaded .exe file (jdk-8u201-windows-x64.exe) to install it on your Windows machine once it has finished downloading. You may stick with the default installation directory. This article provides all the details on setting up Apache Spark from scratch.

    Once you have covered all the steps mentioned in this article, Apache Spark should operate perfectly on Windows 10. Start by launching a Spark instance in your Windows environment. If you face any problems, let us know in the comments.

    Frequently Asked Questions (FAQs)

    1. How to install Spark in Windows cmd?

    Spark is a free and open-source framework for handling massive amounts of stream data from many sources. Spark is used in distributed computing for graph-parallel processing, data analytics, and machine learning applications. We have described the procedure for installing Spark from the Windows command line in detail in this article. Give it a read and try out the procedure.

    2. How do I download Apache Spark for Windows?

    Here are the steps to download Apache Spark for Windows:

    • Download Java (Apache Spark needs Java version 8).
    • Install Python.
    • Install Apache Spark.
    • Check the Spark software file.
    • Set up Apache Spark.
    • Add the winutils.exe file.
    • Set the environment variables.
    • Start Spark.
    • Test Spark.
    3. Can I run PySpark on Windows?

    Yes, you can. PySpark is a Spark library that lets Python programs leverage the capabilities of Apache Spark. PySpark ships as part of the Spark distribution, so once Spark is installed you already have it; you only need Spark.

    4. Is PySpark the same as Apache Spark?

    PySpark is the Spark Python API: a partnership between Python and Apache Spark. It is a Python API for Apache Spark that lets you use both the ease of Python and the strength of Apache Spark to work with Big Data.

    Apache Spark itself is a fast and versatile engine for processing large amounts of data. It is a general-purpose, quick processing engine that works with Hadoop data: it can process data in HDFS, HBase, Cassandra, Hive, and any other Hadoop InputFormat, and it can operate in Hadoop clusters using YARN or Spark's standalone mode. Its architecture supports both batch processing (like MapReduce) and newer workloads such as streaming, interactive queries, and machine learning.

    Profile

    Ravichandra Reddy Maramreddy

    Blog Author

    Ravichandra is a developer specialized in the Spark and Hadoop ecosystems, HDFS, and MapReduce, with experience spanning estimation, requirement analysis, design, development, coordination, and validation, and an in-depth understanding of design practices. He has extensive experience with Spark, Spark Streaming, PySpark, Scala, Shell, Oozie, Hive, HBase, Hue, Java, SparkSQL, Kafka, and WSO2, as well as with data structures and algorithms.
