
Apache Spark Programming with RDD


Introduction

In this section, we will look at a concrete example of RDD transformation and action functions and see their output by executing them on the Spark shell.

We have seen above the functions we can use with RDDs. These fall into two categories: transformations, which produce another RDD, and actions, which produce anything other than an RDD and either send the result to the driver or write it to disk or other stable storage.

Implementing RDD Transformations and Actions with an example:

Let us look at a concrete example of executing an RDD transformation and action on real data. Many examples in Scala, Python, and Java ship with the Apache Spark installation and can be executed on the Spark shell. The examples are available in the Spark GitHub repository at: https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples

- Procedure for executing the examples: All of these examples can be executed by submitting the examples jar provided with the Spark installation. We can also execute them interactively on the Spark shell. Let us execute a simple Word Count example on the Spark shell to understand the process in detail.

- Open Spark-Shell: The first step is to open the spark-shell on a machine where Spark is installed. Please execute the following command on the command line:

> spark-shell

This should open the Spark shell as below:

[Screenshot: Spark shell welcome banner and scala> prompt]

- Create an RDD: The next step is to create an RDD by reading the text file whose words we are going to count.

I have a file called “Spark.txt”. You can use any .txt file similarly; just note its location. The first step is to create an RDD by reading the file as below:

[Screenshot: creating the RDD from Spark.txt on the Spark shell]
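As a minimal sketch of this step, assuming Spark.txt sits in the hypothetical location /Users/home/Downloads (sc is the SparkContext that the Spark shell creates automatically):

scala> val textFile = sc.textFile("/Users/home/Downloads/Spark.txt")   // RDD[String], one element per line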

- Execute the Word Count transformations: The next step is to execute the word count transformations in stages:

- Each line in the file is split into words using the flatMap RDD transformation. flatMap applies a function that returns a sequence for each element, and flattens the resulting sequences into a single collection.

[Screenshot: flatMap transformation on the Spark shell]
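A sketch of this step, assuming the textFile RDD from above and words separated by single spaces:

scala> val words = textFile.flatMap(line => line.split(" "))   // one element per word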

- Each word is then turned into a key-value pair using the map transformation, which assigns the value 1 to each word key.

[Screenshot: map transformation creating (word, 1) pairs]
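A sketch of this step, assuming the words RDD from the previous step:

scala> val pairs = words.map(word => (word, 1))   // RDD[(String, Int)]: each word paired with 1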

- In the last step, the values of matching keys are added together using the reduceByKey function to get the final count of each word.

[Screenshot: reduceByKey transformation and collect output]
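Putting this step together, a sketch assuming the pairs RDD from above (collect() is the action that brings the result back to the driver):

scala> val counts = pairs.reduceByKey(_ + _)   // add up the 1s for each distinct word
scala> counts.collect()                        // action: returns Array[(String, Int)] to the driver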

Please note that I have executed the .collect() step only for demonstration purposes, to show the intermediate results and aid understanding. It is not required in actual programs.

- Current RDD: In our example above, we create a different RDD at each step. If we want to know more about the current RDD, we can execute the following command:

> counts.toDebugString

This prints the RDD's full dependency lineage for debugging purposes.

[Screenshot: counts.toDebugString lineage output]
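The exact lines depend on the Spark version, RDD ids, and partition counts, but for our counts RDD the printed lineage looks roughly like this (illustrative only):

(2) ShuffledRDD[4] at reduceByKey at <console>:27 []
 +-(2) MapPartitionsRDD[3] at map at <console>:26 []
    |  MapPartitionsRDD[2] at flatMap at <console>:25 []
    |  /Users/home/Downloads/Spark.txt MapPartitionsRDD[1] at textFile at <console>:24 []
    |  /Users/home/Downloads/Spark.txt HadoopRDD[0] at textFile at <console>:24 []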

- Caching the Transformations: If we look at the three-step execution of our word count example in detail, we note that each time I executed .collect(), the execution started again from reading the file. Every time an action was called, Spark re-computed all the steps in the lineage, which is not what we want. We can avoid this by persisting or caching the RDD using the persist or cache methods. This caches the RDD in memory once an action is called, so subsequent iterative steps do not re-compute the same steps but use the cache instead and perform better.

[Screenshots: caching the RDD and re-running the action]
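A minimal sketch of this step: cache counts, then call the action. For RDDs, cache() is shorthand for persist() with the default MEMORY_ONLY storage level:

scala> counts.cache()     // mark the RDD for caching; nothing is computed yet
scala> counts.collect()   // first action computes the lineage and populates the cache
scala> counts.collect()   // later actions reuse the cached partitions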

- Applying the Action: As we already know, Spark transformations are executed only when an action is called. The action triggers the actual computation of the whole dependency chain and produces the result.

We can execute the following code to save our output.

[Screenshot: saving the output with an action]
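As a sketch, saving counts to the directory that is checked in the next step (saveAsTextFile fails if the directory already exists):

scala> counts.saveAsTextFile("/Users/home/Downloads/output")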

- Checking the Output: We can check the output of our program by opening another terminal and running the command:

> ls -l /Users/home/Downloads/output/

[Screenshot: listing of the output directory]
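The directory should contain one part file per partition of counts plus an empty _SUCCESS marker; with the two partitions in this run, roughly:

_SUCCESS
part-00000
part-00001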

Output: The output can be seen by reading the two part files created as the output of our program.
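Assuming the standard Hadoop-style part-file names listed above, a command along these lines prints the counts:

> cat /Users/home/Downloads/output/part-00000 /Users/home/Downloads/output/part-00001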

[Screenshot: word count output from the part files]

This is not the full output; my screen could not capture the entire terminal window because the results span multiple screens due to the larger input file.

The output also contains a _SUCCESS file, which shows that the program completed successfully. This convention comes from the analogous MapReduce concept in Hadoop.

Conclusion

We saw above how to work with transformations and actions on the Spark shell. Working with other transformations and actions is very similar.
