In this section we will look at a concrete example of an RDD transformation function and try to see the output by executing it on the Spark shell.
We have seen above the functions we can use with RDDs. These could be Transformations which produce another RDD or Actions which produce anything other than RDDs and send the result to the Driver or write to the disk or stable storage.
Implementations of RDD Transformations and Actions with an example:
Let us look at a concrete example of executing RDD transformation and action on real data. There are many examples available in Scala, Python and Java which are readily available with Apache Spark installation and they can be executed on the Spark shell. The examples are available in Spark Github at: https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples
Procedure for executing [the example]: All of these examples can be executed by submitting the examples.jar provided with Spark installation. We can also execute these interactively on the Spark shell. Let us execute a simple one Word Count example on the Spark shell to understand in detail.
Open Spark-Shell: The first step is to open the spark-shell on your machine where Spark is installed. Please execute the following command on the command line
This should open the Spark shell as below:
Create an RDD: The next step is to create an RDD by reading a text file for which we are going to count the words.
I have a file called “Spark.txt”. You can similarly have any .txt file and note the location. The first step is to create an RDD by reading the file as below:
Execute Word count Transformation: The next step is to execute the steps of the word count transformations
Please note that I have executed .collect() step only for demonstration purpose to show the intermediate to get the understanding better. This is not required in actual programming.
Current RDD: In our example above we have different RDDs at the different steps. If we want to know about the current RDD, we can execute the following command and get more details about the RDD.
This gives the whole dependencies of the RDD for debugging purposes.
Caching the Transformations: If we look at the 3 step execution of our word count example in detail, we note that each time I executed .collect(), the execution started from reading the file. So every time an action was called, it re-computed all the steps in my execution which is not what we would like. So we can avoid this by persisting or caching the RDD. This can be done by persist or cache methods. This caches the RDD in memory after the action is called and then the next iterative step will not re-compute the same steps but will use the cache and will perform better.
Applying the Action: As we already know, all Spark transformations are executed only when an action is called. This results into the actual computation of the whole dependencies and gets the result for the computation.
We can execute the following to save our output.
Checking the Output: We can check the output of our program by opening another terminal and running the command:
> ls -l /Users/home/Downloads/output/
Output: The output can be seen by running the below command on the 2 part files created as the output of our program.
This is not the full output as my screen could not capture the full terminal window as results span across windows due to bigger input file.
The output also contains _SUCCESS file which shows that the program execution is completed successfully. This comes from the similar MapReduce concept in Hadoop.
We saw above how to work with a transformation on the Spark shell. Working with other set of transformations and actions is very similar.