In this section we will look at how Spark uses persistence and caching to improve the performance of our application. We will also cover good practices for caching objects in memory, and when to release them for optimum performance of a Spark application.
When we persist or cache an RDD in Spark, it occupies some memory (RAM) on the machine or the cluster. It is usually good practice to release this memory once the work is done. But before we release it, let us see how to inspect it while the Spark application is running. We can go to the URL http://localhost:4040 and look at the Storage tab. This shows all the objects that have been persisted or cached in Spark's memory. Since we persisted just one RDD in the previous example, the Storage tab below shows one object.
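As a minimal sketch of the step above (assuming a `SparkContext` named `sc` is already available; the RDD name and data are illustrative, not from the earlier example):

```scala
import org.apache.spark.storage.StorageLevel

// Build a small RDD to cache; the data here is illustrative.
val numbers = sc.parallelize(1 to 100000)

// Mark the RDD for persistence at the default level (MEMORY_ONLY).
numbers.persist()

// Persistence is lazy: the RDD is materialized (and cached) only
// when an action runs. After this count completes, the RDD appears
// under the Storage tab at http://localhost:4040.
numbers.count()
```

Note that `persist()` on its own does nothing visible in the UI; it only marks the RDD, and the cache is populated the first time an action computes it.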
On clicking the object we can see further details, such as the number of partitions and the size occupied by the object.
Once we are sure we no longer need the object in Spark's memory (for example, for iterative processing), we can call the method unpersist(). Once this is done we can again check the Storage tab in Spark's UI and note that the object no longer appears in Spark's memory.
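Continuing the sketch (assuming the hypothetical `numbers` RDD persisted earlier), releasing the cache is a single call:

```scala
// Release the cached partitions. The blocking flag makes the call
// wait until all blocks are removed before returning; without it,
// removal happens asynchronously.
numbers.unpersist(blocking = true)
```

After this call the entry disappears from the Storage tab, and any later action on the RDD will recompute it from its lineage.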
Now that we have seen how to cache or persist an RDD and its benefits, we can look at RDD storage levels in detail.
Let us see them one at a time:
- MEMORY_ONLY: In this storage level the whole RDD is stored as deserialized Java objects in the JVM. It uses only memory for caching and never spills to disk. If the memory is insufficient, some partitions will not be cached and will be recomputed when needed. In this mode memory utilization is highest and CPU cost is lowest.
- MEMORY_AND_DISK: In this storage level the RDD is stored in memory, and only the excess partitions that do not fit are spilled to disk. When needed again, those partitions are read back from disk and served. In this storage level memory utilization is high, some disk is used, and CPU cost is moderate.
- MEMORY_ONLY_SER: In this level Spark stores the RDD as serialized Java objects, one byte array per partition. This is much more space-efficient than deserialized objects, especially with a fast serializer, but it is more CPU-intensive to read; it does not use the disk at all.
- MEMORY_AND_DISK_SER: This level is very similar to MEMORY_ONLY_SER, except that the excess partitions which do not fit in memory are written to disk. They are then read from disk when needed, instead of being recomputed. The space used for storage is low, but CPU cost increases.
- DISK_ONLY: In this level the RDD is persisted on disk only; nothing is kept in memory. Memory utilization is minimal, but CPU and I/O cost increases significantly.
- MEMORY_ONLY_2 and MEMORY_AND_DISK_2: These are similar to MEMORY_ONLY and MEMORY_AND_DISK. The only difference is that each partition of the RDD is replicated on two nodes in the cluster.
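A storage level other than the default is selected by passing it to persist(). A minimal sketch, assuming a `SparkContext` named `sc` and an illustrative input file `data.txt`:

```scala
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("data.txt") // illustrative path

// persist() with an explicit level; cache() is shorthand for
// persist(StorageLevel.MEMORY_ONLY).
lines.persist(StorageLevel.MEMORY_AND_DISK_SER)

// The replicated variants are selected the same way, e.g.:
// lines.persist(StorageLevel.MEMORY_AND_DISK_2)
```

Note that a storage level can only be set once per RDD; to change it, call unpersist() first and then persist() again with the new level.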
Here, we learnt about the different modes of persistence in Spark. This understanding will help us choose the right storage level in the different scenarios we may encounter while writing a Spark application.