
Apache Spark Tutorial

Advanced Spark Programming

Introduction

In this section we will look at some of Spark's advanced programming concepts, namely broadcast variables and accumulators. Understanding these concepts is not compulsory, but it helps when we want to fine-tune a Spark application.

When we want to perform operations like map, filter, and reduce, we invoke them by passing closures (functions) to Spark. These closures usually refer to variables from the scope in which they are created, and when they run on a worker node, the Spark execution engine copies those variables to the worker. Beyond this default behavior, Spark also supports the creation of two special kinds of shared variables with restricted usage that cover very common usage patterns. These are called broadcast variables and accumulators.
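As a quick illustration of this default shipping of variables (a hypothetical snippet; the names factor and scaled are ours), the local value factor below is captured by the function passed to map and copied to every task that runs it:

scala> val factor = 3
scala> val scaled = sc.parallelize(Seq(1, 2, 3, 4)).map(x => x * factor)  // factor travels with this closure
scala> scaled.collect()
res0: Array[Int] = Array(3, 6, 9, 12)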

Let us see the explanation below.

Explanation with example

Broadcast variables: If we encounter a scenario where we need a large piece of mostly read-only data in multiple operations running in parallel, such as a lookup table or a static map, it is much better to distribute this data to the workers only once rather than with every function or closure call. Spark provides this functionality through "broadcast variables", which are read-only objects copied to each worker node only once.

As we know, when a Spark job is executed, it is broken into stages separated by distributed "shuffle" operations. Spark can automatically distribute the common data required by tasks within each stage; data broadcast this way is cached in serialized form and deserialized before each task runs. So, in essence, explicitly created broadcast variables are useful when the same data is needed across many different stages, or when we want the data cached in deserialized form.

We can create a broadcast variable by simply calling the broadcast method on the SparkContext and passing it a variable, say v. What we get back is just a wrapper around v. We can look at the code below to see how this can be done.

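A minimal spark-shell sketch of this (the wrapped array is illustrative; the pattern follows the standard broadcast example from the Spark documentation):

scala> val v = Array(1, 2, 3)
v: Array[Int] = Array(1, 2, 3)

scala> val broadcastVar = sc.broadcast(v)
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)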

Once the wrapper broadcast variable is created, we should use broadcastVar in our closures instead of the original variable v; this way v will not be shipped with every closure or function call. We should also make sure that a broadcast variable is not modified after it is created, so that every worker node sees the same state; otherwise the behavior will not be consistent and we might see unpredictable results.
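For instance, a broadcast lookup table (a hypothetical example; the name lookup and the map contents are ours) is accessed inside a transformation through its value method:

scala> val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))
scala> sc.parallelize(Array(1, 2, 3)).map(x => lookup.value(x)).collect()
res1: Array[String] = Array(one, two, three)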

Accumulators: Accumulators are special Spark variables to which the worker nodes can only add, using an associative operation. Only the driver is able to read an accumulator's value; the workers cannot. A classic use case is implementing counters or sums in parallel, as in MapReduce programming. Since only addition is allowed, accumulators can be supported efficiently and in a fault-tolerant manner.

Accumulators can be created by calling the accumulator method on the SparkContext, passing it an initial value v. Tasks running on the cluster can then add to the accumulator using the add method or the Scala += operator. The workers, however, cannot read the accumulator's value; only the driver program can, using the value method. Let us see an example of how this can be done.

scala> val myaccum = sc.accumulator(0, "MyAccumulator")
myaccum: org.apache.spark.Accumulator[Int] = 0

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => myaccum += x)

scala> myaccum.value
res4: Int = 10

Accumulators that are given a proper name can be seen in the Spark web UI, which displays them in the task table. This can be helpful for understanding the progress of running stages.


Conclusion

In this section we looked at broadcast variables and accumulators, and saw how and when to use them in a Spark application.
