You can view my other articles on Spark RDD at the links below:

Apache Spark RDD API using Pyspark…

Tips and Tricks for Apache Spark RDD API, Dataframe API

How did Spark become so much more efficient at data processing than MapReduce? It comes with a very advanced Directed Acyclic Graph (DAG) data processing engine. What this means is that for every Spark job, a DAG of tasks is created and executed by the engine. In mathematical parlance, a DAG consists of a set of vertices and the directed edges connecting them. The tasks are executed as per the DAG layout. In the MapReduce case, the DAG consists of only two vertices: one vertex for the map task and the other for the reduce task, with the edge directed from the map vertex to the reduce vertex. In Spark's case, the DAG of tasks can be as complicated as it needs to be. In-memory data processing combined with this DAG-based processing engine is what makes Spark so efficient. Thankfully, Spark comes with utilities that give an excellent visualization of the DAG of any running Spark job.


The DAG of a Spark job is generated on the fly and can be visualized in the Spark web UI.
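Another way to peek at this DAG from code (more precisely, at the RDD lineage it is built from) is the RDD's toDebugString() method. Below is a minimal PySpark sketch, using made-up numbers, that builds a small chain of transformations, prints its lineage, and then triggers an action so that the job (and its DAG visualization) shows up in the web UI:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# A small chain of transformations; each step adds to the lineage/DAG.
numbers = sc.parallelize(range(1, 11))
evens = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# toDebugString() describes the RDD and its recursive dependencies,
# that is, the lineage from which the DAG of tasks is built.
lineage = squares.toDebugString()
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)

# Only an action triggers execution; the web UI then shows the
# DAG visualization for the job launched by this action.
print(squares.sum())   # 4 + 16 + 36 + 64 + 100 = 220
```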

The Spark programming paradigm is very powerful and exposes a uniform programming model supporting application development in multiple programming languages. Spark supports programming in Scala, Java, Python, and R, even though there is no functional parity across all the supported languages. Apart from writing Spark applications in these languages, Spark has an interactive shell with Read, Evaluate, Print, Loop (REPL) capabilities.

The Spark REPL enables easy prototyping, debugging, and much more.
In addition to the core data processing engine, Spark comes with a powerful stack of domain-specific libraries that use the core Spark libraries and provide functionality for various big data processing needs. The following list shows the supported libraries:

  • Spark SQL
  • Spark Streaming
  • Spark MLlib
  • Spark GraphX

Spark can be deployed on a variety of platforms, including distributed and cloud platforms. Spark runs on Windows and UNIX-like operating systems (such as Linux and macOS). Spark can access data from a wide variety of data stores; some of the most popular ones include HDFS, Apache Cassandra, HBase, Hive, and so on.

In any distributed application, it is common to have a driver program that controls the execution, and one or more worker nodes. The driver program allocates the tasks to the appropriate workers. This is the same even if Spark is running in standalone mode. In the case of a Spark application, its SparkContext object is the driver program and it communicates with the appropriate cluster manager to run the tasks. The Spark master, which is part of the Spark core library, the Mesos master, and the Hadoop YARN Resource Manager are some of the cluster managers that Spark supports. In the case of a Hadoop YARN deployment of Spark, the Spark driver program either runs inside the Hadoop YARN application master process or runs as a client to Hadoop YARN.

The following figure describes the standalone deployment of Spark:

The following figure describes the Hadoop YARN deployment of Spark:

Spark installation
Visit here for understanding, choosing, and downloading the right type of installation for your computer.

Development tool installation
For building Spark applications in Scala, the preferred build tool is the Scala build tool (sbt). Visit here for downloading and installing sbt.
Maven is the preferred build tool for building Java applications. Maven will come in handy if Spark is to be built from source. Visit here for downloading and installing Maven.
There are many Integrated Development Environments (IDEs) available for Scala as well as Java. It is a personal choice, and developers can choose the tool they prefer for the language in which they are developing Spark applications.

For Spark applications in Scala, it is good to have sbt-based Scala projects and develop them using a supported IDE, such as, but not limited to, Eclipse or IntelliJ IDEA. Visit the appropriate website for downloading and installing the preferred IDE for Scala.

Notebook-style application development tools are very common these days among data analysts and researchers. For Spark application development in Python, IPython provides an excellent notebook-style development tool as the Python language kernel for Jupyter. Spark can be integrated with IPython so that, when the Spark REPL for Python is invoked, it starts the IPython notebook. You can then create a notebook and write code in it just as commands are given in the Spark REPL for Python.
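Inside such a notebook (or the plain pyspark shell), the SparkContext is already exposed as the variable sc, so a first cell can be as simple as the following sketch; the words used here are just sample values:

```python
# In a notebook started through the PySpark/Jupyter integration,
# the SparkContext is already available as `sc`.
words = sc.parallelize(["spark", "rdd", "dag", "spark"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.collect()   # [('dag', 1), ('rdd', 1), ('spark', 2)] -- order may vary
```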

Among the R user community, the preferred IDE for R is RStudio. RStudio can be used to develop Spark applications in R as well. Visit here to download and install RStudio.

Apache Zeppelin is another promising project that is currently being incubated. It is a web-based notebook similar to Jupyter, but it supports multiple languages, shells, and technologies through its interpreter strategy, inherently enabling Spark application development.
Zeppelin supports a good number of interpreters or backends out of the box, such as Spark, Spark SQL, Shell, Markdown, and many more. The frontend, too, has a pluggable architecture, namely the Helium Framework. The data generated by the backend is displayed by frontend components such as AngularJS, and there are various options for displaying it: the raw format generated by the interpreters, tabular format, charts, and plots.

RDD
The most important feature that Spark took from Scala is the ability to use functions as parameters to the Spark transformations and Spark actions. Quite often, the RDD in Spark behaves just like a collection object in Scala. Because of that, some of the data transformation method names of Scala collections are used in Spark RDD to do the same thing. This is a very neat approach and those who have expertise in Scala will find it very easy to program with RDDs. We will see a few important features in the following sections.
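To make this concrete, here is a minimal PySpark sketch, with made-up temperature values, that passes a named function to one transformation (map), an anonymous function to another (filter), and another anonymous function to an action (reduce), much like calling the corresponding methods on a Scala or Python collection:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def to_fahrenheit(celsius):
    """An ordinary named function passed as a parameter to a transformation."""
    return celsius * 9.0 / 5.0 + 32.0

celsius_readings = sc.parallelize([0.0, 20.0, 37.0, 100.0])

# Functions (named or anonymous) are passed to transformations and actions,
# just as they are with collection methods.
fahrenheit = celsius_readings.map(to_fahrenheit)
hot = fahrenheit.filter(lambda f: f > 90.0)

print(fahrenheit.collect())             # [32.0, 68.0, 98.6, 212.0]
print(hot.reduce(lambda a, b: a + b))   # 98.6 + 212.0 = 310.6
```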

Spark RDD is immutable
There are some strong rules based on which an RDD is created. Once an RDD is created, it cannot be changed, intentionally or unintentionally. This gives another insight into the construction of an RDD: when the nodes processing some part of an RDD die, the driver program can recreate those parts and assign the task of processing them to other nodes, ultimately completing the data processing job successfully.
Since the RDD is immutable, splitting a big one into smaller ones, distributing them to various worker nodes for processing, and finally compiling the results to produce the final result can all be done safely, without worrying about the underlying data getting changed.
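A minimal PySpark sketch of this behavior, with made-up numbers: a transformation never modifies the RDD it is called on; it always returns a new RDD.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

original = sc.parallelize([1, 2, 3, 4, 5])

# Transformations do not modify `original`; they return a new RDD.
doubled = original.map(lambda n: n * 2)

print(original.collect())   # [1, 2, 3, 4, 5] -- unchanged
print(doubled.collect())    # [2, 4, 6, 8, 10]
```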

Spark RDD is distributable
If Spark is run in cluster mode, where multiple worker nodes are available to take the tasks, all these nodes have different execution contexts, and the individual tasks are distributed and run on different JVMs. All of this activity of a big RDD getting divided into smaller chunks, getting distributed to the worker nodes for processing, and finally having the results assembled back, is completely hidden from the users.

Spark has its own mechanism for recovering from system faults and other forms of errors that occur during data processing, and hence this data abstraction is highly resilient.
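The unit of this distribution is the partition. A minimal PySpark sketch, with made-up numbers, that splits an RDD into an explicit number of partitions and inspects how the elements are laid out across them:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Split the data into 3 partitions; each partition can be processed
# by a different executor on a different worker node.
rdd = sc.parallelize(range(1, 13), numSlices=3)

print(rdd.getNumPartitions())   # 3
print(rdd.glom().collect())     # the elements, grouped per partition
```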

Spark RDD lives in memory
Spark keeps the RDDs in memory as much as it can. Only in rare situations, where Spark is running out of memory or the data size is growing beyond capacity, is the data written to disk. Most of the processing on an RDD happens in memory, and that is the reason why Spark is able to process data at lightning speed.
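How an RDD is kept around between actions can also be controlled explicitly. A minimal PySpark sketch, with made-up numbers, using persist() and a storage level that spills to disk when memory runs short:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()

squares = sc.parallelize(range(1, 1001)).map(lambda n: n * n)

# Keep the computed RDD in memory between actions; partitions that do
# not fit in memory are spilled to disk instead of being recomputed.
squares.persist(StorageLevel.MEMORY_AND_DISK)

print(squares.count())   # first action: computes and caches the RDD
print(squares.sum())     # second action: reuses the cached data
```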

Spark RDD is strongly typed
A Spark RDD can be created using any supported data type. These data types can be Scala/Java-supported intrinsic data types or custom-created data types such as your own classes. The biggest advantage coming out of this design decision is freedom from runtime errors: if something is going to break because of a data type issue, it will break at compile time.

The following table captures the structure of an RDD containing tuples of retail bank account data. It is of the type RDD[(string, string, double)]:

AccountNo – TranNo – TranAmount

Suppose this RDD is going through a process to calculate the total amount across all these accounts in a cluster of three nodes, N1, N2, and N3; it can be split and distributed to parallelize the data processing. Say two rows containing elements of the RDD are distributed to node N1 for processing, and three rows are distributed to node N2. On node N1, the summation process happens and the result is returned to the Spark driver program. Similarly, on node N2, the summation process happens, the result is returned to the Spark driver program, and the final result is computed.
Spark has very deterministic rules on splitting a big RDD into smaller chunks for distribution to various nodes. Because of that, even if something happens to, say, node N1, Spark knows how to recreate exactly the chunk that was lost on node N1 and continue the data processing operation by sending the same payload to node N3.
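A minimal PySpark sketch of this summation on a single machine; the account numbers and amounts below are made-up sample values, and the two partitions stand in for the chunks that would go to nodes N1 and N2:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical (AccountNo, TranNo, TranAmount) records, split into
# two partitions so that different nodes can process different chunks.
records = sc.parallelize([
    ("SB10001", "TR001", 250.0),
    ("SB10002", "TR002", 450.0),
    ("SB10003", "TR003", 120.0),
    ("SB10004", "TR004", 300.0),
    ("SB10005", "TR005", 1000.0),
], numSlices=2)

# Each partition is summed where it lives; the partial sums are then
# combined to produce the final total in the driver program.
total = records.map(lambda rec: rec[2]).sum()
print(total)   # 2120.0
```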

Data transformations and actions with RDDs

Let's say we have data in the format: AccountNo – TranNo – TranAmount

To calculate the account-level summary of the transactions from an RDD of the form (AccountNo, TranNo, TranAmount):
First, it has to be transformed into key-value pairs of the form (AccountNo, TranAmount), where AccountNo is the key and there will be multiple elements with the same key. On this key, do a summation operation on TranAmount, resulting in another RDD of the form (AccountNo, TotalAmount), where every AccountNo has only one element and TotalAmount is the sum of all the TranAmount values for the given AccountNo. Finally, sort the key-value pairs on AccountNo and store the output.
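A minimal PySpark sketch of this chain; the records and the output path are made up for illustration:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical (AccountNo, TranNo, TranAmount) records.
transactions = sc.parallelize([
    ("SB10001", "TR001", 250.0),
    ("SB10001", "TR004", 450.0),
    ("SB10002", "TR002", 120.0),
    ("SB10002", "TR005", 300.0),
    ("SB10003", "TR003", 1000.0),
])

# 1. Transform to (AccountNo, TranAmount) key-value pairs.
pairs = transactions.map(lambda rec: (rec[0], rec[2]))

# 2. Sum the amounts per account -> (AccountNo, TotalAmount).
totals = pairs.reduceByKey(lambda a, b: a + b)

# 3. Sort the key-value pairs on AccountNo.
sorted_totals = totals.sortByKey()

# 4. Storing the output is the Spark action that triggers execution.
sorted_totals.saveAsTextFile("/tmp/account-summary")   # hypothetical output path
```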


In the whole process described above, everything is a Spark transformation except the storing of the output, which is a Spark action. Spark does all these operations on a need-to-do basis: Spark does not act when a transformation is applied. The real execution happens when the first Spark action in the chain is called. Then Spark diligently applies all the preceding transformations in order and finally carries out the first encountered action.

When a transformation is done on an RDD, a new RDD gets created, because RDDs are inherently immutable. The RDDs created at the end of each transformation can be saved for future reference, or they will eventually go out of scope.
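A minimal PySpark sketch of this lazy behavior, using an artificially slow function: the map transformation returns a new RDD instantly, and the expensive work only happens when the action is invoked.

```python
import time
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def slow_square(n):
    time.sleep(0.01)   # pretend this is expensive work
    return n * n

numbers = sc.parallelize(range(1, 101))

start = time.time()
squares = numbers.map(slow_square)              # transformation: returns a new RDD immediately
print("after map: %.2f s" % (time.time() - start))

total = squares.sum()                           # action: triggers the whole chain
print("after sum: %.2f s" % (time.time() - start))
print(total)                                    # 338350
```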

For more information about Spark transformations and Spark actions, see this article of mine.

The Spark web user interface (UI) can be accessed by going to http://localhost:8080/. The assumption here is that no other application is running on port 8080. If, for some reason, this application needs to run on a different port, the command line option --webui-port can be used in the script while starting the web user interface.

to be continued…
