Apache Spark. Application concepts
Introduction to Spark concepts for application development
In this post I want to go over some Spark application concepts. These concepts allow us to better understand what Spark is doing behind the scenes when it executes our programs. The material in this post is taken from the excellent book Learning Spark: Lightning-Fast Data Analytics [1]. Moreover, you can find a complete set of notes on Spark at Statistics, Probability & Machine Learning Notes and code snippets at Computational Statistics with Scala.
Let's begin by introducing the following terms [1]:
- Application
- SparkSession
- Job
- Stage
- Task
An application is a user's program that is built using Spark's APIs. It consists of a driver program and executors on the cluster. A job consists, possibly, of many stages that depend on each other. This dependency is organized into a Directed Acyclic Graph, or DAG, in which each node represents a stage [1]. Furthermore, each stage is composed of tasks. A task is the unit of work in Spark and is executed by an executor. An executor is typically a computing node of the cluster on which Spark is deployed. A computing node may be a multicore machine, so each task is mapped to a single core and works on a single partition of the data. The article Spark Basics: RDDs, Stages, Tasks and DAG describes these concepts nicely.
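To make these terms a bit more concrete, here is a minimal sketch. The input file data.csv, the column country and the session variable spark are assumptions used only for illustration: every action launches a job, Spark breaks each job into stages at shuffle boundaries, and each stage runs one task per data partition.

// Minimal sketch: data.csv and the column country are hypothetical
val df = spark.read.option("header", "true").csv("data.csv")

df.count()                             // an action launches a job
df.groupBy("country").count().show()   // the shuffle introduced by groupBy adds a stage boundary

// The Spark UI (http://localhost:4040 by default) lists the submitted jobs
// together with their stages and tasks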
The last term is SparkSession. It provides an entry point for interacting with the underlying functionality that Spark offers.
A SparkSession is either created automatically for us, which is the case when using the shell, or the application needs to instantiate one itself. The following snippet shows the latter.
import org.apache.spark.sql.SparkSession

// A standalone application defines main inside an object, not a class
object MySparkApp {

  def main(args: Array[String]): Unit = {
    // ...
    val appName: String = "MySparkApp"

    // Reuse an existing SparkSession or create a new one
    val spark = SparkSession
      .builder()
      .appName(appName)
      .getOrCreate()
    // ...
  }
}
Note that there can be only one SparkSession per JVM.
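This is also why the snippet above calls getOrCreate(): if a session already exists in the JVM, the builder simply hands it back instead of constructing a second one. A minimal sketch, assuming both calls run inside the same application:

import org.apache.spark.sql.SparkSession

val first  = SparkSession.builder().appName("MySparkApp").getOrCreate()
val second = SparkSession.builder().getOrCreate()

// Both references point to the very same session instance
println(first eq second)   // prints: true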
Let's now turn to the operations we can execute. Operations on a distributed data set can be classified into two types: transformations and actions [1]. You can find more information on these two kinds of operations in the RDD Programming Guide. Below we just give a brief overview of what each operation entails.
Transformations in Spark transform a DataFrame into a new one. This is done without altering the original data; hence a transformation is an immutable operation as far as the original data is concerned. Some examples of transformations are listed below, and a short sketch follows the list.
- orderBy()
- groupBy()
- filter()
- join()
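A minimal sketch of chained transformations; the people DataFrame and its columns name, age and country are hypothetical. Each call returns a new DataFrame, and people itself is never modified:

import org.apache.spark.sql.functions.col

// people is a hypothetical DataFrame with columns name, age, country
val adults    = people.filter(col("age") >= 18)       // keep only adults
val byCountry = adults.groupBy("country").count()     // row counts per country
val ranked    = byCountry.orderBy(col("count").desc)  // largest groups first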
All transformations in Spark are evaluated lazily [1] (see Lazy evaluation). What this means is that their results are not computed immediately; instead, a transformation is recorded as lineage. This allows Spark to rearrange certain transformations, coalesce them, or perform other optimizations for more efficient execution (e.g. by joining or pipelining some operations and assigning them to a stage) [1].
We can classify transformations according to the dependencies they have as transformations with narrow dependencies or transformations with wide dependencies [1]. A narrow transformation is a transformation whose output can be computed from a single input partition. In particular, a narrow transformation does not need to exchange data with other partitions of the data in order to compute the result. Wide transformations read data from other partitions in order to compute the result; groupBy or orderBy are two transformations that instruct Spark to perform wide transformations [1]. Evidently, we should avoid wide transformations if possible.
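One way to see the difference is to print the physical plan with explain(), again using the hypothetical people DataFrame from above. The wide groupBy typically shows up as an Exchange (i.e. a shuffle) step in the plan, whereas the narrow filter does not:

// Narrow: each output partition depends on exactly one input partition
people.filter(col("age") >= 18).explain()

// Wide: the groupBy forces a shuffle, visible as an Exchange in the plan
people.groupBy("country").count().explain()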
An action triggers the lazy evaluation of all the recorded transformations [1]. A list of common actions is given below, followed by a short sketch.
- show()
- take()
- count()
- collect()
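A minimal sketch of the interplay between the two, again on the hypothetical people DataFrame: the transformations are only recorded, and nothing runs on the cluster until an action is invoked:

// Nothing is executed here: filter() and select() are only recorded as lineage
val adultNames = people.filter(col("age") >= 18).select("name")

// Each action below triggers evaluation of the recorded lineage
val n = adultNames.count()   // returns the number of rows to the driver
adultNames.show(5)           // computes and prints the first five rows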
Remark
Lazy evaluation allows Spark to optimize our queries. Lineage and data immutability allow for fault tolerance [1]. Since Spark records all transformations in its lineage and DataFrames are immutable between transformations, it can reproduce its original state by simply replaying the recorded lineage [1].
Both actions and transformations contribute to a query plan in Spark [1]. Nothing in this plan is executed until an action is invoked [1].
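To inspect the plan that the recorded transformations build up, we can again use explain(); passing true prints the extended form (parsed, analyzed, optimized and physical plans). The query below reuses the hypothetical people DataFrame and does not execute anything:

// Printing the plan does not run the query; only an action would do that
people
  .filter(col("age") >= 18)
  .groupBy("country")
  .count()
  .explain(true)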
In this post we went briefly over some of the basic but core concepts in Spark. Specifically, we saw what a job, a stage and a task are, and how these are used to organize computation in Spark. Furthermore, we touched upon the SparkSession construct. This entity gives us access to all the functionality provided by Spark. Finally, we saw the two types of operations that we can apply to an RDD, namely transformations and actions. We will come back to these topics, as they are a recurring theme when working with Spark.
In Apache Spark. Submit a self-contained Scala application we describe how to submit a standalone Scala application to Spark for execution.
- Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee, Learning Spark: Lightning-Fast Data Analytics, 2nd Edition, O'Reilly.