Apache Spark: RDD operations with Scala
In this notebook, we go over Spark’s resilient distributed dataset, or RDD. The official programming guide can be found here. RDDs form the backbone of Spark’s data structures: the `Dataset` and `DataFrame` APIs are built on top of the RDD.
Acknowledgements
The content of this notebook is to a large extent adapted from [1].
RDD operations with Scala
A Spark RDD provides two types of operations:
- Transformations
- Actions
Transformations
A transformation creates a new RDD from an existing RDD, and we can apply a chain of transformations once the data is loaded into memory. Common transformations include `filter` and `map`; a short sketch of such a chain follows.
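As a minimal, illustrative sketch (the session setup, variable names, and data below are my own assumptions for the example, not code from [1]):

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation; the app name and master are illustrative.
val spark = SparkSession.builder()
  .appName("rdd-demo")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Load a small data set into memory as an RDD.
val nums = sc.parallelize(1 to 10)

// A chain of transformations: each call returns a new RDD.
// Nothing is computed yet; execution waits for an action.
val result = nums.filter(_ % 2 == 0).map(_ * 10)
```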
Transformation examples
In this section, I review some common RDD transformations; a combined code sketch follows the list.
`map(function)`
: Returns a new data set produced by applying the function to each element of the source RDD.

`flatMap(function)`
: Similar to `map`, but each input item can be mapped to zero, one, or more output items.

`mapPartitions(function)`
: Similar to `map`, but operates at the partition level.

`mapPartitionsWithIndex(function)`
: Similar to `mapPartitions`, but the function also receives an `Int` value indicating the index of the partition.

`filter(function)`
: Returns a new RDD containing only the elements that satisfy the predicate.

`union(otherDataset)`
: Returns a new data set containing the elements of both the source RDD and the `otherDataset` RDD. Note that the participating RDDs must be of the same data type.

`intersection(otherDataset)`
: Returns a new data set containing the intersection of the elements of the source RDD and the argument RDD.
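The sketch below exercises each of these transformations on small in-memory RDDs. It assumes the `sc` from the earlier snippet; all the data is made up for illustration:

```scala
val lines = sc.parallelize(Seq("hello spark", "hello scala"))

// map: exactly one output element per input element.
val lengths = lines.map(_.length)

// flatMap: each line maps to zero, one, or more words.
val words = lines.flatMap(_.split(" "))

// filter: keep only the elements that satisfy the predicate.
val shortWords = words.filter(_.length <= 5)

// mapPartitions: the function receives a whole partition as an iterator.
val partitionSums = sc.parallelize(1 to 8, numSlices = 4)
  .mapPartitions(it => Iterator(it.sum))

// mapPartitionsWithIndex: the Int argument is the partition index.
val tagged = sc.parallelize(1 to 8, numSlices = 4)
  .mapPartitionsWithIndex((idx, it) => it.map(x => (idx, x)))

// union and intersection: both RDDs must hold the same element type.
val a = sc.parallelize(Seq(1, 2, 3, 4))
val b = sc.parallelize(Seq(3, 4, 5, 6))
val u = a.union(b)        // 1, 2, 3, 4, 3, 4, 5, 6 (duplicates are kept)
val i = a.intersection(b) // 3, 4 (deduplicated; requires a shuffle)
```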
Actions
Spark transformations are lazily evaluated: a transformation is only executed when an action is called on the resulting RDD.
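To make this concrete, here is a small sketch (again assuming the `sc` defined above) in which the whole chain only runs when `count()` is called:

```scala
// Each transformation merely records a step in the RDD's lineage.
val chain = sc.parallelize(1 to 1000)
  .map(_ + 1)
  .filter(_ % 3 == 0)

// count() is an action: only now does Spark execute the chain.
val n = chain.count()
println(n) // 333
```

With that in mind, let’s see some examples of actions.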
Action examples
`collect()`
: Returns all the elements of the data set as an array to the driver program.

`count()`
: Returns the number of elements in the data set.

`reduce(function)`
: Aggregates the elements of the RDD it is applied on, using the user-provided `function` argument. The `function` should take two arguments and return a single value; it should also be commutative and associative so that it can be computed in parallel.

`first()`
: Returns the first element in the data set.

`take(n)`
: Returns the first `n` elements in the data set as an array.

`takeOrdered(n, [ordering])`
: Returns the first `n` elements of the RDD, using either their natural order or a custom comparator.

`takeSample(withReplacement, num, [seed])`
: Returns an array with a random sample of `num` elements of the data set, with or without replacement, optionally pre-specifying a random number generator seed.

`saveAsTextFile(path)`
: Writes the elements of the RDD as a text file in the local file system, HDFS, or any other supported storage system.

`foreach(function)`
: Applies the `function` argument to each element in the RDD.
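Putting several of these actions together in one sketch (still assuming the local `sc`; the data and the output path are illustrative):

```scala
val data = sc.parallelize(Seq(5, 3, 8, 1, 9, 2))

val all      = data.collect()      // Array(5, 3, 8, 1, 9, 2)
val n        = data.count()        // 6
val sum      = data.reduce(_ + _)  // 28; + is commutative and associative
val head     = data.first()        // 5
val firstTwo = data.take(2)        // Array(5, 3)
val smallest = data.takeOrdered(3) // Array(1, 2, 3), natural ascending order
val sample   = data.takeSample(withReplacement = false, num = 3, seed = 42L)

// foreach runs on the executors; on a cluster, the println output
// appears in the executor logs rather than on the driver console.
data.foreach(x => println(x))

// Writes one part file per partition; the target directory must not
// already exist. The path here is made up for the example.
data.saveAsTextFile("/tmp/rdd-demo-output")
```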
References
- [1] Subhashini Chellappan and Dharanitharan Ganesan, *Practical Apache Spark: Using the Scala API*, Apress, 2018.