Overview

In the post Apache Spark. Application concepts, we went over some basic but core concepts associated with Spark. In this series of posts, we will be using Scala as the main programming language. Thus, in this post, we describe the steps you need to take in order to submit a Scala application to be executed by Spark. You should also check the official documentation on Self-contained Applications. The code snippets can be found in Computational Statistics with Scala.

Submit a self-contained Scala application

I will assume that the environment is already set up; that is, Scala is already installed (I use version 2.13.3 in this post) and so is Spark (I use version 3.0.1). Finally, I am using SBT for building and packaging the application.
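
If you want to double-check the installed versions, the commands below should print them. This is a minimal sketch; it assumes that scala, sbt and spark-submit are on your PATH (the submit script further down uses the full Spark installation path instead).

# Print the versions of the tools used in this post
scala -version           # Scala compiler/runner version
sbt sbtVersion           # sbt version used by the project
spark-submit --version   # Spark version and build information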

The Scala application simply reports the version of Spark we are using, the master URL, and whether we run in local or distributed mode. It is shown below

// HelloSpark.scala

package spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object HelloSpark {
  def main(args: Array[String]): Unit = {

    // Configure and create the Spark context
    val conf = new SparkConf().setAppName("Hello Scala Spark")
    val sc = new SparkContext(conf)

    // Report the Spark version, the master URL and whether we run locally
    println("Spark version:            " + sc.version)
    println("Spark master:             " + sc.master)
    println("Spark running 'locally'?: " + sc.isLocal)

    sc.stop()
  }
}

Notice that I placed the application under the spark package. This package name will be used when submitting the class to Spark. For SBT to work, we need to structure our code properly. In particular, we need a file structure as follows

build.sbt
+src/
    +main/
         +scala/
               +spark/HelloSpark.scala

The build.sbt script is shown below

name := "Hello Spark"

version := "0.0.1"

scalaVersion := "2.13.3"

libraryDependencies += "org.apache.spark" % "spark-core_2.12" % "3.0.1"
libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "3.0.1"

As a side note, observe that I don't use double percentages. The reason why is explained here.
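
For comparison, the double-percentage form would look like the sketch below; with %%, sbt appends the project's Scala binary version (here 2.13) to the artifact name automatically, whereas the single % above spells the suffix out by hand.

// With %% sbt would look for spark-core_2.13 rather than spark-core_2.12
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1"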

In order to compile our code, run sbt from the directory where the build.sbt script is located. This will bring up the sbt console. Once in the console, type compile to build the project. When the compilation finishes, and still in the sbt console, type package to create the application .jar file. The whole process for this application should not take long. Once finished, we can submit our application to Spark for execution; for convenience I use the bash shell script shown after the sketch below.
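
The same two steps can also be run non-interactively; a minimal sketch, assuming sbt is on your PATH and is run from the project root, is

# run from the directory containing build.sbt
sbt compile    # compile the sources
sbt package    # produce target/scala-2.13/hello-spark_2.13-0.0.1.jar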

/home/alex/MySoftware/spark-3.0.1-bin-hadoop2.7/bin/spark-submit \
  --class "spark.HelloSpark" \
  --master local[4] \
  target/scala-2.13/hello-spark_2.13-0.0.1.jar

Notice how I specify the class to execute by prefixing it with the package name it belongs to. Upon execution of the script you should see something similar to what follows

Spark version:            3.0.1
Spark master:             local[4]
Spark running 'locally'?: true

According to the official documentation, applications should define a main() method instead of extending scala.App, as the latter may not work correctly.
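
For concreteness, the discouraged pattern would look something like the hypothetical sketch below (HelloSparkApp is an illustrative name, not part of the project above); prefer the explicit main() method used in HelloSpark.

package spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

// Discouraged: the object's body is run through scala.App's initialization
// machinery instead of an explicit main() method, which the Spark
// documentation warns may not work correctly.
object HelloSparkApp extends App {
  val conf = new SparkConf().setAppName("Hello Scala Spark")
  val sc = new SparkContext(conf)
  println("Spark version: " + sc.version)
  sc.stop()
}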

Summary

In this post, I described the steps needed in order to build and submit a Scala application to Spark.