Apache Spark. Submit a self-contained Scala application
Submit a Scala application to Spark.
In the post Apache Spark. Application concepts we went over some basic but core concepts associated with Spark. In this series of posts, we use Scala as the main programming language. Thus, in this post, we describe the steps you need to take in order to submit a Scala application to be executed by Spark. You should also check the official documentation on Self-contained Applications. The code snippets can be found in Computational Statistics with Scala.
I will assume that the environment is already set up; that is, Scala is already installed (I use version 2.13.3 in this post) and Spark is also installed (I use version 3.0.1). Finally, I am using SBT for building and packaging the application.
The Scala application simply informs us about the version of Spark we are using, the name of the master node, and whether we are running in local or distributed mode. It is shown below.
// HelloSpark.scala
package spark

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object HelloSpark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Hello Scala Spark")
    val sc = new SparkContext(conf)

    // Report basic information about the Spark context we are connected to.
    println("Spark version: " + sc.version)
    println("Spark master: " + sc.master)
    println("Spark running 'locally'?: " + sc.isLocal)
  }
}
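As a quick sanity check that the context can actually execute work, one could extend the application with a small RDD computation. The sketch below is a hypothetical variation, not part of the original application; it parallelizes a small collection, runs a trivial action, and stops the context when done.

// HelloSparkCount.scala (hypothetical variation for a quick sanity check)
package spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object HelloSparkCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Hello Scala Spark Count")
    val sc = new SparkContext(conf)

    // Distribute a small collection across the executors and run a trivial action.
    val rdd = sc.parallelize(1 to 100)
    println("Sum of 1..100: " + rdd.sum())
    println("Number of partitions: " + rdd.getNumPartitions)

    // Release the resources held by the context.
    sc.stop()
  }
}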
Notice that I placed the application under the spark package. This will be used when submitting the class to Spark. For the moment, we need to structure our code properly for SBT to work. In particular, we need a file structure as follows:
build.sbt
src/
  main/
    scala/
      spark/HelloSpark.scala
The build.sbt script is shown below.
name := "Hello Spark"
version := "0.0.1"
scalaVersion := "2.13.3"
libraryDependencies += "org.apache.spark" % "spark-core_2.12" % "3.0.1"
libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "3.0.1"
As a side note, observe that I don't use double percentages (%%). The reason why is explained here.
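For comparison only, here is a hypothetical sketch of the more common %% form, which makes sbt append the project's Scala binary version to the artifact name; it assumes the Spark artifacts are published for that Scala version (Spark 3.0.1 is published for Scala 2.12).

// Hypothetical alternative build.sbt, shown only to contrast with the single-% form above.
// With %%, sbt resolves spark-core_<scalaBinaryVersion>, so scalaVersion must be one for
// which Spark 3.0.1 artifacts exist (Scala 2.12).
scalaVersion := "2.12.12"

libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1"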
In order to compile our code, call sbt at the level where the build.sbt script is located. This will bring up the sbt console. Once in the console, type compile to build the project. When the compilation finishes, and still in the sbt console, type package to create the application .jar file. The whole process for this application should not take long. Once finished, we can submit our application to Spark for execution. I use the following bash shell script for convenience.
/home/alex/MySoftware/spark-3.0.1-bin-hadoop2.7/bin/spark-submit \
--class "spark.HelloSpark" \
--master local[4] \
target/scala-2.13/hello-spark_2.13-0.0.1.jar
Notice how I specify the class to execute by prefixing it with the package name it belongs to. Upon execution of the script, you should see something similar to the following:
Spark version: 3.0.1
Spark master: local[4]
Spark running 'locally'?: true
According to the official documentation, applications should define a main() method instead of extending scala.App, as the latter may not work correctly.
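For contrast, the discouraged form would look like the sketch below (hypothetical, shown only to illustrate the pattern to avoid). The body of a scala.App object runs through delayed initialization rather than an explicit main() method, which can interact badly with how Spark launches the application.

// Discouraged: extending scala.App instead of defining a main() method (for contrast only).
package spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object HelloSparkApp extends App {
  // These statements run via delayed initialization, which is why the
  // official documentation advises defining main() instead.
  val conf = new SparkConf().setAppName("Hello Scala Spark")
  val sc = new SparkContext(conf)
  println("Spark version: " + sc.version)
  sc.stop()
}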
In this post, I described the steps needed in order to build and submit a Scala application to Spark.