Apache Spark: Explore the Column API
In this notebook, we explore the Column type API. The source code can be found here.
Named columns in a DataFrame are similar to named columns in a pandas DataFrame: they describe a typed field [1]. This is also similar to a column in an RDBMS table. In Spark, a column is an object represented by the Column type [1]. Let's explore what we can do with columns in Spark; the examples are taken from [1]. The following Scala snippet shows some of the Column type API functions in action.
package train.spark

/*
 * Explore the DataFrame Column API
 */
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

object ExploreDataFrameColumnAPI {

  def main(args: Array[String]): Unit = {

    val csvFile = "/home/alex/qi3/learn_scala/scripts/spark/data/train.csv"
    val appName: String = "Spark DataFrame API Demo"

    val spark = SparkSession
      .builder()
      .appName(appName)
      .getOrCreate()

    // specify the schema
    val customSchema = StructType(Array(
      StructField("mu-1", DoubleType, false),
      StructField("mu-2", DoubleType, false),
      StructField("label", IntegerType, false)))

    // read the data frame
    val df = spark.read.schema(customSchema).csv(csvFile)

    // print the schema
    df.printSchema()

    // get the column names
    println(df.columns.mkString(", "))

    // access a particular column by using col;
    // it returns a Column type
    val colLabel = df.col("label")

    // we can use SQL expressions on columns; hyphenated
    // names must be quoted with backticks
    df.select(expr("`mu-1` * 2")).show(2)

    // compute the same value with Column arithmetic
    df.select(col("mu-1") * 2).show(2)

    // add a new boolean column to the data frame
    df.withColumn("mu-1-positive", expr("`mu-1` > 0.0")).show()

    // concatenate columns (cast to strings) and create a new column
    df.withColumn("rowId",
        concat(col("mu-1").cast("string"), col("mu-2").cast("string"), col("label").cast("string")))
      .select(col("rowId"))
      .show(4)

    spark.stop()
  }
}
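Column expressions are also what we pass to filtering and ordering operations. The short sketch below is a minimal illustration, assuming the same df and train.csv schema (mu-1, mu-2, label) as above; it shows filtering with a Column predicate, a descending sort, and naming a derived column with alias.

// a minimal sketch, assuming the df defined above is in scope
import org.apache.spark.sql.functions._

// keep only rows whose label equals 1 and show the largest mu-1 values first
df.filter(col("label") === 1)
  .orderBy(col("mu-1").desc)
  .show(3)

// give a derived column a readable name with alias
df.select((col("mu-1") + col("mu-2")).alias("mu-sum")).show(3)

Note that === is the Column equality test (Scala's == would compare the Column objects themselves), and alias controls the column name that appears in the output.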
References
[1] Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee, Learning Spark: Lightning-Fast Data Analytics, 2nd Edition, O'Reilly.