Overview

In this notebook, we will explore the Column type API. The source code can be found here.

Explore the Column API

Named columns in a DataFrame are similar to named columns in pandas data frames; they describe a type of field [1]. This is also similar to an RDBMS table. In Spark, a column is an object represented by the Column type [1]. Let's explore what we can do with columns in Spark. The examples are taken from [1]. The following Scala snippet shows some of the Column type API functions in action.


package train.spark

/*

Explore the DataFrame Column API

*/

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

object ExploreDataFrameColumnAPI {


  def main(args: Array[String]) {

    val csvFile = "/home/alex/qi3/learn_scala/scripts/spark/data/train.csv" 
    val appName: String = "Spark DataFrame API Demo"

    val spark = SparkSession
    .builder()
    .appName(appName)
    .getOrCreate()

    // specify the schema 
    val customSchema = StructType(Array(
            StructField("mu-1", DoubleType, false),
            StructField("mu-2", DoubleType, false),
            StructField("label", IntegerType, false)))

    // read the data frame
    val df = spark.read.schema(customSchema).csv(csvFile)

    // print the schema
    df.printSchema()

    // get the columns
    df.columns

    // access a particular column by using col
    // it returns a Column type
    val colId = df.col("Id")

    // we can use expressions on columns
    df.select(expr("Hits * 2")).show(2)

    // compute a value
    df.select(col("Hits") * 2).show(2)


    // add a new column in the data frame
    df.withColumn("Big Hitters", (expr("Hits > 10000"))).show() 

    // concatenate columns and create a new column
    df.withColumn("AuthorsId", (concat(expr("First"), expr("Last"), expr("Id"))))
      .select(col("AuthorsId"))
      .show(4)

  }
}

References

  1. Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee, Learning Spark. Lighting-fast data analytics, O'Reilly, 2nd Edition.