Spark Catalyst at Work

Overview

In this lab, we will look at several transformations and examine the optimizations that Catalyst performs.


We’ll first work with the Wikimedia page count data again and see how Catalyst optimizes queries that involve filtering.

// Load the file into Spark
> val viewsDF = spark.read.text("[path to file]")
// Split on whitespace
> val splitViewsDF = viewsDF.select(split('value, "\\s+").as("splitLine"))
// Use a better schema
> val viewsWithSchemaDF = splitViewsDF.select(
    'splitLine(0).as("domain"),
    'splitLine(1).as("pageName"),
    'splitLine(2).cast("integer").as("viewCount"),
    'splitLine(3).cast("long").as("size"))
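
As a quick sanity check, confirm that the columns and types came out as intended:

// Print the schema we just assigned
> viewsWithSchemaaDF.printSchema

This should report domain and pageName as strings, viewCount as an integer, and size as a long.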

Push Down Predicate
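
Filtering viewsWithSchemaDF directly will not push anything into the data source: its columns are computed by split, so the filter can only run after the text scan. To see a predicate actually pushed down, it helps to use a columnar format. As a minimal sketch (the /tmp scratch path is an assumption; use any path you can write to), persist the data as Parquet, read it back, and explain a filtered query:

// Write the data out in a columnar format that supports pushdown
> viewsWithSchemaDF.write.mode("overwrite").parquet("/tmp/views.parquet")
// Read it back and filter on viewCount
> val viewsParquetDF = spark.read.parquet("/tmp/views.parquet")
> viewsParquetDF.filter('viewCount > 100).explain

In the FileScan line of the physical plan, look for the PushedFilters entry: Catalyst hands the comparison to the Parquet reader, which can skip rows (and whole row groups) that cannot match before they ever reach Spark.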

Work with Datasets and lambdas

We’ll now create a Dataset and filter it using a lambda, and see how the lambda affects Catalyst’s optimizations.

// Define a case class matching the schema, then convert the DataFrame to a Dataset
> case class WikiViews(domain: String, pageName: String, viewCount: Integer, size: Long)
> val viewsDS = viewsWithSchemaDF.as[WikiViews]
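
As a sketch, filter the Dataset twice, once with a Scala lambda and once with a Column expression, and compare the plans:

// A lambda: Catalyst sees only an opaque function
> viewsDS.filter(v => v.viewCount > 100).explain
// The same predicate as a Column expression, which Catalyst can analyze
> viewsDS.filter('viewCount > 100).explain

In the lambda version the plan has to deserialize each row back into a WikiViews object and apply the function as a black box, so the predicate cannot be pushed toward the source or combined with other filters. The Column version appears as an ordinary Filter node that Catalyst is free to optimize.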

Review Other Transformations

Here folksDF is a DataFrame loaded from a JSON file of people records; as the FileScan line below shows, it has age, gender, and name columns. Filter and sort it, then examine the plan:

> folksDF.filter('age > 25).orderBy('age).explain
== Physical Plan ==
*Sort [age#42L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(age#42L ASC NULLS FIRST, 200)
   +- *Project [age#42L, gender#43, name#44]
      +- *Filter (isnotnull(age#42L) && (age#42L > 25))
         +- *FileScan json [age#42L,gender#43,name#44] Batched: false, Format: JSON, Location: ..., PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,25)], ReadSchema: struct<age:bigint,gender:string,name:string>
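
Read the plan from the bottom up: the FileScan already carries our predicate in PushedFilters, Catalyst has added an IsNotNull(age) check alongside the age > 25 comparison, the Project keeps only the columns that are needed, and the Exchange is the shuffle that rangepartitioning introduces so the final Sort can run in parallel across partitions.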

Results

Catalyst rules!

We’ve explained it; now you see it.

But lambdas can overthrow the ruler: Catalyst can’t see inside them, so use them with care.