spark-dataset-api

Scala is statically typed, but Python and R are not. This means that type checking is done at compile time for Scala and at run time for Python, R, and any other dynamically typed language. If you want to catch errors early, you want a statically typed language. However, this makes the code more rigid and less flexible, so there are tradeoffs. My general preference is to go with a statically typed language (and yes, this is an acquired taste).

Spark has a structured API called Datasets for writing statically typed code in Java and Scala. It is not available in Python and R, for the reasons explained above.

DataFrames are a distributed collection of objects of type Row that can hold various types of tabular data. The Dataset API gives users the ability to assign a Java/Scala class to the records within a DataFrame and manipulate it as a collection of typed objects, similar to a Java ArrayList or Scala Seq. The APIs available on Datasets are type-safe, meaning that you cannot accidentally view the objects in a Dataset as being of another class than the class you put in initially. This makes Datasets especially attractive for writing large applications, with which multiple software engineers must interact through well-defined interfaces.

Let's look at defining a Dataset. We can use a Scala case class for this.

A Scala case class comes with a default apply method, which means it can construct objects for us, and it works naturally with pattern matching. All of its fields are vals, so it is immutable and great for modeling immutable data.
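To make that concrete before we get to the Spark example, here is a minimal sketch (using a hypothetical Airport class, independent of Spark) showing the generated apply method and pattern matching:

// in Scala
case class Airport(code: String, country: String)

// the generated apply method lets us construct an instance without `new`
val sfo = Airport("SFO", "United States")

// pattern matching destructures the fields; all fields are vals, so sfo is immutable
val description = sfo match {
  case Airport(code, "United States") => s"$code is a US airport"
  case Airport(code, country)         => s"$code is in $country"
}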

// in Scala
case class Flight(DEST_COUNTRY_NAME: String,
                  ORIGIN_COUNTRY_NAME: String,
                  count: BigInt)

// spark.implicits._ provides the Encoder that .as[Flight] needs
// (it is imported automatically in spark-shell)
import spark.implicits._

val flightsDF = spark.read
  .parquet("/data/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]

The flights val is a Dataset built on top of the flightsDF DataFrame.
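Because flights is a Dataset[Flight], transformations take and return Flight objects rather than generic Rows, so a typo in a field name fails at compile time instead of at run time. A minimal sketch of a typed filter, assuming the same flights Dataset defined above:

// in Scala
// filter receives a Flight, so the compiler checks the field access
val usOrigins = flights
  .filter(f => f.ORIGIN_COUNTRY_NAME == "United States")

// actions return plain Flight objects, not Rows
val sample: Array[Flight] = usOrigins.take(3)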