Schemas in Spark DataFrames

A schema is a StructType made up of a number of fields, StructFields, each of which has a name, a type, a Boolean flag that specifies whether the column can contain missing or null values, and, optionally, column metadata.
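As a quick illustration of those pieces, here is a small sketch that builds a two-field schema and prints each StructField's name, type, and nullability (the field names are placeholders, not from the dataset used below):

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// A schema is simply a StructType holding an ordered collection of StructFields.
val exampleSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("count", LongType, nullable = false)))

exampleSchema.fields.foreach(f =>
  println(s"${f.name} : ${f.dataType.simpleString} : nullable=${f.nullable}"))
// name : string : nullable=true
// count : bigint : nullable=false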

Schema-on-read, i.e. inferring the schema from the data in a given DataFrame, is fine for ad-hoc analysis, but from a performance perspective it is better to define the schema manually. This has two advantages: 1. better performance, because the burden of schema inference is lifted, and 2. better precision, because a column that should be a Long could otherwise be incorrectly inferred as an Integer, for example.
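To see schema-on-read in action before defining anything by hand, you can let Spark infer the schema and print the result; a minimal sketch assuming the same Databricks file path used in the next step:

// Spark scans the JSON data at read time to guess each column's type.
val inferredDf = spark.read.format("json")
  .load("dbfs:/FileStore/tables/2015_summary.json")

// Print whatever types the inference pass chose; with a manually defined
// schema (next step) this inference pass is skipped entirely.
inferredDf.printSchema()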

The next steps show how to define a schema manually.

import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType}
import org.apache.spark.sql.types.Metadata

val myManualSchema = StructType(Array(
  StructField("DEST_COUNTRY_NAME", StringType, true),
  StructField("ORIGIN_COUNTRY_NAME", StringType, true),
  StructField("count", LongType, false,
    Metadata.fromJson("{\"somekey\":\"somemetadata\"}"))))

val dfWithManualSchema = spark.read.format("json").schema(myManualSchema).load("dbfs:/FileStore/tables/2015_summary.json")
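To confirm the declared schema was applied, and to read the attached metadata back, something like the following should work:

// Inspect the schema actually attached to the DataFrame.
dfWithManualSchema.printSchema()

// The custom metadata lives on the StructField and can be read back by key.
val countField = myManualSchema("count")
println(countField.metadata.getString("somekey"))   // somemetadata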

Metadata is a wrapper over Map[String, Any] that limits the value type to simple ones: Boolean, Long, Double, String, Metadata, Array[Boolean], Array[Long], Array[Double], Array[String], and Array[Metadata]. JSON is used for serialization.

The default constructor is private. Users should use either MetadataBuilder or Metadata.fromJson() to create Metadata instances.
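For example, the same metadata used above could be built with MetadataBuilder instead of a JSON string (the extra "version" key here is purely illustrative):

import org.apache.spark.sql.types.{Metadata, MetadataBuilder}

// Build Metadata key by key instead of parsing a JSON string.
val meta: Metadata = new MetadataBuilder()
  .putString("somekey", "somemetadata")
  .putLong("version", 1L)             // hypothetical extra key, for illustration only
  .build()

println(meta.json)                    // serialized back to JSON
println(meta.getString("somekey"))    // somemetadata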

Internally, a Metadata instance is backed by an immutable map that stores the data.