A schema is a StructType made up of a number of fields, StructFields, each of which has a name, a type, and a Boolean flag that specifies whether the column can contain missing or null values.
Schema-on-read (inferring the schema from a given DataFrame) is fine for ad-hoc analysis, but from a performance perspective it is better to define the schema manually. This has two advantages: 1. better performance, since the burden of schema inference is lifted, and 2. better precision, since an inferred type can be set incorrectly (for example, a Long column inferred as an Integer).
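For comparison, here is a minimal sketch of schema-on-read (assuming the same 2015_summary.json file used below): Spark scans the data and infers a schema, which you can then inspect.

// Schema-on-read: no explicit schema, Spark infers the types itself
val dfInferred = spark.read.format("json")
  .load("dbfs:/FileStore/tables/2015_summary.json")

// Inspect what Spark came up with; the inferred types may not match your intent
dfInferred.printSchema()
dfInferred.schema // the StructType of StructFields described above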
The next steps show how to define a schema manually:
import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType}
import org.apache.spark.sql.types.Metadata

val myManualSchema = StructType(Array(
  StructField("DEST_COUNTRY_NAME", StringType, true),
  StructField("ORIGIN_COUNTRY_NAME", StringType, true),
  StructField("count", LongType, false,
    Metadata.fromJson("{\"somekey\":\"somemetadata\"}"))
))

val dfWithManualSchema = spark.read.format("json")
  .schema(myManualSchema)
  .load("dbfs:/FileStore/tables/2015_summary.json")
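A quick check (a sketch assuming the val names above) confirms the manually defined schema was applied rather than inferred, and shows how the attached metadata can be read back from the schema:

// Confirm the schema was applied exactly as defined
dfWithManualSchema.printSchema()

// Look up the count field and read back the metadata attached to it
dfWithManualSchema.schema("count").metadata.getString("somekey") // "somemetadata"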
Metadata is a wrapper over Map[String, Any] that limits the value type to simple ones: Boolean, Long, Double, String, Metadata, Array[Boolean], Array[Long], Array[Double], Array[String], and Array[Metadata]. JSON is used for serialization.
The default constructor is private; users should use either MetadataBuilder or Metadata.fromJson() to create Metadata instances. Internally, an immutable map stores the data.
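As a small sketch of the MetadataBuilder route (the key and value names are just placeholders), the same metadata used in the schema above can be built programmatically instead of parsed from a JSON string:

import org.apache.spark.sql.types.{Metadata, MetadataBuilder}

// Build metadata programmatically instead of via Metadata.fromJson
val builtMetadata: Metadata = new MetadataBuilder()
  .putString("somekey", "somemetadata")
  .build()

// Metadata serializes to JSON, so this prints {"somekey":"somemetadata"}
println(builtMetadata.json)

// It can be attached to a StructField exactly like the fromJson version:
// StructField("count", LongType, false, builtMetadata)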