spark dataset api

Scala is statically typed, but Python and R are not. What this means is that type checking is done at compile time for Scala and at run time for Python, R, and any other dynamically typed language. If you want to catch errors early, you want to use a statically typed language. However, this makes it more stringent and less flexible, so there are tradeoffs. Generally my preference is to go with a statically typed language (and yes, this is an acquired taste).
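As a minimal illustration of the difference (the names here are made up for the example), the Scala compiler rejects a type mismatch before the program ever runs, while a dynamically typed language would only fail once the offending line executes:

// in Scala
object TypeCheckDemo {
  def main(args: Array[String]): Unit = {
    val count: Int = 5          // fine: the value matches the declared type
    // val wrong: Int = "five"  // does not compile: type mismatch caught before the program runs
    println(count + 1)
  }
}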

Spark has a structured API called Datasets for writing statically typed code in Java and Scala. It is not available for Python and R, for the reasons explained above.

DataFrames are a distributed collection of objects of type Row that can hold various types of tabular data. The Dataset API gives users the ability to assign a Java/Scala class to the records within a DataFrame and manipulate it as a collection of typed objects, similar to a Java ArrayList or Scala Seq. The APIs available on Datasets are type-safe, meaning that you cannot accidentally view the objects in a Dataset as being of another class than the class you put in initially. This makes Datasets especially attractive for writing large applications, with which multiple software engineers must interact through well-defined interfaces.

Let's look at defining a Dataset. We can use a Scala case class.

A Scala case class comes with a default apply method, which means it can build the objects for us, and it works well with pattern matching. All of its fields are vals, so it is immutable and great for modeling immutable data.

// in Scala
case class Flight(DEST_COUNTRY_NAME: String,
                  ORIGIN_COUNTRY_NAME: String,
                  count: BigInt)

import spark.implicits._   // brings the case class encoder into scope for .as[Flight]

val flightsDF = spark.read
  .parquet("/data/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]

The flights val is a Dataset[Flight] built on top of the flightsDF DataFrame.
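Because flights is typed as Dataset[Flight], operations on it are checked against the case class at compile time. The filter, map, and pattern match below are illustrative sketches rather than part of the original example:

// in Scala
// the lambdas receive a Flight, not a generic Row
val fromUS = flights
  .filter(f => f.ORIGIN_COUNTRY_NAME == "United States")
  .map(f => f.copy(count = f.count + 1))   // still a Dataset[Flight]

// a typo such as f.ORIGIN_COUNTRY fails at compile time, whereas
// col("ORIGIN_COUNTRY") on a DataFrame would only fail at run time

// the case class also gives us pattern matching on individual records
fromUS.take(5).foreach {
  case Flight(dest, origin, cnt) => println(s"$origin -> $dest: $cnt")
}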

mount ADLS in Databricks with SPN and OAuth2

Here is the overall flow to mount the ADLS store in Databricks using OAuth.

Steps to mount the Data Lake file system in Azure Databricks:

The first step is to register an app in Azure Active Directory.

This creates the application (client) ID and the directory (tenant) ID.

Within the Azure AD app registration -> create a client secret -> once generated, copy the key value.

Once it is hidden, it stays hidden forever, so it is very important to remember to store the secret.

This secret key gets exchanged for a token when we mount the file system.

  • next step: store the key in Key Vault

Open up Key Vault -> click on Generate/Import -> paste in the secret generated in the previous step.

  • once this step is done, go to Databricks
  • why there is no direct link to create a scope is beyond me, but there are two options: the web method or the Databricks CLI. I will use the web method to create the scope and cover the Databricks CLI later; it is my preferred approach, but I have not gotten to it yet.

First step: go to Key Vault and get the DNS name and resource ID.

Once you have these, go to the web page as shown in step 6 below and copy in the corresponding DNS name and resource ID.

In this case, we have created a scope called dbtravelscope.
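As a quick sanity check from a notebook (the calls below are standard dbutils.secrets calls; the scope name matches the one created above), we can confirm the scope is wired up before attempting the mount:

// in Scala
// the new scope should show up in the list of scopes visible to the workspace
dbutils.secrets.listScopes().foreach(println)

// and the secrets backed by the Key Vault behind it should be listed
dbutils.secrets.list("dbtravelscope").foreach(println)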

At this point we have created a scope with the client secret stored. We should be able to proceed with the steps outlined in the link below to get ADLS mounted on Databricks:

https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html#mount-an-azure-data-lake-storage-gen2-account-using-a-service-principal-and-oauth-20&language-scala
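For completeness, the mount call described on that page looks roughly like the sketch below. The storage account, container, mount point, secret name, and the application (client) and directory (tenant) IDs are placeholders to be replaced with your own values:

// in Scala
// OAuth configuration for ADLS Gen2 using the SPN registered earlier;
// the client secret is pulled from the Key Vault-backed dbtravelscope scope
val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" ->
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<application-client-id>",
  "fs.azure.account.oauth2.client.secret" ->
    dbutils.secrets.get(scope = "dbtravelscope", key = "<secret-name>"),
  "fs.azure.account.oauth2.client.endpoint" ->
    "https://login.microsoftonline.com/<directory-tenant-id>/oauth2/token"
)

// mount the container so it appears under the chosen mount point
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs
)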