Spark's distributed architecture

The following diagram shows how the components of Spark are structured.

RDDs present a logical, distributed data model over the underlying storage.

An RDD is resilient and immutable: a transformation is never applied in place. Instead, each transformation is recorded and produces a new RDD.

RDD(1) -> transformation -> RDD(2) -> transformation -> RDD(3)

RDDs are compile-time type-safe (in Scala and Java), and the data they hold can be structured or unstructured.

RDDs follow a lazy execution model: transformations are not executed until an action is invoked.

Datasets and DataFrames

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have support for the Dataset API, but due to Python's dynamic nature, many of its benefits are already available (i.e. you can access the field of a row by name naturally: row.columnName). The case for R is similar.

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Python, Scala, Java and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows: in the Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API users need to use Dataset&lt;Row&gt; to represent a DataFrame.

The picture above shows Spark's internal architecture.

The driver is where the SparkContext or SparkSession is created; it is, in a sense, the brain of the application. It works with the cluster manager to acquire the resources it needs, and once it has them it distributes tasks to the allocated worker nodes. Executors are processes on worker nodes that execute those tasks. In Databricks, a SparkSession is created automatically.

A Spark application will have multiple jobs (in Databricks, a workflow is a comparable collection of jobs). Each job is broken into stages, and each stage is in turn broken into tasks.

Tasks within a stage run in parallel in a shared-nothing mode.