spark – basics

we will now get into some more detail about DataFrames

a column in a DataFrame is really a subset of an expression. an expression is a set of transformations applied to one or more values in a record (row) of a DataFrame, so expr("colname") is essentially the same as col("colname"). columns and transformations of columns compile to the same logical plan as parsed expressions
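
as a quick sketch of that equivalence (the column names someCol and otherCol are made up for illustration), both of the following describe the same transformation and compile to the same logical plan:

import org.apache.spark.sql.functions.{col, expr}

// column arithmetic built up with the col function
(((col("someCol") + 5) * 200) - 6) < col("otherCol")

// the same logic written as a parsed expression string
expr("(((someCol + 5) * 200) - 6) < otherCol")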

  1. the col function and the column function are the same from a functionality perspective
  2. the columns method gives programmatic access to all columns in a data frame, in the same way printSchema is useful for interactive inspection.
  3. in scala we can use .toDF to convert a Seq to a dataframe, but this does not handle null types well, so it is not recommended for use in production
  4. the preferred approach is to use the createDataFrame function and pass it the row objects and the schema, so you are not relying on the implicits to infer the schema the way the toDF method does (see the sketch after this list)
  5. select and selectExpr allow you to do the DataFrame equivalent of SQL queries on a table
  6. you can refer to a column in scala in multiple ways
df.col("colname") or 
col("colname") or 
column("colname") or 
'colname  or 
$"colname" or
expr("colname")

all of the statements above resolve to the same column 
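
as mentioned in point 4 above, here is a minimal sketch of the createDataFrame approach (the schema, column names and values are made up, and a spark session named spark is assumed, as in spark-shell):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType}

// declare the schema explicitly instead of relying on toDF's implicit inference
val myschema = StructType(Array(
  StructField("colname", StringType, true),
  StructField("colcount", LongType, false)
))

// build Row objects that match the schema and hand both to createDataFrame
val myrows = Seq(Row("somevalue", 1L), Row(null, 2L))
val mydf = spark.createDataFrame(spark.sparkContext.parallelize(myrows), myschema)
mydf.show()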
 
  • because select is often followed by a series of expressions, selectExpr can be used to combine both steps into one call

sample code

df.selectExpr("avg(COLCOUNT)", "count(distinct(COLSOMEOTHER))").show(2)

in scala is the same as 

SELECT avg(COLCOUNT), count(DISTINCT COLSOMEOTHER) FROM dftable LIMIT 2

in spark sql 

both the spark SQL and the scala code evaluate to the same logical plan, so most people stick with spark SQL since that's easier and more familiar
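
if you want to actually run the SQL version, a quick sketch (assuming a spark session named spark; the view name dftable matches the query above):

// register the dataframe as a temp view so it can be queried with SQL
df.createOrReplaceTempView("dftable")
spark.sql("SELECT avg(COLCOUNT), count(DISTINCT COLSOMEOTHER) FROM dftable LIMIT 2").show()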

The lit() function creates a column object out of a literal value. you can use lit(20) to pass the literal integer value 20 to spark. we can pass this lit(20) to the withColumn function and it will create a new column whose value is 20 in every row of the dataframe

df.withColumn("constvalcol",lit(20)).show(10)

this will add a new column named constvalcol containing the value 20 in every row and display 10 rows. 

you can use cast just like you would in sql
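
for example, a small sketch (COLCOUNT and the new column name are placeholders):

// cast a column to long, same idea as CAST(COLCOUNT AS long) in SQL
df.withColumn("colcount_long", col("COLCOUNT").cast("long"))

// or the selectExpr equivalent
df.selectExpr("cast(COLCOUNT as long) as colcount_long")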

to drop a column (or several) from a df, there are 2 ways

df.drop("coltobedropped").columns 

or

df.drop("coltobedropped1", "coltobedropped2") 

you can use df.filter(col("count") > 5) or df.where(col("count") > 5) to filter
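
filters can also be chained, and spark plans them together; a quick sketch with made-up column names:

// chain multiple where clauses; spark combines them into one filter in the plan
df.where(col("count") > 5).where(col("COLSOMEOTHER") =!= "somevalue").show(2)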