We will now get into some more details about DataFrames.
A column in a DataFrame is really a subset of an expression. An expression is a set of transformations that you apply to one or more values in a record (row) of a DataFrame, so expr("colname") is essentially the same as col("colname"). Columns and transformations of columns compile to the same logical plan as parsed expressions.
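A minimal sketch of this, using a throwaway DataFrame with a hypothetical numeric column someCol: col("someCol") - 5 and expr("someCol - 5") describe the same transformation, which explain() confirms by printing the same plan.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

val spark = SparkSession.builder().master("local[*]").appName("expr-demo").getOrCreate()
import spark.implicits._

// Hypothetical DataFrame with a single numeric column "someCol".
val df = Seq(10, 20, 30).toDF("someCol")

// Both selects compile to the same logical plan.
df.select(col("someCol") - 5).explain()
df.select(expr("someCol - 5")).explain()
```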
- the col function and the column function are the same from a functionality perspective
- the columns method gives programmatic access to all the columns in a DataFrame, just like printSchema does for interactive use
- in Scala we can use .toDF to convert a Seq to a DataFrame, but this does not handle null types well, so it is not recommended for production use
- the preferred approach is to use the createDataFrame function and pass Row objects plus a schema to it, so you are not relying on implicits to infer the schema the way the toDF method does (see the sketch below)
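A sketch of that approach with a hypothetical two-column schema; createDataFrame takes the Row objects plus an explicit StructType, so nothing depends on implicit schema inference, and nullability is declared up front:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("create-df").getOrCreate()

// Explicit schema: each field's type and nullability are declared,
// so null values are handled cleanly (column names are made up).
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("count", LongType, nullable = false)
))

val rows = Seq(Row("alpha", 1L), Row(null, 2L))
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Programmatic access to the column names, and the interactive view.
df.columns.foreach(println)
df.printSchema()
```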
- select and selectExpr allow you to do the DataFrame equivalent of SQL queries on a table of data
- you can refer to a column in Scala in multiple ways
df.col("colname"), col("colname"), column("colname"), 'colname, $"colname", and expr("colname") all resolve to the same column
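For example (assuming any DataFrame df with a column named someCol, as above), all of the references below select the same column; the 'someCol and $"someCol" forms need the spark.implicits._ import:

```scala
import org.apache.spark.sql.functions.{col, column, expr}
import spark.implicits._ // enables the 'someCol and $"someCol" syntax

// Every reference below resolves to the same column.
df.select(
  df.col("someCol"),
  col("someCol"),
  column("someCol"),
  'someCol,
  $"someCol",
  expr("someCol")
).show(2)
```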
- because select is often followed by a series of expressions, selectExpr lets you combine it all in one call
sample code
df.selectExpr("avg(COLCOUNT)", "count(distinct(COLSOMEOTHER))").show(2) in Scala is the same as SELECT avg(COLCOUNT), count(distinct(COLSOMEOTHER)) FROM dfTable LIMIT 2 in Spark SQL
both the Spark SQL and the Scala code evaluate to the same logical plan, so most people stick with Spark SQL since that is easier and more familiar
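A sketch of the equivalence, keeping the hypothetical column names from the sample above; the SQL version needs the DataFrame registered as a temp view first:

```scala
// Register the DataFrame so Spark SQL can see it as a table.
df.createOrReplaceTempView("dfTable")

// DataFrame API version.
df.selectExpr("avg(COLCOUNT)", "count(distinct(COLSOMEOTHER))").show(2)

// Equivalent Spark SQL version; both compile to the same logical plan.
spark.sql("""
  SELECT avg(COLCOUNT), count(distinct(COLSOMEOTHER))
  FROM dfTable
  LIMIT 2
""").show()
```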
The lit() function creates a column object out of a literal value. You can use lit(20) to pass the literal integer value 20 to Spark. We can then use the withColumn function with this lit(20), and it will create a new column containing 20 in every row of the DataFrame.
df.withColumn("constvalcol", lit(20)).show(10) will add a new column named constvalcol containing 20 in every row and display 10 rows.
you can use cast on a column just like you would in SQL
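For example, assuming a column named count that was read in as a string, the cast below does the same thing as CAST(count AS long) would in SQL:

```scala
import org.apache.spark.sql.functions.col

// Cast the (hypothetical) "count" column from string to long.
df.withColumn("count", col("count").cast("long")).printSchema()
```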
to drop a column (or several columns) from a DataFrame, there are 2 ways
df.drop("coltobedropped").columns or df.drop("coltobedropped1", "coltobedropped2")
you can use df.filter(col("count") > 5) or df.where(col("count") > 5) to filter rows; filter and where are equivalent
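A small sketch of chained filtering, with the column names assumed; successive where calls are ANDed together:

```scala
import org.apache.spark.sql.functions.col

// filter and where are aliases; chained calls combine with AND.
df.where(col("count") > 5)
  .where(col("someCol") =!= "excludedValue")
  .show(5)
```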