Parquet vs Avro

Ever wondered whether Parquet or Avro is the better file format for your data lake? Well, let the numbers speak for themselves.

Here is a simple job in Azure Data Factory that picks up a table in MySQL and loads it into the data lake.

[Image: Avro copy activity statistics (avrostat.jpg)]

We are loading 34.86 MB of data from the source system, shown on the left under the MySQL section. On the right, the data written reads 17.551 MB, so the Avro format compressed the data by about 2x. The copy task took about 2 minutes 39 seconds.
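If you want to sanity-check the Avro file size outside ADF, here is a minimal local sketch using fastavro. The record name, schema, and row count are made up for illustration, not taken from the job above.

```python
# A minimal local sketch (not the ADF job itself): write synthetic rows to
# Avro and measure the file size. Assumes fastavro is installed; the schema,
# record name, and row count are hypothetical.
import os
from fastavro import writer, parse_schema

schema = parse_schema({
    "name": "orders",  # hypothetical record name
    "type": "record",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "amount", "type": "double"},
        {"name": "status", "type": "string"},
    ],
})

rows = ({"id": i, "amount": i * 1.5, "status": "open"} for i in range(1_000_000))

with open("orders.avro", "wb") as out:
    # Avro is a row-oriented binary encoding; deflate mirrors a compressed write
    writer(out, schema, rows, codec="deflate")

print(f"Avro size: {os.path.getsize('orders.avro') / 1e6:.2f} MB")
```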

Enter Parquet, tada!

Let's run the same job, write the output to a Parquet file, and see what the numbers look like.

[Image: Parquet copy activity statistics (parquetstat.jpg)]

It's the same table loaded into the same data lake; let's see the numbers this time. The data written on the right side is 2.57 MB, which works out to roughly 13.5x compression against the 34.86 MB source (and nearly 7x smaller than the Avro output)! Notice the throughput and copy duration numbers too; those look much better.
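To reproduce the comparison locally, the sketch above can be extended to write Parquet with pyarrow and compare the two file sizes. Again, the data is synthetic and the file names are hypothetical.

```python
# Continuing the local sketch: write the same synthetic rows to Parquet
# with pyarrow and compare file sizes. File names are hypothetical.
import os
import pyarrow as pa
import pyarrow.parquet as pq

n = 1_000_000
table = pa.table({
    "id": list(range(n)),
    "amount": [i * 1.5 for i in range(n)],
    "status": ["open"] * n,
})

pq.write_table(table, "orders.parquet")  # snappy-compressed by default

parquet_mb = os.path.getsize("orders.parquet") / 1e6
print(f"Parquet size: {parquet_mb:.2f} MB")
if os.path.exists("orders.avro"):  # written by the Avro sketch above
    avro_mb = os.path.getsize("orders.avro") / 1e6
    print(f"Avro/Parquet ratio: {avro_mb / parquet_mb:.1f}x")
```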

Parquet is clearly the winner in this scenario. Avro is a row-based format that stores its contents in binary with the metadata in JSON, whereas Parquet is columnar (organized into row groups) with summary statistics added to the metadata. The columnar layout lends itself to much better compression, hence the performance gains we see above. Enjoy!
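To see those row groups and summary statistics for yourself, pyarrow can read them straight out of the Parquet footer. A small sketch, assuming the hypothetical orders.parquet file written above:

```python
# Peek at the Parquet footer: row groups plus per-column min/max statistics.
# Assumes the hypothetical orders.parquet file from the sketch above.
import pyarrow.parquet as pq

meta = pq.ParquetFile("orders.parquet").metadata
print(f"{meta.num_rows} rows in {meta.num_row_groups} row group(s)")

first = meta.row_group(0)
for i in range(first.num_columns):
    col = first.column(i)
    stats = col.statistics
    print(col.path_in_schema, col.compression,
          "min:", stats.min, "max:", stats.max)
```

Those per-column min/max values are what let query engines skip entire row groups at read time, which is the other half of Parquet's advantage beyond the smaller file size.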