Ever wondered whether Parquet or Avro is the better choice for your data lake? Well, let the numbers speak for themselves.
Here is a simple job in Azure Data Factory that picks up a table from MySQL and loads it into the data lake.
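Data Factory handles the extract for you, but if you want to reproduce the experiment by hand, the source step is just a table read. Here is a minimal sketch in Python using pandas and SQLAlchemy; the connection string, database, and table name are placeholders, not the actual job settings.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical MySQL connection; swap in your own server, database, and credentials.
engine = create_engine("mysql+pymysql://user:password@myserver:3306/sales")

# 'orders' is a placeholder table name used throughout these sketches.
df = pd.read_sql("SELECT * FROM orders", engine)
print(f"{len(df)} rows pulled from MySQL")
```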

We are loading 34.86 MB of data from the source system. This can be seen on the left under the MySQL section, and on the right the data written is 17.551 MB, so the Avro format compressed the data by roughly 2X. The copy task lasted about 2 minutes 39 seconds.
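To see the same row-based Avro write outside of Data Factory, you can dump the DataFrame with fastavro. This is only a sketch: the schema fields below are invented for illustration, and your sizes will depend on your data and the codec you pick.

```python
import os
import fastavro

# Invented schema for the placeholder 'orders' table.
schema = fastavro.parse_schema({
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "int"},
        {"name": "customer", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

# Avro stores the records row by row in binary, with the schema kept as JSON in the header.
with open("orders.avro", "wb") as out:
    fastavro.writer(out, schema, df.to_dict("records"), codec="deflate")

print(f"Avro size: {os.path.getsize('orders.avro') / 1e6:.2f} MB")
```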
Enter Parquet – tada!
Let's run the same job, write the output to a Parquet file, and see what the numbers look like.

It's the same table loaded into the same Data Lake, and this time the data written on the right side is 2.57 MB. That's roughly 13.5X smaller than the 34.86 MB source, and about 7X smaller than the Avro output! Notice the throughput and copy duration numbers too; those look much better.
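The same comparison can be reproduced locally by writing the DataFrame from the earlier sketch to Parquet with pyarrow and comparing file sizes. Exact ratios will depend entirely on your data, so treat this as an illustration rather than a benchmark.

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# Write the same DataFrame as Parquet; snappy is the default pyarrow codec.
pq.write_table(pa.Table.from_pandas(df), "orders.parquet", compression="snappy")

avro_mb = os.path.getsize("orders.avro") / 1e6
parquet_mb = os.path.getsize("orders.parquet") / 1e6
print(f"Avro: {avro_mb:.2f} MB, Parquet: {parquet_mb:.2f} MB")
```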
Parquet is clearly the winner in this scenario. Avro is a row-based format that stores the contents in binary with the metadata in JSON; Parquet, however, is columnar (organized into row groups) with summary statistics added to the metadata. The columnar layout lends itself to much better compression, which is why we see the performance gains above. Enjoy!
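If you want to see those row groups and per-column statistics for yourself, pyarrow exposes the Parquet footer metadata. A quick sketch against the file written above (column index 0 is just an example):

```python
import pyarrow.parquet as pq

md = pq.ParquetFile("orders.parquet").metadata
print(f"{md.num_row_groups} row group(s), {md.num_columns} columns, {md.num_rows} rows")

# Per-column min/max/null-count stats let readers skip whole row groups at query time.
stats = md.row_group(0).column(0).statistics
print(stats.min, stats.max, stats.null_count)
```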