There are certain data modeling advantages when working with MPP (massively parallel processing) columnar stores:
- Grain – Typically the grain of the fact table is set at the level you want to be able to drill down to in reports; this balances the performance and storage needs of the analytical database. With an MPP database, performance can be scaled out by adding nodes, and storage costs have fallen over time. This gives us the ability to store the fact table at the lowest grain even if current needs don't require that level of detail. The columnar format also lends itself to compression, and we can leverage that to reduce storage consumption.
- Distribution strategy – This is by far the most important aspect of a distributed parallel system. If all of the data sits on one node, you are not taking advantage of the rest of the nodes, so how you distribute the data is the single most important factor in deriving value from an MPP database. Here is some common-sense logic to consider when distributing the data (a runnable sketch of the cardinality argument appears after this list):
- Do not distribute on columns that are commonly used in the WHERE clause (i.e., filter columns): a filter on the distribution key routes the query to only the nodes that hold the matching values, and can exclude the rest of the cluster at execution time.
- Do not use dates as a distribution key. This divides the data by day (or whatever time unit you pick), and since most analytical queries filter on a date range, the few nodes holding that range do all the work while the others sit idle – reading data distributed by a time key will consistently perform badly.
- You can always add nodes, so use columns that have high cardinality (a large number of distinct values). If you have a 30-node cluster and the column you choose as your distribution key has only 10 distinct values, the data will be written to at most 10 of the 30 nodes – this is a simplistic view, but you get the point.
- If no single column has high enough cardinality, consider a multi-column distribution key; hashing on the combination of columns raises the effective cardinality.
- Denormalization is good – add the dimension attributes to the fact table. This avoids joins at query time, and columnar compression helps keep the size manageable (see the before/after sketch below).
- Slowly changing dimensions can be handled by adding another column (type 3) instead of adding a new row per change (type 2). In a columnar store the extra column compresses well and the table does not grow with every change, so type 3 is often considered the better fit here (see the SCD sketch below).
- With columnar stores, bulk loads are much more efficient, so use the bulk loader wherever possible. Standard row-by-row inserts should be avoided (see the load sketch below).
- Be very careful about updating distribution key columns: changing the key value moves the row to a different node, so such updates are expensive and can change the data skew.
- Try to avoid deletes unless absolutely necessary, and where you must delete, consolidate the deletes into batches. It is often better to write the surviving rows to a new table, drop the old one, and bulk load from there (see the drop-and-reload sketch below).
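To make the cardinality point concrete, here is a minimal, self-contained Python sketch. The node count, column names, and row counts are all invented for illustration; it simply hashes each row's distribution key to a node, the way an MPP engine conceptually does, and reports how evenly the rows land.

```python
import hashlib
import random

NODES = 30  # hypothetical cluster size, matching the 30-node example above

def node_for(key: str) -> int:
    # Hash the distribution-key value to a node - roughly what an MPP engine does.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NODES

random.seed(42)
rows = [
    {
        "region": f"region_{random.randrange(10)}",       # low cardinality: 10 distinct values
        "customer_id": str(random.randrange(1_000_000)),  # high cardinality
    }
    for _ in range(100_000)
]

candidate_keys = {
    "region (10 distinct values)": lambda r: r["region"],
    "customer_id (high cardinality)": lambda r: r["customer_id"],
    "region + customer_id (multi-column)": lambda r: r["region"] + "|" + r["customer_id"],
}

for label, key_fn in candidate_keys.items():
    per_node = [0] * NODES
    for r in rows:
        per_node[node_for(key_fn(r))] += 1
    used = sum(1 for count in per_node if count > 0)
    print(f"{label}: {used}/{NODES} nodes hold data, "
          f"rows per node min/max = {min(per_node)}/{max(per_node)}")
```

The low-cardinality key leaves most of the cluster empty, while the high-cardinality and multi-column keys spread the rows almost evenly – exactly the behavior the rules above are guarding against.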
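On the denormalization point, here is a toy before/after of the same report query; all table and column names are invented:

```python
# Normalized star schema: every report pays for the join to the dimension.
NORMALIZED_QUERY = """
SELECT d.region, SUM(f.amount) AS revenue
FROM sales_fact f
JOIN customer_dim d ON d.customer_key = f.customer_key
GROUP BY d.region
"""

# Denormalized: region is carried on every fact row. The repeated values
# compress extremely well column by column, and the join disappears.
DENORMALIZED_QUERY = """
SELECT region, SUM(amount) AS revenue
FROM sales_fact
GROUP BY region
"""
```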
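For the slowly-changing-dimension point, a small self-contained Python sketch contrasting the two approaches; the customer dimension and its columns are invented for illustration:

```python
from datetime import date

def apply_type2(dim_rows, customer_id, new_city, change_date):
    # Type 2: expire the current row and append a new one - the table grows with every change.
    for row in dim_rows:
        if row["customer_id"] == customer_id and row["current"]:
            row["valid_to"] = change_date
            row["current"] = False
    dim_rows.append({"customer_id": customer_id, "city": new_city,
                     "valid_from": change_date, "valid_to": None, "current": True})

def apply_type3(dim_rows, customer_id, new_city, change_date):
    # Type 3: update in place, keeping only the prior value in an extra column - no new rows.
    for row in dim_rows:
        if row["customer_id"] == customer_id:
            row["previous_city"] = row["city"]
            row["city"] = new_city
            row["city_changed_on"] = change_date

type2_dim = [{"customer_id": 1, "city": "Austin", "valid_from": date(2021, 1, 1),
              "valid_to": None, "current": True}]
type3_dim = [{"customer_id": 1, "city": "Austin", "previous_city": None,
              "city_changed_on": None}]

apply_type2(type2_dim, 1, "Denver", date(2023, 6, 1))
apply_type3(type3_dim, 1, "Denver", date(2023, 6, 1))
print(f"type 2: {len(type2_dim)} rows (grows per change); type 3: {len(type3_dim)} row (fixed)")
```

The trade-off is that type 3 keeps only the previous value rather than the full history; the claim above is that in a columnar store the extra, highly compressible column is usually the cheaper price to pay.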
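For the bulk-load point, a hedged sketch assuming a Postgres-compatible endpoint and the psycopg2 driver (the DSN and table are placeholders; real MPP warehouses typically prefer their own native bulk path, such as COPY from object storage):

```python
import io
import psycopg2  # assumption: a Postgres-compatible endpoint reachable with this driver

conn = psycopg2.connect("host=warehouse dbname=dw user=etl")  # placeholder DSN
rows = [(1, "2023-06-01", 19.99), (2, "2023-06-01", 5.00)]    # toy fact rows

with conn, conn.cursor() as cur:
    # Avoid: one statement per row - each insert is a round trip and a tiny write.
    # cur.executemany("INSERT INTO sales_fact VALUES (%s, %s, %s)", rows)

    # Prefer: a single bulk operation that streams the whole batch through COPY.
    buf = io.StringIO("".join(f"{k}\t{d}\t{amt}\n" for k, d, amt in rows))
    cur.copy_expert("COPY sales_fact FROM STDIN", buf)
```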
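And for the delete-consolidation point, the drop-and-reload pattern sketched as a sequence of statements; the exact CREATE TABLE AS / RENAME syntax varies by engine, and the table names are invented:

```python
# Illustrative only - CTAS and rename syntax differs across MPP engines.
REBUILD_STEPS = [
    # 1. Write the surviving rows to a new table in one bulk pass (no in-place deletes).
    """CREATE TABLE sales_fact_new AS
       SELECT * FROM sales_fact
       WHERE sale_date >= DATE '2022-01-01'""",
    # 2. Swap the rebuilt table into place.
    "DROP TABLE sales_fact",
    "ALTER TABLE sales_fact_new RENAME TO sales_fact",
]

for statement in REBUILD_STEPS:
    print(statement)  # in practice: execute each statement inside a maintenance window
```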
Finally, it is best to try out different distribution strategies and figure out which approach gives the best performance. Record the performance of each approach for a set of use cases; over time this becomes a reference set for future requirements in your particular environment.