Data Modeling with MPP Columnar Stores

MPP (Massively Parallel Processing) columnar stores offer certain data modeling advantages:

  1. Grain – Typically the grain of the fact table is set at the level you want reports to drill down to; this balances the performance and storage needs of the analytical database. With an MPP database, performance can be scaled out, and storage costs have fallen steadily. This gives us the ability to store fact data at the lowest grain even if current needs don’t require that level of detail. The columnar format also lends itself to compression, which we can leverage to reduce storage consumption.
  2. Distribution strategy – This is by far the most important aspect of a distributed parallel system. If all of the data lands on one node, you are not taking advantage of the rest of the nodes, so how you distribute the data is the single most important factor in deriving value from an MPP database. Some common-sense guidelines to consider when distributing the data:
    • Do not distribute on columns that commonly appear in WHERE clauses (i.e., filter columns), since filtering on the distribution key can exclude some of the nodes from participating at query execution time.
    • Do not use dates as a distribution key. This divides the data by day (or whatever time unit you pick), and since analytical queries usually filter on a time range, reads will concentrate on a few nodes and perform poorly.
    • You can always add nodes, so choose columns with high cardinality (a large number of distinct values). If you have a 30-node cluster and the column you choose as your distribution key has only 10 distinct values, the data will be written to only 10 of the 30 nodes – a super simplistic view, but you get the point.
    • If you don’t have a single high-cardinality column, consider using a multi-column distribution key.
  3. Denormalization is good – add dimension values to the fact table. This avoids joins, and columnar compression helps keep the size manageable.
  4. Slowly changing dimensions can be handled by adding another column (type 3) instead of a new row (type 2), which is considered the better approach here.
  5. With columnar stores, bulk loading is much more efficient, so use bulk loads wherever possible and avoid single-row inserts.
  6. Be very careful about updating distribution key columns: changing a key value moves the row to a different node and can introduce skew.
  7. Avoid deletes unless absolutely needed, and in such cases consolidate them – it is often better to drop the table and bulk load the data again.
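As a concrete illustration of the distribution guidance above, here is a sketch of a fact table distributed on a high-cardinality key. This uses Azure Synapse-style syntax (`DISTRIBUTION = HASH`) purely as one example – Redshift uses `DISTSTYLE KEY DISTKEY(...)`, and other MPP products have their own equivalents. All table and column names are invented for illustration.

```sql
-- Synapse-style DDL sketch; Redshift would use DISTSTYLE KEY DISTKEY(customer_id).
-- customer_id is assumed to be high cardinality, so rows hash evenly across
-- nodes; sale_date stays an ordinary column and is NOT the distribution key.
CREATE TABLE fact_sales
(
    sale_id     BIGINT        NOT NULL,
    customer_id BIGINT        NOT NULL,  -- distribution key: many distinct values
    sale_date   DATE          NOT NULL,  -- filter column: do not distribute on this
    product_id  BIGINT        NOT NULL,
    quantity    INT           NOT NULL,
    amount      DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (customer_id),
    CLUSTERED COLUMNSTORE INDEX
);
```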

Finally, it is best to try out different distribution strategies and figure out which approach gives the best performance. You can record the performance of each approach for a set of use cases, and this can eventually become a reference set for future requirements in your particular environment.
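The bulk-load and consolidated-delete advice above can be sketched in PostgreSQL-flavored SQL (Postgres's bulk-load path is `COPY`; MPP products have their own equivalents, such as Redshift's `COPY` from S3). The file path and table names here are placeholders.

```sql
-- Bulk load instead of row-by-row INSERTs: one COPY statement
-- replaces millions of single-row round trips.
COPY fact_sales FROM '/data/staging/sales_2024.csv' WITH (FORMAT csv, HEADER true);

-- Instead of many small DELETEs, rebuild: keep only the rows you want,
-- then swap the tables in place of deleting from the original.
CREATE TABLE fact_sales_new AS
SELECT * FROM fact_sales WHERE sale_date >= DATE '2023-01-01';

DROP TABLE fact_sales;
ALTER TABLE fact_sales_new RENAME TO fact_sales;
```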

Introduction to PostgreSQL

PostgreSQL is an open source database that is giving quite a competition to the likes of Oracle, SQL Server, and other vendor databases out there. This series of posts will delve into PostgreSQL and its features.

PostgreSQL comes with a fairly extensive set of data types, and you can add your own using the CREATE TYPE statement. A few notable examples: json for textual JSON, jsonb for JSON in a binary form, cidr for IP network addresses, and macaddr for MAC addresses. Postgres actually creates a composite type for every table you define. Third-party providers use this feature to build domain-specific constructs that are efficient and performant.
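A minimal sketch of creating and using a custom composite type (the type, table, and column names here are invented for illustration):

```sql
-- A composite type, created explicitly...
CREATE TYPE inventory_item AS (
    name        text,
    supplier_id integer,
    price       numeric
);

-- ...used as a column type. Postgres also creates an implicit
-- row type for every table it defines, as noted above.
CREATE TABLE orders (id serial PRIMARY KEY, item inventory_item);

INSERT INTO orders (item) VALUES (ROW('widget', 42, 9.99));

-- Access fields of the composite with (column).field syntax:
SELECT (item).name, (item).price FROM orders;
```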

Postgres has a fairly complex security mechanism and is a full-fledged database, so it is not really suited for embedded or low-footprint solutions.

You can use psql as its command line, pgAdmin as a GUI, and phpPgAdmin as a web-based GUI tool for administration.

In some pgAdmin installations you may run into a problem where the web page says it is loading indefinitely. This is caused by a JavaScript issue: you need to update the registry key for the .js content type from plain text to JavaScript. The specifics are on the pgAdmin website.

Some interesting things about Postgres:

  1. Tables are inheritable – since Postgres creates a composite type for each table, you can treat tables like classes.
  2. You can update a view as long as it is derived from a single table.
  3. Extensions are like packages, and you can build on existing extensions to create new ones. It is best to create a separate schema for an extension: since it installs all of its objects, keeping them separate keeps things tidy.
  4. Functions can be created using procedural languages (PLs), and stored procedures are also called functions. The default languages for functions are SQL, PL/pgSQL, and C; you can add additional languages (using extensions, of course!).
  5. Operators are symbolically named aliases for functions; you can assign special meaning to symbols such as *, &, +, etc.
  6. Foreign tables are virtual tables linked to data outside the database, such as flat files, web services, or a table in another database. Postgres implements the SQL/MED (Management of External Data) standard. Foreign Data Wrappers (FDWs) for different data sources are already implemented, and once the extension is installed it is available for use. See this link for implementing an FDW to Oracle.
  7. Triggers are special functions that have access to special variables storing the row data before and after the triggering event.
  8. Catalogs are system schemas that store built-in functions and metadata.
  9. FTS – full text search is natural-language search; see the image above for the components associated with it: FTS configurations, FTS dictionaries, FTS parsers, and FTS templates.
  10. Types – Postgres has composite data types and we can make new ones too; my instance has 91 of these.
  11. Cast – used to convert data from one data type to another; this can be implicit or explicit.
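A few of the items above – table inheritance, updatable single-table views, and explicit casts – can be sketched as follows; all object names are invented for illustration:

```sql
-- Table inheritance: cities gets all columns of places plus its own.
CREATE TABLE places (name text, population int);
CREATE TABLE cities (elevation int) INHERITS (places);

-- An updatable view: derived from a single table, so UPDATEs against
-- the view write through to places.
CREATE VIEW big_places AS
SELECT name, population FROM places WHERE population > 100000;

UPDATE big_places SET population = population + 1 WHERE name = 'Metropolis';

-- Explicit casts: standard CAST syntax and the Postgres :: shorthand.
SELECT CAST('123' AS integer);
SELECT '123'::integer;
```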