Data Vault – Hubs

In the previous post we looked at high level introduction to Data vault. In this post we can look at the Hub entity in detail. As described earlier Hubs capture the Business Key for the business entity its representing . The business key can be composite key . The hub tracks the arrival of a new business key in the data warehouse and as such it needs metadata to go along with it. So it captures the source system called as the record source and the date/time stamp called as the load date. In addition it generates a Hash Key that is based on the business key. Its this hash key that gets loaded in the corresponding This is an important step, since when you open up a hub table in a data vault based design , you will see all these hash keys that’s not pleasant to look at , but it has a lot of functional advantages that you don’t get when you use a typical sequence id or surrogate id.

So a typical Hub would look like this , in this case I have a modeled the employee hub that I talked about in the previous post.

Last seen date is an optional attribute

Dan Linstedt recommends that the load date and record source be kept at the beginning of the entity , just to keep the design clean . All hubs start with the same attribute and makes the maintenance of the data vault easier.

One of the key elements is to identify the Business keys , its a good practice to select keys that are common across all operational system . in the case of an employee , the employee id may be the same across Payroll, Time Reporting , HR , etc , but each of these systems may be generating a surrogate id that specific to each system and may not hold meaning outside of the system. So its important to stick with globally unique business keys even if its a composite id. Do not use surrogate ids.

There should be a unique index on the business key . If its a composite key , we are free to merge it into a single field or split it into separated fields with the unique index spanning across the fields. if needed , we can store the single field and the split fields together in the same hub as well.

Hash key is the hash of the business key and can be generated on any system as long we use the same hash method (MD5 etc ) across the organization and this becomes the primary key of the hub entity and is used as the foreign key to reference entities such as links and satellites.

The load date is system generated and indicates when the business key initially arrived in the data warehouse.

The record source should point to where the business key is being derived from and should be as granular as possible to give as much transparency and auditability

The last seen date is really to maintain when was the last time the business key was observed in the source systems. With regulations such as GDPR , where we need to delete records from any system , its a good idea to implement this field, since any business keys that dont show up for an agreed upon time can be deleted after the last seen date + window is exceeded.

in the next post we will look at link entities in detail