Data Vault – Links

Links connect Hubs and represent relationships or transactions. A Link therefore contains the hash key of each connected Hub along with some metadata. So in the case of the Employee and Department hubs, we would have an EmpDeptLink with the following fields.
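A minimal T-SQL sketch of such a link table (the table and column names, types, and MD5-style CHAR(32) hash columns are illustrative assumptions, not a prescribed layout):

CREATE TABLE EmpDeptLink (
    EmpDeptHashKey  CHAR(32)     NOT NULL,  -- hash over the combined business keys (primary key)
    LoadDate        DATETIME2    NOT NULL,  -- when this relationship first arrived in the warehouse
    RecordSource    VARCHAR(100) NOT NULL,  -- originating system
    EmployeeHashKey CHAR(32)     NOT NULL,  -- hash key of the Employee hub
    DeptHashKey     CHAR(32)     NOT NULL,  -- hash key of the Department hub
    CONSTRAINT PK_EmpDeptLink PRIMARY KEY (EmpDeptHashKey)
);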

As you can see, the link table stores the hash keys of the Employee hub table and the Department hub table.

The primary key of the link table is the hash key, which is really the hash of the combination of all the business keys in the link.

Just like Hubs, the load date and the record source are the only two metadata fields added to the link table.

Notice there is no slowly changing dimension logic built into links or hubs; that history is captured in the satellite entities.

A Link consists of two or more foreign keys. These can be hash keys from Hubs or from other Links. The primary key of a Link table is the hash calculated over the combination of the business keys behind those foreign keys. The foreign keys are, of course, hash values themselves, because they reference the hash keys of the Hub tables.

There are two other optional fields that may be added to the link table. One is the last seen date, the logic for which was described in the post on Hubs, and the other is the dependent child key. In the case of a customer placing an order, the order line number would be a dependent child key, since it affects the grain of the data (quantity, amount, etc.). This is also called a degenerate field. Such a field cannot stand on its own like a hub: it has no meaning outside its context and no descriptors of its own. The dependent child key is also an identifying element of the link structure, so the hash key is derived from the business keys of the referenced hubs plus the dependent child key.
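As a sketch, assuming MD5 hashing and hypothetical staging column names, the hash key for an order-line link could be derived from the hub business keys plus the dependent child key like this:

-- Link hash key = hash of the hubs' business keys plus the dependent child key
SELECT CONVERT(CHAR(32),
       HASHBYTES('MD5',
           UPPER(LTRIM(RTRIM(CustomerId))) + '|' +
           UPPER(LTRIM(RTRIM(OrderId)))    + '|' +
           CAST(OrderLineNumber AS VARCHAR(10))), 2) AS OrderLineLinkHashKey
FROM StagingOrderLines;  -- hypothetical staging table

The delimiter and the upper-casing/trimming are common conventions to keep the hash deterministic across systems; they are assumptions here, not something this post prescribes.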

The link table does not have any descriptive information, so in the above example the link table holds no information about the line item quantity or price; those details are stored in the satellite attached to the link. A Link acts like a bridge table representing transactions between the hubs, and it essentially implements a many-to-many relationship between hubs (a one-to-many relationship being a subset of many-to-many). A Link should go down to the lowest level of detail, and this establishes the grain of the data warehouse; in a modern data warehouse it is best to always go down to the lowest available grain.

We will look at Satellites next.

Data Vault – Hubs

In the previous post we looked at a high-level introduction to Data Vault. In this post we look at the Hub entity in detail. As described earlier, a Hub captures the business key for the business entity it represents. The business key can be a composite key. The hub tracks the arrival of a new business key in the data warehouse, and as such it needs metadata to go along with it. So it captures the source system, called the record source, and a date/time stamp, called the load date. In addition, it generates a hash key based on the business key. It is this hash key that gets loaded into the corresponding links and satellites. This is an important step: when you open up a hub table in a data vault based design, you will see all these hash keys, which are not pleasant to look at, but they bring a lot of functional advantages that you don't get with a typical sequence id or surrogate id.

So a typical Hub would look like this; in this case I have modeled the Employee hub that I talked about in the previous post.
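A minimal T-SQL sketch of such a hub (names and types are illustrative assumptions):

CREATE TABLE EmployeeHub (
    EmployeeHashKey CHAR(32)     NOT NULL,  -- hash of the business key (primary key)
    LoadDate        DATETIME2    NOT NULL,  -- when the business key first arrived
    RecordSource    VARCHAR(100) NOT NULL,  -- originating system
    EmployeeId      VARCHAR(20)  NOT NULL,  -- the business key itself
    LastSeenDate    DATETIME2    NULL,      -- optional, see below
    CONSTRAINT PK_EmployeeHub PRIMARY KEY (EmployeeHashKey)
);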

The last seen date is an optional attribute.

Dan Linstedt recommends that the load date and record source be kept at the beginning of the entity, just to keep the design clean. All hubs then start with the same attributes, which makes maintenance of the data vault easier.

One of the key steps is to identify the business keys. It is good practice to select keys that are common across all operational systems. In the case of an employee, the employee id may be the same across Payroll, Time Reporting, HR, etc., but each of these systems may also generate a surrogate id that is specific to that system and holds no meaning outside of it. So it is important to stick with globally unique business keys, even composite ones. Do not use surrogate ids.

There should be a unique index on the business key. If it is a composite key, we are free to merge it into a single field or split it into separate fields with the unique index spanning across them. If needed, we can store both the single field and the split fields together in the same hub.
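For example, assuming a hypothetical case where the employee id is only unique per company, the composite business key could be enforced with a unique index spanning both fields:

CREATE UNIQUE INDEX UX_EmployeeHub_BusinessKey
    ON EmployeeHub (CompanyCode, EmployeeId);  -- CompanyCode is an illustrative extra key part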

The hash key is the hash of the business key. It can be generated on any system, as long as we use the same hash method (MD5, etc.) across the organization. It becomes the primary key of the hub entity and is used as the foreign key referenced by entities such as links and satellites.
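A sketch of generating the hash key, assuming MD5 and a hypothetical staging table (upper-casing and trimming are common conventions to keep the hash deterministic, not something prescribed here):

SELECT CONVERT(CHAR(32),
       HASHBYTES('MD5', UPPER(LTRIM(RTRIM(EmployeeId)))), 2) AS EmployeeHashKey
FROM StagingEmployees;

Because MD5 gives the same output for the same input everywhere, the key can be generated independently on any system and the values will still match.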

The load date is system generated and indicates when the business key initially arrived in the data warehouse.

The record source should point to where the business key is derived from and should be as granular as possible, to give as much transparency and auditability as possible.

The last seen date records the last time the business key was observed in the source systems. With regulations such as GDPR, where we may need to delete records from any system, it is a good idea to implement this field: any business key that does not show up for an agreed-upon period can be deleted once the last seen date plus that window is exceeded.
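As a sketch of the cleanup this enables (the 90-day window stands in for whatever period is agreed upon, and in practice the satellites and links referencing these keys would need the same treatment):

-- Remove business keys not seen in any source within the retention window
DELETE FROM EmployeeHub
WHERE LastSeenDate < DATEADD(DAY, -90, SYSDATETIME());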

In the next post we will look at Link entities in detail.

Data Vault – Introduction

Data Vault is one of the newer data modeling approaches, and it is designed to support agility and scale. Typical data warehouse design approaches require a lot of changes at the 3NF layer to conform data coming from multiple sources. Data Vault aims at building this layer more efficiently by keeping changes to existing structures to a minimum. In this post and the next series of posts we will look at how to use the data vault approach.

Data Vault focuses on using business keys to create a business-centric model for data warehousing. This makes it easier to represent information the same way the business integrates, connects, and accesses it.

There are three basic entities derived from the source systems: Hubs, Links, and Satellites. Let's look at each of them in detail.

Hubs – The first step in a data vault design is to think about what defines a business entity and what the corresponding business key is. For example, this could be a User with the business key being the user id, or in the case of an employee it would be the employee id. This key uniquely identifies the entity, and this business key goes into the Hub. The hub stores only the business key and some metadata (we will get into the metadata later). In the example below, we would store just the user id or the employee id in the hub table. Other attributes like first name, last name, age, etc. go into another entity called the satellite.

Links – Hubs are linked to each other to represent transactions or relationships in the real world. Links are entities that tie Hubs (business entities) together. For example, employees belong to a department; in this case a link entity joins the Department hub to the Employee hub and depicts a relationship. Let's take another example: a user could access a web page and add a product to the shopping cart. In this case the User hub can be linked to the Product hub and the Order hub with a link entity. This link entity represents the transaction.

Satellites – These are separate entities that add more business context to Hub and Link entities. The Employee hub, for example, captures only the business key (the employee id), but there is a whole set of employee attributes, like name, age, gender, title, pay, etc., that needs to be captured as well. This is where all of that information goes. Links also have their own satellites. For example, in the case of the User–Product link entity, the satellite connected to the link captures the details of the specific transaction: the date the product was added to the cart, the quantity, the price, or any other details of the transaction. In the case of a link entity depicting a relationship, like the employee-to-department example, the satellite connected to that link shows the date the employee started with the particular department, the role in that department, and any other contextual information required.
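A minimal T-SQL sketch of an employee satellite (the names, the types, and the common hash key plus load date primary key convention are illustrative assumptions):

CREATE TABLE EmployeeSat (
    EmployeeHashKey CHAR(32)     NOT NULL,  -- hash key of the Employee hub
    LoadDate        DATETIME2    NOT NULL,  -- when this version of the attributes arrived
    RecordSource    VARCHAR(100) NOT NULL,  -- originating system
    FirstName       VARCHAR(50)  NULL,
    LastName        VARCHAR(50)  NULL,
    Age             INT          NULL,
    Gender          VARCHAR(10)  NULL,
    Title           VARCHAR(50)  NULL,
    CONSTRAINT PK_EmployeeSat PRIMARY KEY (EmployeeHashKey, LoadDate)
);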

In conclusion, you just have to remember this:

Hubs -> Business keys

Links -> Relationships / Transaction data

Satellites -> Attributes / descriptions for the above two

This post provides a very high-level introductory overview of Data Vault. I will get into more detail in subsequent posts.

Azure Data Factory Basics

These are the basic building blocks of Azure Data Factory:

Pipeline – a logical group of activities, e.g. Get Metadata, Copy activity, Lookup activity, etc.

Linked service – a connection to a resource; you create one by clicking Manage -> Linked services.

Dataset – refers to the data used in the activities; you need a linked service to enable the connection to the data.

Triggers – define how and when the pipeline is executed.

In-Memory Databases

Notes from reading the paper "Main Memory Database Systems".

There are two kinds of databases: memory-resident database systems and disk-resident databases (DRDB). If the cache of a DRDB is large enough, copies of the data will be in memory at all times, but this does not take full advantage of the memory. The index structures (B-trees) are designed for disk access, even though the data is in memory. Also, applications may have to access data through a buffer manager as if the data were on disk: every time an application wishes to access a given tuple, its disk address has to be computed and the buffer manager invoked to check whether the corresponding block is in memory; once the block is found, the tuple is copied into an application tuple buffer where it is actually examined. In a memory-resident database, you can just access the tuple directly by its memory address. Newer applications convert a tuple or object into an in-memory representation and give applications a direct pointer to it; this is called swizzling.

With regards to locking for concurrency control: since access times to memory are fast, the period for which a lock is held is very short as well, so there is no significant advantage to small lock granules (a specific cell or column) as opposed to the entire table. In extreme cases the lock granule can be the entire database, making execution serial, which is highly desirable since the costs of concurrency control (setting locks, releasing locks, coping with deadlock, CPU cache flushes, etc.) are almost eliminated. In a disk-based system the locks are kept in a hash table, with the disk copy holding no lock information; in a memory-resident database system this information can be coded into the object itself, with a bit or two reserved for it.

For an in-memory database, if there is a need to write a transaction log to disk, that presents a bottleneck. There are different approaches to this problem: carve out some memory to hold the log and flush it at the end of the transaction, or do group commits when the page is full, etc.

In a main memory database, index structures like B-trees, which are designed for block-oriented storage, lose much of their appeal. Hashing provides fast lookup and update, but may not be as space-efficient as a tree; the T-tree is designed specifically for memory-resident databases. Since pointers are of uniform size, we can use fixed-length structures for building indexes that rely on pointers. With an in-memory database, query processing techniques that assume sequential access also lose their appeal; for example, in sort-merge join processing there is no need to sort, because random access is just as fast.

The rest of the paper covers the different attempts at in-memory database systems, with the specific characteristics of each. Overall, a great introduction to in-memory databases from a historical perspective, and still very relevant, since I have not seen much commercialization of this kind of DB other than HANA, which is terribly expensive.

Oracle Storage Blocks

Oracle stores data in blocks. Blocks are usually 8K in size; you can change this, but it's best to leave the default. Blocks make up extents, and extents make up segments, which are the primary unit when working with partitions and tables.

Blocks contain a header, which as expected is located at the start of the block, and row data, which starts at the bottom of the block and works its way back up.

PCTFREE governs how much of the space in a block can be used before the block is considered full. Its purpose is to reserve free space for future updates to existing rows, which helps avoid row migration when updates happen.
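For example (Oracle syntax; the table definition and the 20 percent value are just an illustration):

CREATE TABLE employees (
    employee_id NUMBER,
    last_name   VARCHAR2(50)
) PCTFREE 20;  -- keep 20% of each block free for future updates to existing rows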

A ROWID defines how the database looks up a row; it consists of the data object number, the datafile number, the block number within the file, and the row number within the block.

sp_help

One of the ways you can quickly look up a stored procedure definition in SQL Server is sp_helptext procname. This prints the procedure's code as lines of up to 255 characters each.

If you are trying from a remote machine, try this instead:

EXEC  [ServerName].[DatabaseName].dbo.sp_HelpText 'ProcName'

Another helpful stored procedure is sp_help. You can inspect a table by running it with the table name as a parameter.

Note that when you drag and drop in SSMS, it may give you syntax like this:

[procfwk].[AlertOutcomes]

You want to change it to [procfwk.AlertOutcomes] for it to work, since sp_help expects the whole two-part name as a single parameter:

sp_help [procfwk.AlertOutcomes]

This gives a detailed layout of the table.

You can also find out dependencies using sp_depends:

sp_depends [procfwk.AlertOutcomes]

sp_who or sp_who2 shows all sessions, but sp_who 'active' shows only the active users.