Data vault: Hash keys in staging table - advanced

1.5k Views Asked by At

I'm looking into using the datavault 2.0 methodology. I understand the reasons for hashing and trying to apply it. I'd like to apply this in the "staging" phase of the datavault rather than loading it into the DV.

If a table has a business key in it, then it's easy to simply apply it to that table (probably becomes a hub). But there are tables like "orderdetail" (which probably become links) which have multiple references to other elements through a surrogate key.

Should the staging table contain both the surrogate sequence for each foreign key along with a hash for the referenced entity BK?

Example: If I have an order table with a customerId surrogate sequence, but the customer table has a CUST-000xxx reference that is used as the BK, should I perform a 'join' between order and customer to resolve the "CUST-000xxx" so I can hash it and include it in the order staging table?

I was thinking that this could potentially be resolved when loading the data in the DV from the staging area, but the customer reference may not exist in the staging area at that particular moment in time, because the order may simply be a new order for an existing customer that didn't change.

DV 2.0 specifies that all of this business with hashes is done to increase performance and simply load the data in parallel without expensive lookups in the DV itself. Hence the question how this is usually resolved.

Example added down here:

order - orderid - customerid - order_ref - salespersonid

customer - customerid - customer_ref

person - personid - full_name - login

In order to populate order, should I do a join in the source database like so:

SELECT
   hash_func(o.order_ref) as hash_key_order,
   hash_func(c.customer_ref) as hash_key_customer,
   hash_func(p.login) as hash_key_person,
   o.orderid, 
   c.customerid,
   p.login
FROM
   order o inner join customer c on o.customerid = c.customerid
   inner join person p on o.salespersonid = p.personid

or is the resolution for foreign keys resolved in the datavault, so the query is simpler like:

SELECT
   hash_func(o.order_ref) as hash_key_order,
   o.orderid, 
   c.customerid,
   p.personid
FROM
   order o

This is not clear to me. What I understand is that expensive lookups are avoided through hashing, so not generating the hash on staging for foreign keys do not contribute to performance otherwise?

2

There are 2 best solutions below

4
tobi6 On

I am not really sure why there is a complex SELECT statement to populate order. Also I think there might be some mixup regarding the paradigmas of Data Vault. What you want to do with Data Vault is to read all data from the source systems.

This means first you would load the table Order into the core DWH which seems to be modelled with Data Vault. Then you would do the same with Customer, Person and so on. Up until all data which you need later for your statment to work is in the core DWH.

Every entity would have it's own hash key depending on the entity. E.g. for the Order table this might be the id.

Now, when everything is loaded into Data Vault, you can re-create your business rule on top of the data. Meaning, if you use surrogate keys, you might need to recreate them. If the surrogate keys are created beforehand in the database and they are of business value, take them.

But this depends on the use of the ids. As I commented before, you first need to know how the business handles the cases you provided. Then you load every data on it's own into the Data Vault. And then you go on an recreate the statement you have added as an example.

So:

  • Copy the data
  • Recreate the business rules (what happens if there is an order without a customer)
  • Create views / persistant tables
  • Use the data
2
pcd On

The problem is that you are not exporting your BK. You are exporting the surrogate keys. Change o.orderid to o.order_ref etc in your queries and everything should fit into place. Unfortunately people don't understand the reasons behind ID's. They are an internal database element used for performance and management purposes, and have NOTHING to do with the BUSINESS.

pcd