I'm looking into using the datavault 2.0 methodology. I understand the reasons for hashing and trying to apply it. I'd like to apply this in the "staging" phase of the datavault rather than loading it into the DV.
If a table has a business key in it, then it's easy to simply apply it to that table (probably becomes a hub). But there are tables like "orderdetail" (which probably become links) which have multiple references to other elements through a surrogate key.
Should the staging table contain both the surrogate sequence for each foreign key along with a hash for the referenced entity BK?
Example: If I have an order table with a customerId surrogate sequence, but the customer table has a CUST-000xxx reference that is used as the BK, should I perform a 'join' between order and customer to resolve the "CUST-000xxx" so I can hash it and include it in the order staging table?
I was thinking that this could potentially be resolved when loading the data in the DV from the staging area, but the customer reference may not exist in the staging area at that particular moment in time, because the order may simply be a new order for an existing customer that didn't change.
DV 2.0 specifies that all of this business with hashes is done to increase performance and simply load the data in parallel without expensive lookups in the DV itself. Hence the question how this is usually resolved.
Example added down here:
order - orderid - customerid - order_ref - salespersonid
customer - customerid - customer_ref
person - personid - full_name - login
In order to populate order, should I do a join in the source database like so:
SELECT
hash_func(o.order_ref) as hash_key_order,
hash_func(c.customer_ref) as hash_key_customer,
hash_func(p.login) as hash_key_person,
o.orderid,
c.customerid,
p.login
FROM
order o inner join customer c on o.customerid = c.customerid
inner join person p on o.salespersonid = p.personid
or is the resolution for foreign keys resolved in the datavault, so the query is simpler like:
SELECT
hash_func(o.order_ref) as hash_key_order,
o.orderid,
c.customerid,
p.personid
FROM
order o
This is not clear to me. What I understand is that expensive lookups are avoided through hashing, so not generating the hash on staging for foreign keys do not contribute to performance otherwise?
I am not really sure why there is a complex SELECT statement to populate order. Also I think there might be some mixup regarding the paradigmas of Data Vault. What you want to do with Data Vault is to read all data from the source systems.
This means first you would load the table
Orderinto the core DWH which seems to be modelled with Data Vault. Then you would do the same withCustomer,Personand so on. Up until all data which you need later for your statment to work is in the core DWH.Every entity would have it's own hash key depending on the entity. E.g. for the
Ordertable this might be theid.Now, when everything is loaded into Data Vault, you can re-create your business rule on top of the data. Meaning, if you use surrogate keys, you might need to recreate them. If the surrogate keys are created beforehand in the database and they are of business value, take them.
But this depends on the use of the ids. As I commented before, you first need to know how the business handles the cases you provided. Then you load every data on it's own into the Data Vault. And then you go on an recreate the statement you have added as an example.
So: