How to properly partition by hash key in Spark (to achieve co-partitioning)?


I am implementing a Data Vault hub. It has some number of columns, two of them being:

Company_ID, md5(Company_ID).alias("HK_company_ID")
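For context, the staging step that creates the hash key looks roughly like this (stage_df is a placeholder name):

from pyspark.sql import functions as F

stage_df = stage_df.select(
    "Company_ID",
    # cast to string so the ID hashes consistently whatever its type
    F.md5(F.col("Company_ID").cast("string")).alias("HK_company_ID"),
)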

The hash is created at the staging level, and I want to load my data from stage into the Data Vault. I select these two columns (Company_ID and the precalculated hash) and then want to load new companies from stage into the hub. I want the hub to be partitioned by HK_company_ID. How do I achieve this so that I can perform efficient co-partitioned joins?
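One option I've come across is bucketing the hub on the hash key when writing, roughly like the sketch below (the table name and bucket count are placeholders), but I'm not sure this is the intended approach:

# Bucketing (unlike repartition) is recorded in the table metadata,
# so two tables bucketed on the same key with the same bucket count
# can be joined without a shuffle. It requires saveAsTable.
(stage_df.write
    .mode("append")
    .bucketBy(64, "HK_company_ID")
    .sortBy("HK_company_ID")
    .saveAsTable("hub_company"))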

Apart from bucketing, the only thing I've found is repartition, which is supposed to use hash partitioning under the hood.
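As far as I can tell from the docs, repartition(n, col) assigns each row to partition pmod(murmur3_hash(col), n). Here is a small check I tried (df and the partition count 200 are placeholders):

from pyspark.sql import functions as F

df = df.repartition(200, "HK_company_ID")

# Sanity check: F.hash()/SQL hash() is the same Murmur3 hash (seed 42)
# that repartition uses, so expected and actual partition ids should match.
df.select(
    F.spark_partition_id().alias("actual"),
    F.expr("pmod(hash(HK_company_ID), 200)").alias("expected"),
).show()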

  1. But if it does hash partitioning under the hood, should I use it on HK_company_ID or just Company_ID? Maybe there isn't even a need to create the hash key at the staging level? I'm not that familiar with Data Vault.
  2. Will it even work on HK_company_ID then? Would it hash the hashes? Or do a modulo on the already precalculated hashes?
  3. Regardless of the column that I repartition by, how do I implement it within my script? Right now I do this:
Roughly, in PySpark:
df1 = stage_df  # staged data: Company_ID + precalculated HK_company_ID
df2 = hub_df    # existing hub, or None on the first load

if df2 is None:  # first load
    df_diff = df1
else:
    # keep only companies whose hash key is not yet in the hub
    df_diff = df1.join(df2, on="HK_company_ID", how="left_anti")

df_diff.write.mode("append").parquet(hub_location)

Where do I fit repartitioning within this code?
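My best guess is that it goes right before the append, something like this (200 is an arbitrary placeholder):

df_diff = df_diff.repartition(200, "HK_company_ID")
df_diff.write.mode("append").parquet(hub_location)

But as far as I can tell, a plain Parquet write doesn't record that partitioning anywhere, so a later read wouldn't know the data is hash-partitioned, which is why I'm asking.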

  4. Maybe there's some other option, like partitioning by a modulo or some range on the hash column? (See the sketch below.)
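For example, something like deriving a bucket column from the hex hash and partitioning the output files by it (64 buckets is arbitrary):

from pyspark.sql import functions as F

bucketed = df_diff.withColumn(
    "bucket",
    # first 8 hex chars of the MD5 -> long -> modulo the bucket count
    F.conv(F.substring("HK_company_ID", 1, 8), 16, 10).cast("long") % 64,
)

bucketed.write.mode("append").partitionBy("bucket").parquet(hub_location)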

I'd like to stick to the DataFrame API if possible.

Thanks a lot
