I am trying to save compute in a Python transform in Foundry.
I want to run my code incrementally while maintaining a unique set of keys, without having to do a full snapshot read of the entire dataset and then de-duplicate it.
If I try something like `df_out = df.select("composite_key").dropDuplicates()`, I'm afraid it reads the full input dataset; I want to reuse the deduplication I have already done on previous builds.
The trick here is to use the previous version of the output dataset:
With this pattern you don't need to do a lookup on the full dataset. You take the previously computed unique set of keys, union the newly appended rows onto it, and then de-dupe the result.
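The per-build logic reduces to: previous unique keys ∪ new batch → de-dupe. Here is a minimal pure-Python model of that step (dataset paths, key values, and the function name are illustrative, not from the original post):

```python
def updated_unique_keys(previous_keys, new_batch):
    """Merge a new batch of composite keys into the previously
    de-duplicated key set.

    previous_keys: keys already written to the output dataset
                   (guaranteed unique by the prior build).
    new_batch:     composite keys from rows appended since the
                   last build (may contain duplicates, both
                   internally and against previous_keys).
    """
    # Union the old unique set with the new rows, then de-dupe.
    # This touches only the previous output plus the new batch,
    # never the full history of the input dataset.
    return set(previous_keys) | set(new_batch)


# Simulated incremental builds: each build sees only new rows.
build_1 = updated_unique_keys(set(), ["a|1", "b|2", "a|1"])
build_2 = updated_unique_keys(build_1, ["b|2", "c|3"])
```

In an actual Foundry transform the same shape applies with Spark DataFrames: under the `@incremental()` decorator the input's `.dataframe()` returns only the newly appended rows, the previous build's output can be read back (via something like `out.dataframe('previous', schema)`), and after `out.set_mode('replace')` you write `previous.unionByName(new_keys).dropDuplicates()`. Treat those API names as my recollection of the incremental transforms API and check them against the Foundry docs.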