I am trying to save compute in a Python transform in Foundry.
I want to run my code incrementally while maintaining a unique set of keys, without having to do a full snapshot read of the entire dataset and then de-duplicate it.
If I try something like `df_out = df.select("composite_key").dropDuplicates()`, I'm afraid it reads the full input dataset; I want to reuse the deduplication I have already done on previous builds.
The trick here is to use the previous version of the output dataset:
With this pattern you don't need to do a lookup on the full dataset. You take the previously computed unique set of keys, union the newly appended rows onto it, and then de-dupe the result.
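The per-build logic reduces to: previous unique keys ∪ new batch → de-dupe. Here is a minimal pure-Python model of that step (dataset paths, key values, and the function name are illustrative, not from the original post):

```python
def updated_unique_keys(previous_keys, new_batch):
    """Merge a new batch of composite keys into the previously
    de-duplicated key set.

    previous_keys: keys already written to the output dataset
                   (guaranteed unique by the prior build).
    new_batch:     composite keys from rows appended since the
                   last build (may contain duplicates, both
                   internally and against previous_keys).
    """
    # Union the old unique set with the new rows, then de-dupe.
    # This touches only the previous output plus the new batch,
    # never the full history of the input dataset.
    return set(previous_keys) | set(new_batch)


# Simulated incremental builds: each build sees only new rows.
build_1 = updated_unique_keys(set(), ["a|1", "b|2", "a|1"])
build_2 = updated_unique_keys(build_1, ["b|2", "c|3"])
```

In an actual Foundry transform the same shape applies with Spark DataFrames: under the `@incremental()` decorator the input's `.dataframe()` returns only the newly appended rows, the previous build's output can be read back (via something like `out.dataframe('previous', schema)`), and after `out.set_mode('replace')` you write `previous.unionByName(new_keys).dropDuplicates()`. Treat those API names as my recollection of the incremental transforms API and check them against the Foundry docs.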