Incremental append updated rows (based on some columns) in PySpark Palantir Foundry

Question

Incremental append updated rows (based on some columns) in PySpark Palantir Foundry

1k Views Asked by Jresearcher At 04 October 2022 at 17:29

I would like to create a historical dataset on which I would like to add all NEW records of a dataset.

For NEW records I mean new records or modified records: all those that are the same for all columns except the 'reference_date' one.

I insert here the piece of code that allows me to do it on all columns, but I can't figure out how to implement the exclusion condition of a column.

Inputs:

historical (previous):

ID	A	B	dt_run
1	abc	football	2022-02-14 21:00:00
2	dba	volley	2022-02-14 21:00:00
3	wxy	tennis	2022-02-14 21:00:00

input_df (new data):

ID	A	B
1	abc	football
2	dba	football
3	wxy	tennis
7	abc	tennis

DESIRED OUTPUT (new records in bold)

ID	A	B	dt_run
1	abc	football	2022-02-14 21:00:00
2	dba	volley	2022-02-15 21:00:00
3	wxy	tennis	2022-02-01 21:00:00
2	dba	football	2022-03-15 14:00:00
7	abc	tennis	2022-03-15 14:00:00

My code which doesn't work:

@incremental(snapshot_inputs=['input_df'])
@transform(historical = Output(....), input_df = Input(....))

def append(input_df, historical):
    input_df = input_df.dataframe().withColumn('dt_run', F.to_timestamp(F.lit(datetime.now())))
    historical = historical.write_dataframe(dataset_input_df.distinct()\
                                .subtract(historical.dataframe('previous', schema=input_df.schema)))
    return historical

Original Q&A

There are 3 best solutions below

tomwhittaker On 04 October 2022 at 18:35

You can use code similar to what is found here.

Once you have combined the previous output with the new input, you just need to use PySpark to determine which is the newest row and only keep that row instead of line 19.

A possible implementation for this could be using F.row_number e.g.

import pyspark.sql.window as W
import pyspark.sql.functions as F

@incremental()
@transform(
    input_df=Input('/examples/input_df'),
    output_df=Output('/examples/output_df')
)
def incremental_group_by(input_df, output_df):
    # Get new rows
    new_input_df = input_df.dataframe().withColumn('dt_run', F.to_timestamp(F.lit(datetime.now())))


    # Union with the old rows
    out_schema = new_input_df.schema
    both_df = new_input_df.union(
        output_df.dataframe('previous', schema=out_schema)
    )

    partition_cols = ["A","B"] 
    # Get most recent row
    totals_df = totals_df.withColumn("row_number",
        F.row_number().over(W.Window.partitionBy(*partition_cols).orderBy(F.desc("dt_run")))
    ).where(F.col("row_number") == 1).drop("row_number")

    # To fully replace the output, we always set the output mode to 'replace'.
    # Checkpoint the totals dataframe before changing the output mode.
    both_df.localCheckpoint(eager=True)
    output_df.set_mode('replace')
    output_df.write_dataframe(both_df.select(out_schema.fieldNames()))

Edit : The main difference between my answer and the one above is whether you want to have multiple rows in the ouput where 'A' and 'B' are the same. It depends on your usecase which one is better!

Banu On 04 October 2022 at 20:00

I have used the union() function along with dropDulicates()

from datetime import datetime
from pyspark.sql.functions import *
from pyspark.sql.types import *
import pyspark.sql.functions as fx

def append(df_input, df_hist):
    df_union = df_hist.unionByName(df_input,allowMissingColumns=True).dropDuplicates(['ID','A','B'])
    historical = df_union.withColumn('dt_run', fx.coalesce('dt_run', fx.to_timestamp(fx.lit(datetime.now()))))
    return historical

df_hist= spark.createDataFrame( [(1,'abc','football','2022-02-14 21:00:00'),(2,'dba','volley','2022-02-14 21:00:00'),(3,'wxy','tennis','2022-02-14 21:00:00')],schema= ['ID','A','B','dt_run'])
df_hist = df_hist.withColumn('dt_run',fx.col('dt_run').cast('timestamp'))

df_input= spark.createDataFrame([(1,'abc','football'),(2,'dba','football'),(3,'wxy','tennis'),(7,'abc','tennis')],schema= ['ID','A','B'])
        
df_historical = append(df_input,df_hist)
df_historical.show(truncate=False)

**ZygD** · Accepted Answer · 2022-10-04T18:31:01.787000

I've tested the following script and it works. In the following example, you don't need to drop/select columns. Using withColumn you create the missing column in input_df and also change the values in the existing column in historical. This way you can safely do subtract on the whole dataframe. Later, since you append the data rows, the old historical rows will stay intact with their old timestamps.

from transforms.api import transform, Input, Output, incremental
from pyspark.sql import functions as F
from datetime import datetime


@incremental(snapshot_inputs=['input_df'])
@transform(
    historical=Output("...."),
    input_df=Input("....")
)
def append(input_df, historical):

    now = datetime.now()
    df_inp = input_df.dataframe().withColumn('dt_run', F.to_timestamp(F.lit(now)))
    df_hist = historical.dataframe('previous', df_inp.schema).withColumn('dt_run', F.to_timestamp(F.lit(now)))

    historical.write_dataframe(df_inp.subtract(df_hist))

Incremental append updated rows (based on some columns) in PySpark Palantir Foundry

There are 3 best solutions below

Related Questions in APACHE-SPARK

Related Questions in PYSPARK

Related Questions in APPEND

Related Questions in PALANTIR-FOUNDRY

Related Questions in INCREMENTAL-BUILD

Trending Questions

Popular # Hahtags

Popular Questions