Syncing from bronze table to silver table in Databricks in batch mode

I am syncing my bronze table to my silver table in streaming mode. The code looks like the following:

from pyspark.sql.functions import col, get_json_object

# Read a stream from the bronze table
bronzeDF = spark.readStream.table(bronze_table_path)

# Select the JSON payload column 'decodedData'
decodedData = bronzeDF.select("decodedData")

# Extract the fields to be saved as individual columns
columnsToSaveDF = decodedData.select(get_json_object(col("decodedData"), "$.tenantId").alias("tenant_id"),
                                      get_json_object(col("decodedData"), "$.eventId").alias("event_id"),
                                      get_json_object(col("decodedData"), "$.eventChannel").alias("event_channel"),
                                      get_json_object(col("decodedData"), "$.sourceType").alias("source_type"),
                                      get_json_object(col("decodedData"), "$.timestamp").alias("event_datetime"),
                                      col("decodedData").alias("event_data"))


# Replace the checkpoint location with a proper path once changes are complete
bronze_to_silver_sync = (columnsToSaveDF.writeStream
                         .outputMode("append")
                         .option("checkpointLocation", "/user/sumit/bronze-to-silver-checkpoint-13-feb-latest")
                         .toTable(silver_table_path))

This streaming process internally maintains a checkpoint, so even if the stream is interrupted it can resume from the point where it was terminated. But since running a continuous stream can be costly, I would like to convert this process to a batch job that I can schedule to run every 1 or 2 hours. How can I do that while still maintaining a checkpoint?
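
I was thinking of keeping the same Structured Streaming pipeline but running it with the availableNow trigger, roughly like the sketch below. This is only a rough idea on my part, assuming the trigger processes whatever is new according to the existing checkpoint and then stops; I am not sure it is the right or most cost-effective approach.

from pyspark.sql.functions import col, get_json_object

# Same read and column extraction as in the streaming version above
bronzeDF = spark.readStream.table(bronze_table_path)
columnsToSaveDF = bronzeDF.select(
    get_json_object(col("decodedData"), "$.tenantId").alias("tenant_id"),
    get_json_object(col("decodedData"), "$.eventId").alias("event_id"),
    get_json_object(col("decodedData"), "$.eventChannel").alias("event_channel"),
    get_json_object(col("decodedData"), "$.sourceType").alias("source_type"),
    get_json_object(col("decodedData"), "$.timestamp").alias("event_datetime"),
    col("decodedData").alias("event_data"),
)

# availableNow=True (newer Spark releases) is supposed to process everything
# that is still unprocessed according to the checkpoint and then stop, so the
# job could be scheduled every 1-2 hours instead of running continuously.
(columnsToSaveDF.writeStream
    .outputMode("append")
    .option("checkpointLocation", "/user/sumit/bronze-to-silver-checkpoint-13-feb-latest")
    .trigger(availableNow=True)
    .toTable(silver_table_path))

Would something like this keep reusing the same checkpoint, or is there a better way to run this as a scheduled batch job?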
