When appending to a delta table in an Azure ADLS Gen 2 storage container only ~1/6th of the data gets written


I have a data pipeline for telemetry that starts with exporting the logs from Azure Application Insights to a database for further processing and ends in Power BI. This processing is triggered every hour.

After migrating from classic Application Insights to workspace-based Application Insights (due to the deprecation of the former), I have been encountering problems with missing data when appending to an existing delta table in an ADLS Gen 2 storage container.

Previously, the classic Application Insights logs were continuously exported as block blobs to a storage container (Gen1). This data was processed and written to the delta table in the ADLS Gen 2 container in question.

After the migration to workspace-based Application Insights, the continuous export of the logs is written to a new delta table in a new ADLS Gen 2 storage container made specifically for this purpose. This data is processed in the same way as before and then appended to the existing delta table mentioned earlier. I made sure the schemas match exactly.

To clarify:

Old: block blob (Gen1 storage) -> delta table (ADLS Gen 2 storage)

New: delta table (ADLS Gen 2 storage) -> delta table (ADLS Gen 2 storage)

During post-migration processing, all the data is still present. However, when appending to the existing delta table, around 1/6th or 1/7th of the data does not get written. The processing in this step consists only of renaming columns so that they fit the old schema and dropping some redundant columns that were added after the migration. No filtering or complex data analysis is done at this step.
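For illustration, the reshaping step described above might look roughly like the following sketch. The object name `SchemaAlignment`, the helper `toOldSchema`, and all column names are hypothetical placeholders, not the actual ones from my pipeline. It assumes only plain renames and drops, which by themselves do not change the number of rows.

```scala
import org.apache.spark.sql.DataFrame

object SchemaAlignment {
  // Hypothetical mapping: post-migration column name -> old schema column name.
  val renames: Map[String, String] = Map(
    "TimeGenerated" -> "timestamp",
    "OperationName" -> "operation_name"
  )

  // Hypothetical columns that only exist after the migration.
  val redundant: Seq[String] = Seq("TenantId", "_ResourceId")

  // Renames and drops columns; no rows are filtered here.
  def toOldSchema(df: DataFrame): DataFrame = {
    val renamed = renames.foldLeft(df) { case (acc, (from, to)) =>
      acc.withColumnRenamed(from, to)
    }
    renamed.drop(redundant: _*)
  }
}
```

Both `withColumnRenamed` and `drop` silently ignore column names that are not present, so a typo in either list would not raise an error.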

Before and after the migration the data is written with the same Scala code:

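    import org.apache.spark.sql.streaming.Trigger

    formattedData.writeStream
      .trigger(Trigger.Once)                            // process the available data once, then stop
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", checkPointPath)     // streaming progress is tracked here
      .partitionBy(partitionName)
      .start(tableDataPath)
      .awaitTermination()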

I have tried replicating this issue by writing pre-migration data to a separate delta table and subsequently appending post-migration data to it. This was done using exactly the same scripts and the database mentioned earlier. To my surprise, all data is persisted in this test environment.
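One way to pin down where rows go missing is to compare row counts per partition value between the processed data and the target table. The sketch below is an assumption about how such a check could look, not part of the actual pipeline. The names `CountCheck` and `missingPerPartition` are hypothetical, and it assumes both sides can be read as batch DataFrames and that the compared partitions contain only post-migration data.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col, lit}

object CountCheck {
  // For each partition value in `source`, reports the row counts on both
  // sides and keeps only the partitions where the counts differ.
  def missingPerPartition(
      source: DataFrame,
      target: DataFrame,
      partitionCol: String
  ): DataFrame = {
    val src = source.groupBy(partitionCol).count()
      .withColumnRenamed("count", "source_rows")
    val tgt = target.groupBy(partitionCol).count()
      .withColumnRenamed("count", "target_rows")

    src.join(tgt, Seq(partitionCol), "left")
      // Partitions absent from the target count as zero rows written.
      .withColumn("target_rows", coalesce(col("target_rows"), lit(0L)))
      .withColumn("missing_rows", col("source_rows") - col("target_rows"))
      .where(col("missing_rows") =!= 0)
  }
}
```

In a batch context the inputs could be loaded with `spark.read.format("delta").load(...)` for the target path and for the exported source table, with the same renames and drops applied to the source first. Note that rows with a null partition value will not match in the join, so they would show up as missing.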

I can't figure out what is causing a large portion of the data not to be written correctly to the delta table.
