In our service, we noticed data loss in a Delta table: the total number of rows does not match the input that we fed in during the merge.
I confirmed the difference by counting the rows in the Delta table. The same behavior is visible in the operation metrics in the table's transaction history. One example follows.
I have added labels (A, B, ...) next to the key metrics that I used to find the issue.
{
"commitInfo": {
"timestamp": 1695221622985,
"operation": "MERGE",
"operationParameters": {
"predicate": "(((target.`PartitionId` IN ('2022', '2023')) AND (target.`PartitionId` = source.`PartitionId`)) AND (target.`Id` = source.`Id`))",
"matchedPredicates": "[{\"predicate\":\"(source.`versionnumber` > target.`versionnumber`)\",\"actionType\":\"update\"}]",
"notMatchedPredicates": "[{\"actionType\":\"insert\"}]"
},
"readVersion": 1739,
"isBlindAppend": false,
"operationMetrics": {
"numTargetRowsCopied": "12628748", (A)
"numTargetRowsDeleted": "0", (B)
"numTargetFilesAdded": "2",
"executionTimeMs": "2031980",
"numTargetRowsInserted": "10", (C)
"scanTimeMs": "126629",
"numTargetRowsUpdated": "92748", (D)
"numOutputRows": "12677631", (E)
"numSourceRows": "92758", (F)
"numTargetFilesRemoved": "2",
"rewriteTimeMs": "1904630"
}
}
}
Based on my observations, the following equations hold for most commits:
Eq1: A + C + D - B == E
Eq2: B + C + D == F
But whenever we see data loss, Eq2 holds (i.e., the input data was received correctly) but Eq1 does not (i.e., some rows were dropped somehow).
In this particular example, Eq1 is off by 43,875 rows: A + C + D - B = 12,721,506, whereas E = 12,677,631 (i.e., that many rows are missing from the Delta table).
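For anyone who wants to reproduce the arithmetic, here is a minimal sketch in plain Python that evaluates Eq1 and Eq2 against the metrics of the commit above. The helper name `check_merge_metrics` and the `eq1_gap` / `eq2_gap` keys are just names I made up for illustration; only the metric keys and values come from the commit log.

```python
def check_merge_metrics(metrics):
    """Evaluate Eq1 and Eq2 for the operationMetrics of one MERGE commit.

    Hypothetical helper for illustration; it is not part of the Delta Lake API.
    """
    # The commit log stores every metric as a string, so convert first.
    a = int(metrics["numTargetRowsCopied"])    # A
    b = int(metrics["numTargetRowsDeleted"])   # B
    c = int(metrics["numTargetRowsInserted"])  # C
    d = int(metrics["numTargetRowsUpdated"])   # D
    e = int(metrics["numOutputRows"])          # E
    f = int(metrics["numSourceRows"])          # F
    return {
        "eq1_gap": (a + c + d - b) - e,  # 0 when Eq1 holds
        "eq2_gap": (b + c + d) - f,      # 0 when Eq2 holds
    }


# Values copied from the commit shown above.
metrics = {
    "numTargetRowsCopied": "12628748",
    "numTargetRowsDeleted": "0",
    "numTargetRowsInserted": "10",
    "numTargetRowsUpdated": "92748",
    "numOutputRows": "12677631",
    "numSourceRows": "92758",
}

result = check_merge_metrics(metrics)
print(result)
```

For this commit, `eq2_gap` comes out as 0 and `eq1_gap` as 43875, which matches the gap I described. I would expect the same check to work on the `operationMetrics` map of each MERGE row returned by `DeltaTable.history()`, but I have only verified it against the values pasted here.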
I am not sure whether this is by design. Under what circumstances can such a difference be observed?
Are there any known issues in the Delta Lake 1.0 stack that can lead to this? I am looking for pointers to dig further into this issue.
We are using the PySpark SDK for Delta Lake, and compute runs on Synapse Spark pools.
Apache Spark - 3.1
Delta Lake version - 1.0
This has been investigated, and PRs are being pushed to the Delta Lake OSS repo. Posting the issue link here for completeness.
https://github.com/delta-io/delta/issues/2104