ApproxSimilarityJoin from Spark Minhash model is not able to identify two identical rows


I was evaluating the approxSimilarityJoin function of the MinHash model (in Spark 3.1.1) on millions of records, with datasetA = datasetB (a self-join). In the results I find a large number of identical rows that are wrongly not identified as duplicates. I understand that this is a probabilistic function, suited to high volumes of data, where false negatives are possible, but I would like to understand why it fails to match two records that are exactly the same.

As far as I have been able to find out, the function divides the dataset into buckets, and only records that fall into the same bucket are compared and have their distance calculated. I would therefore expect two identical records to fall into the same bucket and be identified as duplicates.
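To make that expectation concrete, here is a minimal pure-Python sketch of a MinHash signature. It is not Spark's code: the hash family `((1 + index) * a + b) % prime`, the prime value, and the way the coefficients are drawn from the seed are assumptions modeled on my understanding of the MinHash model, and the helper names and example indices are hypothetical.

```python
import random

# Assumed prime modulus; chosen to resemble the one Spark uses, not verified here.
HASH_PRIME = 2038074743


def make_hash_functions(num_hash_tables, seed):
    """Draw one (a, b) coefficient pair per hash table (hypothetical helper)."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, HASH_PRIME), rng.randrange(0, HASH_PRIME))
        for _ in range(num_hash_tables)
    ]


def minhash_signature(active_indices, hash_functions):
    """Return one min-hash value per table for the set of non-zero vector indices."""
    if not active_indices:
        raise ValueError("MinHash needs at least one non-zero index")
    return [
        min(((1 + i) * a + b) % HASH_PRIME for i in active_indices)
        for a, b in hash_functions
    ]


funcs = make_hash_functions(num_hash_tables=1, seed=1)

row_a = {4711}  # single-element vector: one non-zero index
row_b = {4711}  # an exact duplicate of row_a
row_c = {42}    # a different single-element vector

sig_a = minhash_signature(row_a, funcs)
sig_b = minhash_signature(row_b, funcs)
sig_c = minhash_signature(row_c, funcs)

print(sig_a == sig_b)  # True: identical input always yields the identical signature
print(sig_a == sig_c)
```

Under these assumptions the hashing step is deterministic, so two identical rows cannot land in different buckets, even with a single hash table.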

When I reduce the dataset size or increase numHashTables to 4, the duplicates are correctly identified. I have repeated the run many times under the same conditions, and the result is always the same.
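For reference, this is what I would expect numHashTables to change in theory. The sketch assumes a pair becomes a candidate when its MinHash values agree in at least one hash table (OR-amplification), and that each table agrees with probability equal to the Jaccard similarity of the pair. The function name is hypothetical.

```python
def candidate_probability(jaccard_similarity, num_hash_tables):
    """Probability that two rows share a bucket in at least one hash table.

    Assumes independent tables, each colliding with probability equal to
    the Jaccard similarity (the standard MinHash property).
    """
    return 1.0 - (1.0 - jaccard_similarity) ** num_hash_tables


# Identical rows have similarity 1.0, so one table should already be enough.
print(candidate_probability(1.0, 1))
print(candidate_probability(1.0, 4))

# A pair at Jaccard distance 0.65 (similarity 0.35) does benefit from more tables.
print(candidate_probability(0.35, 1))
print(candidate_probability(0.35, 4))
```

By this reasoning more hash tables should only help pairs that are similar but not identical, which is why the missing exact duplicates with numHashTables = 1 surprise me.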

I would be grateful if someone could explain the process in a bit more detail, so that I can understand why I get these results.

Additional information: threshold = 0.65, numHashTables = 1 (with 4 it works correctly and identifies all duplicates), seed = 1.

I am using the NGram and HashingTF functions to generate the vectors; during these tests I generated single-element vectors.
