I am using pyspark.ml.feature.BucketedRandomProjectionLSH to identify similar items.
I have two datasets, both of which have been vectorized. I used LSH to hash both datasets and stored the results in separate locations; the model used to transform both datasets is stored on HDFS as well. However, when I run approxSimilarityJoin against these two datasets and write the result out to Parquet, it gives me different results than when I don't write it to Parquet.
This is how I create my datasets:
from pyspark.ml.feature import BucketedRandomProjectionLSH, BucketedRandomProjectionLSHModel

# Fit the LSH model on the left dataset
brp = BucketedRandomProjectionLSH()
brp.setInputCol(output_col)
brp.setOutputCol("hashes")
brp.setSeed(12345)
brp.setBucketLength(buck_len)
brp.setNumHashTables(num_hshtbls)
model = brp.fit(dfLeft)

# Save the fitted model to HDFS and reload it
model.write().overwrite().save('LSH_model_test')
model = BucketedRandomProjectionLSHModel.load('LSH_model_test')

# Hash both datasets with the same model and persist the hashes as Parquet
dfLeft_T = model.transform(dfLeft)
dfLeft_T.write.mode('overwrite').parquet('dfLeft_transformed_test')
dfRight_T = model.transform(dfRight)
dfRight_T.write.mode('overwrite').parquet('dfRight_transformed_test')
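For context, output_col, buck_len, num_hshtbls, and cut_off are defined earlier in my script. A minimal sketch of that setup would look roughly like this (the column names and parameter values here are placeholders, not my actual configuration):

# Placeholder setup (illustrative values only): the real feature columns,
# bucket length, number of hash tables, and threshold differ in my job.
from pyspark.ml.feature import VectorAssembler

output_col = 'features'
buck_len = 2.0      # bucket length for the random projections
num_hshtbls = 3     # number of hash tables
cut_off = 1.0       # distance threshold passed to approxSimilarityJoin

# dfLeft_raw / dfRight_raw and the input column names are hypothetical
assembler = VectorAssembler(inputCols=['f1', 'f2', 'f3'], outputCol=output_col)
dfLeft = assembler.transform(dfLeft_raw)
dfRight = assembler.transform(dfRight_raw)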
To find similar items I use this:
# Read the hashed datasets back and run the approximate similarity join
dfLeft_T = spark.read.parquet('dfLeft_transformed_test')
dfRight_T = spark.read.parquet('dfRight_transformed_test')
pairs1_ = model.approxSimilarityJoin(dfLeft_T, dfRight_T, cut_off, distCol="EuclideanDistance")
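approxSimilarityJoin returns each matched pair as two struct columns, datasetA and datasetB, plus the distance column, so the pairs can be inspected with something like this (assuming each dataset carries an id column, which is my assumption here):

from pyspark.sql.functions import col

# datasetA/datasetB are struct columns wrapping the original rows;
# 'id' is an assumed identifier column present in both input datasets.
pairs1_.select(
    col('datasetA.id').alias('left_id'),
    col('datasetB.id').alias('right_id'),
    'EuclideanDistance'
).show(5)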
To get the number of pairs with 0 distance, I use this command:
from pyspark.sql.functions import col

pairs1_.filter(col('EuclideanDistance') == 0).count()
which gives 924 as output.
However, when I write pairs1_ to a Parquet file and run the same count on what I read back,
pairs1_.write.mode('overwrite').parquet('pairs1_test')
pairs1A_ = spark.read.parquet('pairs1_test')
pairs1A_.filter(col('EuclideanDistance') == 0).count()
the output is 200.
Can you help me understand why writing to Parquet might change the result of this count?
I have tried running the above multiple times, and the count is always lower when I write to Parquet than when I don't.
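In case it is relevant: since Spark evaluates lazily, each action can recompute the join from scratch. This is the kind of check I could run to see whether recomputation is involved (a debugging sketch of my own, not part of the original job; persist() just pins one materialization of the join):

# Debugging sketch (my assumption, not from the pipeline above):
# pin one materialization of the join, then compare the in-memory count
# with the count after a Parquet round trip of that same materialization.
pairs_pinned = pairs1_.persist()
print(pairs_pinned.filter(col('EuclideanDistance') == 0).count())

pairs_pinned.write.mode('overwrite').parquet('pairs1_pinned_test')
pairs_back = spark.read.parquet('pairs1_pinned_test')
print(pairs_back.filter(col('EuclideanDistance') == 0).count())

pairs_pinned.unpersist()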