I used LSH after the ALS algorithm in PySpark and everything seemed to work fine until I accidentally noticed, while exploring the results, that some rows were missing. It was all implemented following the Spark LSH documentation example: https://spark.apache.org/docs/latest/ml-features.html#tab_scala_28
When I specifically filter for the rows where idA == 1, I can find them. But when I do repartition(1).write.csv, or sort the table, none of the rows with idA == 1 appear in the output. Can someone explain how that is possible?
I'm using the Python API for Spark 2.2.0, with Python 3.6.
A little bit of code:
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.sql.functions import col

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=10.0, numHashTables=3)
model = brp.fit(Pred_Factors)  # Pred_Factors: DataFrame of ALS factors with "id" and "features" columns

# self-join Pred_Factors with itself, keeping only the two ids and the distance
table = model.approxSimilarityJoin(Pred_Factors, Pred_Factors, threshold=10.0, distCol="EuclideanDistance") \
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("EuclideanDistance")).cache()
P.S. I even tried writing the table to CSV and searching for these ids and their EuclideanDistance values - as you can see, all unsuccessful. And there are really a lot of these lost ids (it's not only id = 1). Maybe I don't understand some specifics of the LSH algorithm, but I can't work out the logic of Spark's LSH behaviour on my own.

The join result here ends up randomly partitioned, and that is why you ran into the problem. Partition the output by the key, e.g. write it with partitionBy('idA'), or alternatively use table.orderBy('idA'), to get the proper result.
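
A minimal sketch of the two options, using the column names from the question (the output paths are just examples):

# Option 1: partition the output by the join key when writing, so that all
# rows for a given idA land under the same output directory:
table.write.partitionBy("idA").csv("table_by_idA")

# Option 2: sort by idA before inspecting or writing the result:
table.orderBy("idA").show()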