I used LSH after the ALS algorithm in PySpark and everything seemed to work fine until I accidentally noticed, while exploring the results, that some rows were missing. It was all implemented following the Spark LSH documentation example: https://spark.apache.org/docs/latest/ml-features.html#tab_scala_28
When I specifically filter for the rows where idA == 1, I can find them. But when I do repartition(1).write.csv, or sort the table, none of the rows with idA == 1 appear in the output. Can someone explain how that is possible?
I'm using the Python API for Spark 2.2.0, with Python 3.6.
A little bit of code:
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.sql.functions import col

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=10.0, numHashTables=3)
model = brp.fit(Pred_Factors)  # Pred_Factors: DataFrame of ALS factors with "id" and "features" columns

# self-join Pred_Factors with itself, keeping only the two ids and the distance
table = model.approxSimilarityJoin(Pred_Factors, Pred_Factors, threshold=10.0, distCol="EuclideanDistance") \
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("EuclideanDistance")).cache()
P.S. I even tried writing the table to CSV and searching for these ids and their EuclideanDistance values - as you can see, all unsuccessful. And there are really a lot of these lost ids (it's not only id = 1). Maybe I don't understand some specifics of the LSH algorithm, but I can't work out the logic of Spark's LSH behaviour on my own.

The join result here ends up randomly partitioned, and that is why you ran into the problem. Partition the output by the key, e.g. write it with partitionBy('idA'), or alternatively use table.orderBy('idA'), to get the proper result.
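
A minimal sketch of the two options, using the column names from the question (the output paths are just examples):

# Option 1: partition the output by the join key when writing, so that all
# rows for a given idA land under the same output directory:
table.write.partitionBy("idA").csv("table_by_idA")

# Option 2: sort by idA before inspecting or writing the result:
table.orderBy("idA").show()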