I am trying to use Spark's MinHashLSH to find the nearest neighbours for each user on a very large dataset of 50,000 rows with ~5,000 features per row. Here is the relevant code.
MinHashLSH mh = new MinHashLSH()
        .setNumHashTables(3)
        .setInputCol("features")
        .setOutputCol("hashes");
MinHashLSHModel model = mh.fit(dataset);
Dataset<Row> approxSimilarityJoin = model.approxSimilarityJoin(
        dataset, dataset, config.getJaccardLimit(), "JaccardDistance");
approxSimilarityJoin.show();
The job gets stuck at the approxSimilarityJoin() call and never progresses past it. Please let me know how to solve it.
It will finish if you leave it long enough, but there are some things you can do to speed it up. Reviewing the source code shows how the algorithm works:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala
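To see why the join can be slow, it helps to recall what MinHash is doing. The following is a simplified, self-contained illustration of the idea (it is not Spark's actual implementation; the hash function here is a made-up parameterized hash, though the modulus 2038074743 is the same large prime Spark's MinHashLSH uses): each row's sparse feature set is reduced to a short signature, and the fraction of matching signature entries estimates Jaccard similarity.

```java
import java.util.*;

public class MinHashSketch {
    static final int PRIME = 2038074743; // large prime, as in Spark's MinHashLSH

    // Compute a MinHash signature for a set of non-zero feature indices.
    // Each of the numHashes slots keeps the minimum hash value seen.
    static int[] signature(Set<Integer> features, int numHashes) {
        int[] sig = new int[numHashes];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (int h = 0; h < numHashes; h++) {
            for (int f : features) {
                // Illustrative hash family only; Spark draws random coefficients.
                int hv = ((1 + 2 * h) * f + 17 * h + 5) % PRIME;
                if (hv < 0) hv += PRIME;
                if (hv < sig[h]) sig[h] = hv;
            }
        }
        return sig;
    }

    // Estimated Jaccard distance: 1 minus the fraction of equal signature slots.
    static double estimatedJaccardDistance(int[] a, int[] b) {
        int equal = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) equal++;
        }
        return 1.0 - (double) equal / a.length;
    }

    public static void main(String[] args) {
        // Two users sharing 4 of their 6 distinct features.
        Set<Integer> u1 = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5));
        Set<Integer> u2 = new HashSet<>(Arrays.asList(1, 2, 3, 4, 6));
        int[] s1 = signature(u1, 64);
        int[] s2 = signature(u2, 64);
        // Rough estimate of the true Jaccard distance 1 - 4/6.
        System.out.println(estimatedJaccardDistance(s1, s2));
    }
}
```

The point for performance: approxSimilarityJoin only has to compare rows whose signatures collide in at least one hash table, but producing those candidate pairs still requires a full shuffle of both sides of the join, which is where your job is spending its time.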
The join is probably the slow part here, as the data is shuffled. So some things to try:

1. Tune spark.sql.shuffle.partitions (the default gives you 200 partitions after a join).
2. Use spark.sql.functions.broadcast(dataset) for a map-side join.
3. Use sparse vectors for the features column.

Of these, options 1 and 2 have worked best for me, always in combination with sparse vectors.
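Applied to your code, the tuning might look like the fragment below. This is a sketch, not something I have run against your data: it assumes an existing SparkSession named spark, plus your dataset and config objects, and the partition count of 1000 is an arbitrary starting point to experiment with.

```java
import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.ml.feature.MinHashLSH;
import org.apache.spark.ml.feature.MinHashLSHModel;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// 1. Adjust shuffle parallelism before the join (the default is 200).
spark.conf().set("spark.sql.shuffle.partitions", "1000");

// 3. Make sure the "features" column holds sparse vectors, e.g.
//    Vectors.sparse(5000, nonZeroIndices, nonZeroValues),
//    rather than dense 5000-element vectors.

MinHashLSH mh = new MinHashLSH()
        .setNumHashTables(3)
        .setInputCol("features")
        .setOutputCol("hashes");
MinHashLSHModel model = mh.fit(dataset);

// 2. Broadcast one side so Spark can plan a map-side join.
Dataset<Row> approxSimilarityJoin = model.approxSimilarityJoin(
        broadcast(dataset), dataset, config.getJaccardLimit(), "JaccardDistance");
```

Broadcasting is only worthwhile if the broadcast side fits comfortably in executor memory; with 50,000 rows of sparse vectors that is plausible, but check your executor sizes first.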