What value to use for numHashTables in Spark LSH by Uber?

1.7k Views Asked by Marsellus Wallace At 21 November 2017 at 18:02

I'm trying to use approxSimilarityJoin from Spark MLlib's LSH implementation (MinHash for Jaccard distance), e.g.:

import org.apache.spark.ml.feature.MinHashLSH

val mh = new MinHashLSH()
    .setNumHashTables(5)      // the parameter this question is about
    .setInputCol("features")
    .setOutputCol("hashes")

I understand that a higher numHashTables makes the results more accurate, but also makes the computation slower and more expensive. I have two questions about this parameter:

What's the relationship between numHashTables and the MinHash fingerprint size?
How do I set the value correctly?

NOTE: I believe that the algorithm has been added to MLlib by Uber: https://eng.uber.com/lsh/

There is 1 best solution below

min fan On 12 April 2022 at 06:52

I think numHashTables is just the MinHash fingerprint size. It is probably an empirical parameter, so the right value depends on your use case. In banding terms, b * r = numHashTables (with r = 1 in the current implementation).
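
To make the b * r relation a bit more concrete, here is a rough sketch in plain Scala (no Spark dependency). It uses the textbook banding estimate that a pair with Jaccard similarity s shares at least one of b hash tables with probability 1 - (1 - s^r)^b, which reduces to 1 - (1 - s)^b when r = 1. This assumes ideal, independent MinHash functions, so treat the numbers as a starting point rather than a guarantee for Spark's implementation. The names NumHashTablesSketch, candidateProbability and tablesNeeded are made up for this illustration and are not part of any Spark API.

```scala
// Sketch under the assumption of ideal, independent MinHash functions.
// b = number of hash tables (numHashTables), r = hash functions per table.
object NumHashTablesSketch {

  // Probability that a pair with Jaccard similarity s collides in at least
  // one of the b tables: 1 - (1 - s^r)^b.
  def candidateProbability(s: Double, b: Int, r: Int = 1): Double =
    1.0 - math.pow(1.0 - math.pow(s, r), b)

  // Smallest b (with r = 1) for which a pair with similarity s is found
  // with probability at least targetRecall.
  def tablesNeeded(s: Double, targetRecall: Double): Int = {
    require(s > 0.0 && s < 1.0, "similarity must be in (0, 1)")
    require(targetRecall > 0.0 && targetRecall < 1.0, "recall must be in (0, 1)")
    math.ceil(math.log(1.0 - targetRecall) / math.log(1.0 - s)).toInt
  }

  def main(args: Array[String]): Unit = {
    val s = 0.5 // Jaccard similarity 0.5, i.e. Jaccard distance 0.5
    for (b <- Seq(1, 5, 10, 20))
      println(f"numHashTables=$b%2d  P(candidate)=${candidateProbability(s, b)}%.4f")
    println(s"tables needed for 95% recall at s=$s: ${tablesNeeded(s, 0.95)}")
  }
}
```

So one way to pick a value is to decide which distance threshold you will pass to approxSimilarityJoin and how many true matches you can afford to miss at that threshold, then take the smallest numHashTables that meets the target and check it against a sample of your own data.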
