What value to use for numHashTables in Spark LSH by Uber?


I'm trying to use approxSimilarityJoin from Spark MLlib's LSH implementation (MinHash for Jaccard distance), e.g.:

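import org.apache.spark.ml.feature.MinHashLSH

// MinHash LSH for Jaccard distance; numHashTables is the parameter in question
val mh = new MinHashLSH()
    .setNumHashTables(5)
    .setInputCol("features")
    .setOutputCol("hashes")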

I understand that the higher numHashTables is, the more accurate the results, but also the slower and more expensive the computation. I have two questions about the parameters:

  • What's the relationship between numHashTables and the MinHash fingerprint size?
  • How do I set the value correctly?

NOTE: I believe that the algorithm has been added to MLlib by Uber: https://eng.uber.com/lsh/


There is 1 best solution below

min fan On

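I think numHashTables is just the MinHash fingerprint size. It is probably an empirical parameter, so the right value depends on your use case. In the usual banding notation, b * r = numHashTables, where b is the number of bands and r the number of rows per band (currently r = 1).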
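
To get a feel for how the value affects accuracy, here is a minimal sketch of the standard banding estimate: two sets with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b. It assumes the answer's reading that r = 1, so b equals numHashTables. The object LshCollision and its methods are hypothetical helpers written for this illustration and are not part of Spark.

```scala
object LshCollision {
  // Probability that two sets with Jaccard similarity s agree in at least
  // one of b bands, each band holding r MinHash values.
  // With r = 1 (assumed here), b plays the role of numHashTables.
  def candidateProbability(s: Double, b: Int, r: Int = 1): Double =
    1.0 - math.pow(1.0 - math.pow(s, r), b)

  // Smallest b (with r = 1) such that a pair with similarity s is a
  // candidate with probability of at least `target`.
  def minTables(s: Double, target: Double): Int = {
    require(s > 0.0 && s < 1.0, "s must be strictly between 0 and 1")
    require(target > 0.0 && target < 1.0, "target must be strictly between 0 and 1")
    math.ceil(math.log(1.0 - target) / math.log(1.0 - s)).toInt
  }

  def main(args: Array[String]): Unit = {
    // Show how the candidate probability grows with the number of tables.
    for (b <- Seq(1, 5, 10, 20)) {
      val p = candidateProbability(0.5, b)
      println(f"b=$b%2d  P(candidate | s=0.5) = $p%.4f")
    }
  }
}
```

For example, with s = 0.5 and 5 tables the estimate is 1 - 0.5^5 = 0.96875, so roughly 3% of such pairs would be missed. Note that approxSimilarityJoin takes a Jaccard distance threshold, i.e. 1 - s, so a similarity of 0.5 corresponds to a distance of 0.5. A reasonable approach is to pick the lowest similarity you care about, choose the recall you need, and use the smallest table count that reaches it.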