Let's say I have to find estimate the jaccard similarity between documents A and B, and I use k random permutations of the union of these sets/documents to determine the documents' signatures.
How should I set my k value? Since setting it to a really high value would increase computation time significantly, what could be the least value of k which can give me a good jaccard index estimate?
Given error tolerance e>0 and delta, how can I determine the minimum value of k such that the jaccard index is between (1-e)jaccard_estimate and (1+e)jaccard_estimate with probability greater than or equal to (1-delta)?
I believe this can be derived using chernoff inequality bound, but I'm unable to figure about how to go about it. Any help would be appreciated. Thanks in advance!
If J' denotes the estimate and J is the true Jaccard similarity, (k * J') follows a binomial distribution with parameters k and J. As a consequence, the variance of the estimate is Var(J') = J(1-J)/k <= 1/(4*k). Therefore the standard deviation of J is bounded by stdev(J') <= 1/(2*sqrt(k)).