I was checking out Simhash module ( https://github.com/leonsim/simhash ).
I presume that the Simhash("String").distance(Simhash("Another string")) is the hamming distance between the two strings. Now, I am not sure I understand this "get_features(string) method completely, as shown in (https://leons.im/posts/a-python-implementation-of-simhash-algorithm/).
def get_features(s):
width = 2
s = s.lower()
s = re.sub(r'[^\w]+', '', s)
return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]
Now, when I try to compute distance between "aaaa" and "aaas" using the width 2, it gives out the distance as 0.
from simhash import Simhash
Simhash(get_features("aaas")).distance(Simhash(get_features("aaaa")))
I am not sure what am I missing out in here.
Dig into code
The width, in your case, is the key parameter in
get_features(), which give different splitted words. Theget_features()in your case will output like:Then Simhash calculates these list as unweighted features (which means the default weight of each feature is 1) and output like:
They are the same!
The reason is from simhash algorithm itself. Let's look into the code:
from: leonsim/simhash
The calculation process can be divied into 4 steps: 1) hash each splitted word (feature), to transform string into binary numbers; 2) weight them; 3) assumble weighted bits together; 4) change the assumbled number into binary and output as the value.
Now, in your case, the step 3 will output like:
And after step 4, the 2 output the same value.
Other parameter
If you change the width from 2 to 1, 3, 4, you will get different result of
Simhash(get_features()).Your case shows the limitation of simhash with short length text.