I am using sklearn's `pairwise_distances` to measure the similarity of different products based on their ingredients. My initial DataFrame looks like this and contains only 0s and 1s:
| Products | Ingredient 1 | Ingredient 2 | ... | Ingredient 500 |
|---|---|---|---|---|
| Product 1 | 0 | 1 | ... | 1 |
| Product 2 | 1 | 1 | ... | 0 |
| ... | ... | ... | ... | ... |
| Product 600 | 1 | 1 | ... | 1 |
To get the distance for each pair of products based on their ingredients, I computed a distance matrix with the following code:
from sklearn.metrics import pairwise_distances
X = df.to_numpy()
distance_array = pairwise_distances(X, metric='hamming')
I selected Hamming as the metric based on this article https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa because I would like to know the absolute number of ingredients that differ between each product pair. However, the matrix contains floats like 0.006 for a product combination that differs by only one ingredient, whereas I expected it to return 1 in this case.
Can anyone explain why the Hamming distance is not returning absolute numbers? Is there a more suitable metric for my use case?
Thanks a lot!!
The article states "number of values that are different between two vectors", so yes, you would expect to see `1` if only one ingredient differs, but the implementation returns the result as a proportion of the vector length, not a count. So if 2 of the 3 values differ, that's 0.6667.

If you see `0`, that means no difference. If you see `1`, it means 100% difference (i.e. all columns differ when compared).

If you want the number of differences, though, you'll need to multiply the values by the number of ingredients.
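
As a minimal sketch of that multiplication, the snippet below uses a small made-up 3-product, 4-ingredient matrix (the values are purely illustrative, not taken from the question's data). It assumes the matrix holds only the 0/1 ingredient columns, so `X.shape[1]` is the number of ingredients:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical toy data: 3 products x 4 ingredients (0 = absent, 1 = present).
X = np.array([
    [0, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
])

# Hamming distance as returned by sklearn: fraction of positions that differ.
fractions = pairwise_distances(X, metric="hamming")

# Scale by the number of ingredients (columns) to get absolute counts.
# Rounding guards against floating-point noise before casting to int.
counts = np.rint(fractions * X.shape[1]).astype(int)

print(fractions)
print(counts)
```

With the real data, the same scaling would be `distance_array * X.shape[1]`, provided `X` contains only the ingredient columns.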