Distance metric for comparing ingredient lists

I am using sklearn pairwise distances to identify the similarity of different products based on their ingredients. My initial df looks like this and contains only 0s and 1s:

Products Ingredient 1 Ingredient 2 ... Ingredient 500
Product 1 0 1 ... 1
Product 2 1 1 ... 0
... ... ... ... ...
Product 600 1 1 ... 1

I converted this to a distance matrix, which gives the distance between each pair of products based on their ingredients, by running the following code:

from sklearn.metrics import pairwise_distances

X = df.to_numpy()
distance_array = pairwise_distances(X, metric='hamming')

I selected hamming as the metric based on this article https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa because I would like to know the absolute number of ingredients that differ between each product pair. However, the matrix returns floats like 0.006 for a product combination that differs by only one ingredient, whereas I would have expected it to return 1 in this case.

Can anyone explain why the hamming distance is not returning absolute numbers? Is there a more suitable metric for my use case?

Thanks a lot!!

chitown88 On BEST ANSWER

It states "number of values that are different between two vectors", so yes, you would expect to see 1 if only one ingredient differs. However, the implementation returns the proportion of values that differ, not the count. So if 2 of the 3 values differ, that's 0.6667.

If you see 0, that means no difference. If you see 1, it means 100% difference (i.e. all columns differ when compared).

If you want the number of differences, though, you'll need to multiply the values by the number of ingredients.

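import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances

# Toy data: one row per product, one 0/1 column per ingredient
data = [
        [0,1,1],
        [1,1,0],
        [1,1,1],
        [0,1,1],
        [0,0,1]]

columns = ['Ing1', 'Ing2', 'Ing3']
df = pd.DataFrame(data=data, columns=columns)

X = df.to_numpy()
# hamming returns the fraction of columns that differ (0.0 to 1.0)
distance_array = pairwise_distances(X, metric='hamming')

products = ['Product %s' % i for i in range(1, len(df) + 1)]

distance_matrix = pd.DataFrame(distance_array)
# Assign the labels to .index; set_index is a method, not an attribute to overwrite
distance_matrix.index = products
distance_matrix.columns = products

# Multiply by the number of ingredients to get the count of differences,
# then round away floating-point noise and cast to int
distance_matrix_vals = (distance_matrix * len(columns)).round().astype(int)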
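
If you would rather skip the rescaling step, the counts can also be computed directly. The following is only a minimal sketch using plain NumPy broadcasting on the same toy data; the names diff_counts and hamming_fraction are made up for illustration. For strictly 0/1 data, the 'manhattan' metric should give the same counts, since the sum of absolute differences equals the number of differing columns.

```python
import numpy as np

# Same toy data as above: rows are products, columns are ingredients (0/1).
X = np.array([
    [0, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
])

# Compare every row with every other row via broadcasting:
# X[:, None, :] has shape (5, 1, 3) and X[None, :, :] has shape (1, 5, 3),
# so the comparison has shape (5, 5, 3). Summing the mismatches over the
# ingredient axis gives an integer count for each product pair.
diff_counts = (X[:, None, :] != X[None, :, :]).sum(axis=2)

# Dividing by the number of ingredients recovers the hamming fraction.
hamming_fraction = diff_counts / X.shape[1]

print(diff_counts)
```

Note that the intermediate comparison array holds n_products * n_products * n_ingredients booleans, so for 600 products and 500 ingredients it would be roughly 180 MB. If that is too much, rescaling the hamming output as shown above is the lighter option.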
