Distance metric for comparing ingredient lists

99 Views Asked by luisa At 07 October 2022 at 09:40

I am using sklearn pairwise distances to identify the similarity of different products based on their ingredients. My initial df looks like this and contains only 0s and 1s:

Products	Ingredient 1	Ingredient 2	...	Ingredient 500
Product 1	0	1	...	1
Product 2	1	1	...	0
...	...	...	...	...
Product 600	1	1	...	1

To get the distance between each pair of products based on their ingredients, I computed a distance matrix with the following code:

from sklearn.metrics.pairwise import pairwise_distances
X = df.to_numpy()
distance_array = pairwise_distances(X, metric='hamming')

I chose Hamming as the metric based on this article https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa because I would like to know the absolute number of ingredients that differ between each product pair. However, the matrix contains floats like 0.006 for a product pair that differs by only one ingredient, where I would have expected 1.

Can anyone explain why the Hamming distance is not returning absolute counts? Is there a more suitable metric for my use case?

Thanks a lot!!

There are 1 best solutions below

chitown88 On 07 October 2022 at 10:02 BEST ANSWER

It states "number of values that are different between two vectors", so yes, you would expect to see 1 if only one ingredient differs. However, the sklearn implementation returns the proportion of values that differ, not the count. So if 2 of the 3 values differ, that's 0.6667.

If you see 0, that means no difference. If you see 1, it means 100% difference (i.e. all columns differ when compared).

If you want the number of differences, though, you'll need to multiply the values by the number of ingredients.
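As a quick check of the proportion-versus-count point, here is a minimal sketch comparing two of the example rows by hand. It uses `scipy.spatial.distance.hamming`, which is not part of the original question and is assumed here to be the normalized form that sklearn's `'hamming'` metric also reports; the vectors `a` and `b` are illustrative.

```python
import numpy as np
from scipy.spatial.distance import hamming

# Two illustrative ingredient vectors (1 = ingredient present)
a = np.array([0, 1, 1])
b = np.array([1, 1, 0])

# Count of positions where the two vectors disagree
n_different = int((a != b).sum())

# Normalized Hamming distance: disagreeing positions / total positions
fraction = hamming(a, b)

print(n_different)         # number of differing ingredients
print(round(fraction, 4))  # share of ingredients that differ
```

The two quantities differ only by the vector length, so `fraction * len(a)` recovers the count.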

import pandas as pd
from sklearn.metrics.pairwise import pairwise_distances


data = [
        [0,1,1],
        [1,1,0],
        [1,1,1],
        [0,1,1],
        [0,0,1]]

columns = ['Ing1', 'Ing2','Ing3']
df = pd.DataFrame(data=data, columns=columns)


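X = df.to_numpy()
# 'hamming' returns the fraction of columns that differ, in the range [0, 1]
distance_array = pairwise_distances(X, metric='hamming')

products = ['Product %s' % i for i in range(1, len(df) + 1)]

# Label both rows and columns with the product names
distance_matrix = pd.DataFrame(distance_array, index=products, columns=products)

# Scale the fractions by the number of ingredients to get counts of differences
distance_matrix_vals = (distance_matrix * len(columns)).round().astype(int)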
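On the question of a more suitable metric: for strictly 0/1 data, the Manhattan (city block) distance should give the count of differing ingredients directly, since each mismatch contributes exactly 1 to the sum of absolute differences. The sketch below is an alternative to the rescaling approach, not something from the original answer, and it assumes the matrix contains only 0s and 1s.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Same toy data as above: rows are products, columns are ingredients
X = np.array([
    [0, 1, 1],
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
])

# Sum of absolute differences; on binary data this is the mismatch count
counts = pairwise_distances(X, metric='manhattan').astype(int)

# Normalized Hamming distance for comparison
fractions = pairwise_distances(X, metric='hamming')

# Both views should agree once the fraction is scaled by the column count
print(np.allclose(counts, fractions * X.shape[1]))
print(counts)
```

If the data ever contains values other than 0 and 1, the two metrics diverge, so the rescaled Hamming version is the safer choice in that case.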
