Record linkage: matching two different datasets in Python


I am trying to do fuzzy matching with recordlinkage in Python. I am matching on business name and zipcode across two different datasets.

Here is the code I am using:

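import pandas as pd
import recordlinkage

# The business name column is used as the index of each frame
reference_usa = pd.read_csv('all_reference_usa.csv', index_col='companyname')
oc_sample = pd.read_csv('oc_sample.csv', index_col='name')

# Full index: pair every record in one frame with every record in the other
indexer = recordlinkage.Index()
indexer.full()

candidates = indexer.index(reference_usa, oc_sample)
print(len(candidates))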

And here is the error.

ValueError('index of DataFrame is not unique')

The error says that the index of the DataFrame is not unique. This happens because there may be several companies with the same name but different locations. Is it possible to ignore this rule, or can I add an additional index column for zipcode? Ideally, I would like to match businesses by both name and zipcode.
