I am trying to find similar records between two DataFrames using the recordlinkage package in Python with multiprocessing, but I am running into an issue. Here is the code:
import multiprocessing as mp
import pandas as pd
import recordlinkage

# Create the candidate pairs, blocking on COL4
indexer = recordlinkage.Index()
indexer.block('COL4')
candidate_links = indexer.index(df1, df2)
# Compare object used to score the similarity of each candidate pair
compare = recordlinkage.Compare()
compare.string('COL1', 'COL1', method='lcs', label='COL1_SCORE')
compare.string('COL2', 'COL2', method='lcs', label='COL2_SCORE')
compare.date('COL3', 'COL3', label='COL3_SCORE')
# Attempt to use multiprocessing, since the comparison is an M x N operation and is extremely slow
comparison_vectors = pd.DataFrame([])
pool = mp.Pool(mp.cpu_count())
pool.apply_async(compare.compute, args=(candidate_links, df1, df2), callback=comparison_vectors.append)
pool.close()
pool.join()
When I run this, the code executes without errors, but comparison_vectors is an empty DataFrame when I print it. That means either compare.compute() is not calculating the similarity scores, or the results are not being appended to comparison_vectors.
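A minimal sketch that could help tell these two possibilities apart, using only the standard library. FrozenTable and fake_compute are hypothetical stand-ins (not part of pandas or recordlinkage), and ThreadPool is swapped in for mp.Pool only to keep the example self-contained. It assumes, as was the case for the old DataFrame.append, that append() returns a new object instead of modifying the original, and it relies on apply_async ignoring whatever the callback returns.

```python
from multiprocessing.pool import ThreadPool


class FrozenTable:
    """Hypothetical stand-in for a DataFrame whose append() is not in-place."""

    def __init__(self, rows=()):
        self.rows = tuple(rows)

    def append(self, other):
        # Builds and returns a new object; self is left untouched.
        return FrozenTable(self.rows + tuple(other))


def fake_compute(pairs):
    # Hypothetical stand-in for compare.compute(): one score row per pair.
    return [(a, b, a == b) for a, b in pairs]


pairs = [(1, 1), (1, 2)]
table = FrozenTable()
collected = []

with ThreadPool(2) as pool:
    # The callback returns a new object, and apply_async discards that return value.
    pool.apply_async(fake_compute, args=(pairs,), callback=table.append)
    # The callback mutates its target in place, so the result is kept.
    pool.apply_async(fake_compute, args=(pairs,), callback=collected.append)
    pool.close()
    pool.join()

print(len(table.rows))   # 0: the computed rows were thrown away
print(len(collected))    # 1: one result object was collected
```

If the real code behaves like the first call, compare.compute() would be running fine and only the collection step would be losing the result.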
When I skip multiprocessing and call the function directly,
comparison_vectors = compare.compute(candidate_links, df1, df2)
comparison_vectors contains the scores calculated by compare.compute(), as expected.
Can someone help me understand what I am doing wrong here?
TIA