Python multiprocessing apply_async with recordlinkage Compare.compute returns an empty DataFrame


I am trying to find similar records between two DataFrames using the recordlinkage package in Python, with multiprocessing to speed up the comparison, but I am running into an issue. Here is the code:

    import multiprocessing as mp
    import pandas as pd
    import recordlinkage

    # Recordlinkage part: create the candidate pairs
    indexer = recordlinkage.Index()
    indexer.block('COL4')
    candidate_links = indexer.index(df1, df2)

    # Compare object created to identify similar records
    compare = recordlinkage.Compare()
    compare.string('COL1', 'COL1', method='lcs', label='COL1_SCORE')
    compare.string('COL2', 'COL2', method='lcs', label='COL2_SCORE')
    compare.date('COL3', 'COL3', label='COL3_SCORE')

    # Attempt to multiprocess, since the comparison is an MxN operation and is extremely slow
    comparison_vectors = pd.DataFrame([])
    pool = mp.Pool(mp.cpu_count())
    pool.apply_async(compare.compute, args=(candidate_links, df1, df2), callback=comparison_vectors.append)
    pool.close()
    pool.join()

When I run this, the code executes without errors, but comparison_vectors is an empty DataFrame when I print it. That means either compare.compute() is not calculating the similarities, or the results are not being appended to comparison_vectors.
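One likely culprit is the callback rather than compare.compute(). DataFrame.append never modified a DataFrame in place; it returned a new one (and the method was removed in pandas 2.0), while apply_async discards whatever the callback returns. The sketch below reproduces that behaviour under stated assumptions: Frame is a hypothetical stand-in for a DataFrame, compute is a toy stand-in for compare.compute, and ThreadPool (a subclass of Pool with the same apply_async interface) is used only so the snippet runs on its own.

```python
from multiprocessing.pool import ThreadPool


class Frame:
    """Hypothetical stand-in for a DataFrame: append() returns a new object."""

    def __init__(self, rows=()):
        self.rows = tuple(rows)

    def append(self, other):
        # Like the old DataFrame.append, this leaves self untouched.
        return Frame(self.rows + tuple(other))


def compute(x):
    """Toy stand-in for compare.compute."""
    return [x * 2]


frame = Frame()   # plays the role of comparison_vectors
collected = []    # a plain list, which extend() mutates in place

pool = ThreadPool(2)
# The callback's return value (the new Frame) is thrown away by the pool.
pool.apply_async(compute, args=(21,), callback=frame.append)
# A mutating callback does keep the result, and so does AsyncResult.get().
handle = pool.apply_async(compute, args=(21,), callback=collected.extend)
direct = handle.get()
pool.close()
pool.join()

print(frame.rows)   # () : still empty, like the blank comparison_vectors
print(collected)    # [42]
print(direct)       # [42]
```

Calling get() on the object returned by apply_async has a second benefit: it re-raises any exception from the worker, which apply_async otherwise swallows silently.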

Also, when I skip multiprocessing and run it the straightforward way, comparison_vectors = compare.compute(candidate_links, df1, df2), the comparison_vectors DataFrame contains the scores calculated by compare.compute().
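Note also that a single apply_async call submits a single task, so even with a working callback only one worker process would run the whole comparison. A minimal sketch of spreading the work out, assuming the candidate pairs can be sliced into chunks and the per-chunk results joined afterwards. split_into_chunks and compare_in_parallel are hypothetical helpers, and score_chunk is a toy stand-in for a call like compare.compute(chunk, df1, df2):

```python
import multiprocessing as mp


def score_chunk(pairs):
    """Toy stand-in for compare.compute on one chunk of candidate pairs."""
    return [(a, b, float(a == b)) for a, b in pairs]


def split_into_chunks(items, n_chunks):
    """Split a sequence into at most n_chunks contiguous slices."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def compare_in_parallel(pairs, workers=2):
    """Score each chunk in its own worker, then flatten the results in order."""
    chunks = split_into_chunks(pairs, workers)
    with mp.Pool(workers) as pool:
        parts = pool.map(score_chunk, chunks)  # one task per chunk
    return [row for part in parts for row in part]


if __name__ == "__main__":
    pairs = [("a", "a"), ("a", "b"), ("c", "c"), ("d", "e"), ("f", "f")]
    print(compare_in_parallel(pairs, workers=2))
```

With pandas, the per-chunk DataFrames would presumably be combined with pd.concat instead of a list comprehension. As far as I know, recordlinkage.Compare also accepts an n_jobs argument for parallel comparison, so it is worth checking the documentation for the installed version before hand-rolling a pool.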

Can someone help me understand what I am doing wrong here?

TIA
