Match two DataFrames if at least 3 keywords match


I have 2 dataframes, df1, df2.

df1 contains fake news stories

import pandas as pd

story = [[1, 'sun rotates around earth'],[2,'coffee is harmful for health, it causes cancer. It helps noone']]

df1 = pd.DataFrame(story, columns = ['story_no', 'story'])

df2 contains OCR texts extracted from posts

ocrdata = [[1234, 'sun, earth and everything in the Milky Way galaxy rotating around a black hole'],[3456,'I love coffee'],[5678,'coffee helps preventing cancer and keep us healthy']]

df2 = pd.DataFrame(ocrdata, columns = ['postid', 'OCR_text'])

I want to match a row of df1 with a row of df2 if at least 3 keywords match between df1['story'] and df2['OCR_text'].

In the example above, the first row of df2 matches the first story because three keywords match ('sun', 'earth', 'rotate'), where word forms such as 'rotates' and 'rotating' count as the same keyword. The second row of df2 shares only 'coffee' with the second story (fewer than 3 keywords), so it is not a match. The third row of df2 shares four keywords with the second story ('coffee', 'health', 'cancer', 'help'), so it is a match.
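To make the rule concrete, here is a minimal sketch of the "at least 3 shared keywords" check for a single story/post pair. The stop-word list and the `crude_stem` helper are hypothetical stand-ins chosen for illustration (a real stop-word list and a real stemmer, for example from NLTK, would behave differently), and the names `keyword_set` and `common_keywords` are not from the question.

```python
import re

# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = {"a", "and", "the", "is", "it", "in", "for", "us", "i"}

def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s", "y"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def keyword_set(text):
    """Lowercase, keep alphabetic tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(t) for t in tokens if t not in STOP_WORDS}

def common_keywords(story, ocr_text):
    """Sorted keywords shared by the two texts."""
    return sorted(keyword_set(story) & keyword_set(ocr_text))

story = "coffee is harmful for health, it causes cancer. It helps noone"
post = "coffee helps preventing cancer and keep us healthy"

shared = common_keywords(story, post)
is_match = len(shared) >= 3  # the "at least 3 keywords" rule
print(shared, is_match)  # ['cancer', 'coffee', 'health', 'help'] True
```

With this small stop-word list, 'around' also counts as a shared keyword for the first story; adding it to `STOP_WORDS` would leave exactly 'sun', 'earth' and the stem of 'rotate'.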

I want output as

Columns: [story_no, story, postid, OCR_text, matching_keywords]

So the sample output for above dataframes would be

story_no story postid OCR_text matching_keywords
1 'sun rotates around earth' 1234 'sun, earth and everything in the Milky Way galaxy rotating around a black hole' 'sun', 'earth', 'rotate'
2 'coffee is harmful for health, it causes cancer. It helps noone' 5678 'coffee helps preventing cancer and keep us healthy' 'coffee', 'health', 'cancer', 'help'

I have created a function that returns a keyword list:

import nltk
from nltk.corpus import stopwords

# Assumed: NLTK's English stop-word list (stop_words was not defined in the snippet)
stop_words = set(stopwords.words('english'))

def get_keywords1(row):
    # Lowercase first so that stop-word filtering is case-insensitive
    lowered = row.lower()
    tokens = nltk.tokenize.word_tokenize(lowered)
    # Keep alphabetic tokens that are not stop words
    keywords = [keyword for keyword in tokens if keyword.isalpha() and keyword not in stop_words]
    return keywords

But I am not able to match the two dataframes according to the requirement above. I am new to Python and pandas, so any help would be highly appreciated.

Thanks
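One possible way to pair the rows, sketched under assumptions: cross join the two frames so every story meets every post, compute the shared keywords per pair, and keep pairs with at least 3. The function names `simple_keywords` and `match_frames`, the `extract` and `min_common` parameters, and the tiny stop-word list are hypothetical. The default extractor compares exact lowercase words without stemming, so 'rotates' and 'rotating' do not count as the same keyword here; an extractor that stems can be passed through `extract`.

```python
import re
import pandas as pd

# Hypothetical minimal stop-word list, only for this illustration.
STOP_WORDS = {"a", "and", "the", "is", "it", "in", "for", "us", "i"}

def simple_keywords(text):
    """Lowercased alphabetic tokens minus stop words (no stemming)."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}

def match_frames(df1, df2, extract=simple_keywords, min_common=3):
    # Cross join pairs every story with every post (needs pandas >= 1.2).
    pairs = df1.merge(df2, how="cross")
    # Shared keywords for each (story, OCR_text) pair, sorted for stable output.
    pairs["matching_keywords"] = [
        sorted(extract(s) & extract(t))
        for s, t in zip(pairs["story"], pairs["OCR_text"])
    ]
    # Keep only pairs that share at least min_common keywords.
    keep = pairs["matching_keywords"].map(len) >= min_common
    return pairs[keep].reset_index(drop=True)

story = [[1, 'sun rotates around earth'],
         [2, 'coffee is harmful for health, it causes cancer. It helps noone']]
ocrdata = [[1234, 'sun, earth and everything in the Milky Way galaxy rotating around a black hole'],
           [3456, 'I love coffee'],
           [5678, 'coffee helps preventing cancer and keep us healthy']]

df1 = pd.DataFrame(story, columns=['story_no', 'story'])
df2 = pd.DataFrame(ocrdata, columns=['postid', 'OCR_text'])

result = match_frames(df1, df2)
print(result[['story_no', 'postid', 'matching_keywords']])
```

A cross join grows as len(df1) * len(df2), so for large frames it may be worth precomputing the keyword sets once per row or filtering candidates first.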
