lemmatization or normalization using a dictionary and list of variations


I have a pandas DataFrame with a string column of transaction descriptions. I am trying to do some manual lemmatization. I have manually created a dictionary that has the main word as the key and a list of variations of that word as the value. I would like to replace every variation in the list with the main word.

Here is example code for the data I have:

import pandas as pd
list1 = ['0412 UBER TRIP HELP.UBER.COMCA',
'0410 UBER TRIP HELP.UBER.COMCA',
'MOBILE PURCHASE 0410 VALENCIA WHOLE FOODS SAN FRANCISCOCA',
'WHOLEFDS WBG#1 04/13 PURCHASE WHOLEFDS WBG#104 BROOKLYN NY',
'0414 LYFT *CITI BIKE BIK LYFT.COM CA',
'0421 WALGREENS.COM 877-250-5823 IL',
'0421 Rapha Racing PMT LLC XXX-XX72742 OR',
'0422 UBER EATS PAYMENT HELP.UBER.COMCA',
'0912 WHOLEFDS NOE 10379 SAN FRANCISCOCA',
'PURCHASE 1003 CAVIAR*JUNOON WWW.DOORDASH.CA']
df = pd.DataFrame(list1, columns = ['feature'])

map1 = {'payment': ['pmts', 'pmnt', 'pmt', 'pyment', 'pymnts'],
        'account': ['acct'],
        'pharmacy': ['walgreens', 'walgreen', 'riteaid', 'cvs', 'pharm'],
        'food_delivery': ['uber eats', 'doordash', 'seamless', 'grubhub', 'caviar'],
        'ride_share': ['uber', 'lyft'],
        'whole_foods': ['wholefds', 'whole foods', 'whole food']
}

I know how to do it one word at a time using df['feature'].str.replace('variation', 'main word'). However, this is laborious and time-consuming. Is there a faster way to do this? Thank you.

Corralien On BEST ANSWER

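Reverse your map so that each variation becomes a regex pattern mapped to its main word, then pass the result to replace with regex=True:

reverse_map1 = {rf'(?i)\b{v}\b': k for k, variations in map1.items() for v in variations}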
df['feature'] = df['feature'].replace(reverse_map1, regex=True)

Output:

>>> df
                                                            feature
0                        0412 ride_share TRIP HELP.ride_share.COMCA
1                        0410 ride_share TRIP HELP.ride_share.COMCA
2         MOBILE PURCHASE 0410 VALENCIA whole_foods SAN FRANCISCOCA
3  whole_foods WBG#1 04/13 PURCHASE whole_foods WBG#104 BROOKLYN NY
4                  0414 ride_share *CITI BIKE BIK ride_share.COM CA
5                                 0421 pharmacy.COM 877-250-5823 IL
6                      0421 Rapha Racing payment LLC XXX-XX72742 OR
7                  0422 food_delivery PAYMENT HELP.ride_share.COMCA
8                        0912 whole_foods NOE 10379 SAN FRANCISCOCA
9           PURCHASE 1003 food_delivery*JUNOON WWW.food_delivery.CA

Details:

>>> reverse_map1
{'(?i)\\bpmts\\b': 'payment',
 '(?i)\\bpmnt\\b': 'payment',
 '(?i)\\bpmt\\b': 'payment',
 '(?i)\\bpyment\\b': 'payment',
 '(?i)\\bpymnts\\b': 'payment',
 '(?i)\\bacct\\b': 'account',
 '(?i)\\bwalgreens\\b': 'pharmacy',
 '(?i)\\bwalgreen\\b': 'pharmacy',
 '(?i)\\briteaid\\b': 'pharmacy',
 '(?i)\\bcvs\\b': 'pharmacy',
 '(?i)\\bpharm\\b': 'pharmacy',
 '(?i)\\buber eats\\b': 'food_delivery',
 '(?i)\\bdoordash\\b': 'food_delivery',
 '(?i)\\bseamless\\b': 'food_delivery',
 '(?i)\\bgrubhub\\b': 'food_delivery',
 '(?i)\\bcaviar\\b': 'food_delivery',
 '(?i)\\buber\\b': 'ride_share',
 '(?i)\\blyft\\b': 'ride_share',
 '(?i)\\bwholefds\\b': 'whole_foods',
 '(?i)\\bwhole foods\\b': 'whole_foods',
 '(?i)\\bwhole food\\b': 'whole_foods'}
  • (?i): case-insensitive matching
  • \b...\b: word boundaries, so a variation only matches as a whole word
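
To see what these two pieces do in isolation, here is a small sketch using Python's re module directly on one pattern of the same shape as the keys of reverse_map1. The variable names dotted and glued are only illustrative, and 'UBEREATS' is a made-up string that does not come from the data above.

```python
import re

# Same shape as the keys of reverse_map1
pattern = r'(?i)\buber\b'

# Matches despite the upper case: '.' is a non-word character,
# so there is a word boundary on both sides of 'UBER'
dotted = re.sub(pattern, 'ride_share', 'HELP.UBER.COMCA')

# No match: 'UBER' is directly followed by a letter,
# so there is no word boundary after it
glued = re.sub(pattern, 'ride_share', 'UBEREATS')

print(dotted)
print(glued)
```

This is also why 'HELP.UBER.COMCA' gets replaced in the output above while the trailing 'COMCA' is left alone.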

Update

If you don't need to preserve the original case, you can lowercase the column first and drop the (?i) flag:

reverse_map1 = {rf'\b{v}\b': k for k, variations in map1.items() for v in variations}
df['feature'] = df['feature'].str.lower().replace(reverse_map1, regex=True)
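
As a possible variation on the same idea (not part of the answer above), the variations can be combined into one compiled pattern and substituted in a single pass. This is a minimal sketch under a few assumptions: the names lookup, pattern and normalize are hypothetical, variations are sorted longest first so that 'uber eats' is tried before 'uber', and re.escape is applied in case a variation ever contains a regex metacharacter.

```python
import re

map1 = {'payment': ['pmts', 'pmnt', 'pmt', 'pyment', 'pymnts'],
        'account': ['acct'],
        'pharmacy': ['walgreens', 'walgreen', 'riteaid', 'cvs', 'pharm'],
        'food_delivery': ['uber eats', 'doordash', 'seamless', 'grubhub', 'caviar'],
        'ride_share': ['uber', 'lyft'],
        'whole_foods': ['wholefds', 'whole foods', 'whole food']}

# variation -> main word (hypothetical helper, keys are lower case)
lookup = {v.lower(): k for k, variations in map1.items() for v in variations}

# Longest variations first, so multi-word entries win over their prefixes
alternatives = sorted(lookup, key=len, reverse=True)
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(v) for v in alternatives) + r')\b',
    flags=re.IGNORECASE,
)

def normalize(text):
    """Replace every known variation in text with its main word."""
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

samples = ['0422 UBER EATS PAYMENT HELP.UBER.COMCA',
           '0421 Rapha Racing PMT LLC XXX-XX72742 OR',
           '0421 WALGREENS.COM 877-250-5823 IL']
results = [normalize(s) for s in samples]
for line in results:
    print(line)
```

On the DataFrame from the question this would be applied with df['feature'].map(normalize). Unlike the dictionary passed to replace, the result here does not depend on the order in which the variations were listed in map1.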