I am cleaning a corpus using the following code:
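token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']
# union of every word list to drop: stop words, cities, countries, first names, last names, other words
to_remove = set().union(stopword, city, country, firstname, lastname, otherword)
set(token) - to_remove
{'account', 'follow'}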
Because I take the set of the tokens, the frequency of repeated words is lost, which hurts TF-IDF performance. I want to keep the frequency of each word in the output. I have a large corpus, and removing the words manually with a for loop takes a week, whereas the code above finishes the job in about 1 hour 30 minutes.
The output I want, in the fastest possible way:
['account', 'follow', 'follow', 'account']
Try this; hopefully it will help you.
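A minimal sketch, assuming the words to drop are already collected in a set: instead of taking a set difference, filter the token list with a list comprehension. The small `to_remove` below is a made-up stand-in for the real union of stop words, cities, countries and names. Membership tests against a set take constant time on average, so this is a single pass over the tokens, and it keeps both duplicates and order.

```python
from collections import Counter

token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']

# Hypothetical removal set for illustration only; in the real corpus this
# would be the union of stop words, cities, countries, names, and so on.
to_remove = {'hi', 'is', 'delhi'}

# Keep every token that is not in the removal set. Because to_remove is a
# set, each "not in" check is O(1) on average, and duplicates are preserved.
cleaned = [t for t in token if t not in to_remove]
print(cleaned)  # ['account', 'follow', 'follow', 'account']

# Optional: term frequencies of the cleaned tokens.
counts = Counter(cleaned)
print(counts['follow'])  # 2
```

The important part is that `to_remove` stays a set (not a list), otherwise every membership check scans the whole collection and the loop becomes slow again on a large corpus.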