I've been trying to find duplicates in a column called "text" where cells match at least 90%, keep only the first row of each duplicate group, and remove the rest. The result should then be written to a new CSV file.
I have tried to do so with this MWE; however, it seems to create two new columns called "Matches" and "Combined" that I don't need, as a new CSV without the duplicates and with only the first occurrence is the eventual goal.
import pandas as pd
from dedupe_FuzzyWuzzy import deduplication
df = pd.read_csv('/path/input.csv')
# normal duplication drop
df = df.drop_duplicates(subset='text', keep='first')
# threshold drop
df_final = deduplication.deduplication(df, ['text'], threshold=90)
# send output to csv
df_final.to_csv('/path/deduplicated.csv',index=False)
This code, with a basic example, uses rapidfuzz to mark fuzzy-matched duplicates in a text column of a pandas DataFrame. Note that a higher threshold means stricter matching. The code walks through the list of text values from the column, checks each pair for fuzzy duplication, and marks the later row for deletion; the deletion list is then used as a mask to remove the selected DataFrame rows.