Let's assume we have a DataFrame that looks like this:
| Index | FirstName | Surname | Adress | Source |
|---|---|---|---|---|
| 1 | Paul | Baggins | same | good |
| 2 | Paaul | Baggins | same | bad |
| 3 | Mary | Baggins | same | good |
| 4 | Mary | Baggins | same | bad |
| 5 | Lucy | Smith | other | bad |
We want to clean up the DataFrame. First we filter for people living at the same address; we can be sure that addresses are unique for each household. Within each household we want to delete potential duplicates, because we used different data sources and, unfortunately, there might be some typing errors in the column "FirstName".
How can we delete the duplicates (in our case index rows 2 and 4)?
I found out that we could delete "exact" duplicates by using
df.drop_duplicates(subset=['FirstName','Surname', 'Adress'], keep='first')
This way index 4 will be deleted, but index 2 is kept because "Paaul" is not an exact match for "Paul". So this is not what I am looking for.
To delete index row 2 I want to compare the text of "FirstName" for indexes 1 and 2, and I tried using the following function:
from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
similar(text_a, text_b)
Using
similar('Paul', 'Paaul')
results in 0.888
But I don't see how to put all this together.
The "manipulated" DataFrame should look like this:
| Index | FirstName | Surname | Adress | Source | Similar_to_Index | Ratio |
|---|---|---|---|---|---|---|
| 1 | Paul | Baggins | same | good | 2 | 0.8888 |
| 2 | Paaul | Baggins | same | bad | 1 | 0.8888 |
| 3 | Mary | Baggins | same | good | 4 | 1.0 |
| 4 | Mary | Baggins | same | bad | 3 | 1.0 |
| 5 | Lucy | Smith | other | bad | NaN | NaN |
Then index rows 2 and 4 should be deleted by the rule that the relevant ratio is > 0.8 and the column "Source" is labeled "bad". I guess the problem is how to create the column "Similar_to_Index".
The final result should be like this:
Cleaned Dataframe:
| Index | FirstName | Surname | Adress | Source | Similar_to_Index | Ratio |
|---|---|---|---|---|---|---|
| 1 | Paul | Baggins | same | good | 2 | 0.8888 |
| 3 | Mary | Baggins | same | good | 4 | 1.0 |
| 5 | Lucy | Smith | other | bad | NaN | NaN |
Deleted_Entries_Dataframe:
| Index | FirstName | Surname | Adress | Source | Similar_to_Index | Ratio |
|---|---|---|---|---|---|---|
| 2 | Paaul | Baggins | same | bad | 1 | 0.8888 |
| 4 | Mary | Baggins | same | bad | 3 | 1.0 |
Thank you very much for any suggestions and help.
You can use `itertools.combinations` to check all pairs of names, then select the top match.
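A minimal sketch of this idea, assuming the sample data from the question and its `similar` function. The `best_matches` helper and the choice to group by "Adress" alone are assumptions made for illustration, not something stated above:

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame(
    {
        "FirstName": ["Paul", "Paaul", "Mary", "Mary", "Lucy"],
        "Surname": ["Baggins", "Baggins", "Baggins", "Baggins", "Smith"],
        "Adress": ["same", "same", "same", "same", "other"],
        "Source": ["good", "bad", "good", "bad", "bad"],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Index"),
)


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def best_matches(group):
    """Hypothetical helper: best-matching other row for each row of a household."""
    best = {}  # index label -> (ratio, index label of the best match)
    for (i, a), (j, b) in combinations(group["FirstName"].items(), 2):
        ratio = similar(a, b)
        # The comparison is symmetric, so record the pair for both rows.
        for this, other in ((i, j), (j, i)):
            if this not in best or ratio > best[this][0]:
                best[this] = (ratio, other)
    return best


# Assumption: one household per address, so only compare within an address.
matches = {}
for _, group in df.groupby("Adress"):
    matches.update(best_matches(group))

# Rows without any partner (a household of one) stay NaN after alignment.
df["Similar_to_Index"] = pd.Series(
    {label: m[1] for label, m in matches.items()}, dtype="float64"
)
df["Ratio"] = pd.Series({label: m[0] for label, m in matches.items()}, dtype="float64")
```

Comparing every pair is quadratic in the household size, which should be fine as long as households are small.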
Splitting the data in two:
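A minimal sketch of such a split, assuming the "Similar_to_Index" and "Ratio" columns already exist with the values shown in the question's table, and applying the rule stated in the question (ratio > 0.8 and "Source" equal to "bad"). The names `cleaned` and `deleted` are illustrative:

```python
import numpy as np
import pandas as pd

# The "manipulated" DataFrame from the question, typed in by hand.
df = pd.DataFrame(
    {
        "FirstName": ["Paul", "Paaul", "Mary", "Mary", "Lucy"],
        "Surname": ["Baggins", "Baggins", "Baggins", "Baggins", "Smith"],
        "Adress": ["same", "same", "same", "same", "other"],
        "Source": ["good", "bad", "good", "bad", "bad"],
        "Similar_to_Index": [2, 1, 4, 3, np.nan],
        "Ratio": [0.8888, 0.8888, 1.0, 1.0, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Index"),
)

# NaN > 0.8 evaluates to False, so rows without a match are never flagged.
is_duplicate = (df["Ratio"] > 0.8) & (df["Source"] == "bad")

cleaned = df[~is_duplicate]  # rows to keep
deleted = df[is_duplicate]   # rows removed as near-duplicates
```

Note that this rule drops both rows of a similar pair if both come from the "bad" source, and keeps both if both are "good".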
NB: if you sort the rows so that the "good" source is on top, those rows will be kept preferentially as non-duplicates.
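One way to read this note, sketched under assumptions: rank the sources explicitly, sort so that "good" rows come first, and treat a row as a duplicate only when its best match appears earlier in that order. The priority mapping, the 0.8 threshold taken from the question, and the `is_later_duplicate` helper are all illustrative:

```python
import numpy as np
import pandas as pd

# The "manipulated" DataFrame from the question, typed in by hand.
df = pd.DataFrame(
    {
        "FirstName": ["Paul", "Paaul", "Mary", "Mary", "Lucy"],
        "Surname": ["Baggins", "Baggins", "Baggins", "Baggins", "Smith"],
        "Adress": ["same", "same", "same", "same", "other"],
        "Source": ["good", "bad", "good", "bad", "bad"],
        "Similar_to_Index": [2, 1, 4, 3, np.nan],
        "Ratio": [0.8888, 0.8888, 1.0, 1.0, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Index"),
)

# Explicit priority (lower is preferred); a stable sort keeps ties in order.
priority = df["Source"].map({"good": 0, "bad": 1})
df_sorted = df.loc[priority.sort_values(kind="stable").index]

# Position of every row in the sorted order.
position = {label: pos for pos, label in enumerate(df_sorted.index)}


def is_later_duplicate(row):
    """Hypothetical rule: duplicate if the best match sits higher in the order."""
    match = row["Similar_to_Index"]
    if pd.isna(match) or row["Ratio"] <= 0.8:
        return False
    return position[match] < position[row.name]


is_duplicate = df_sorted.apply(is_later_duplicate, axis=1).astype(bool)
kept = df_sorted[~is_duplicate].sort_index()
removed = df_sorted[is_duplicate].sort_index()
```

With this ordering rule exactly one row of each similar pair survives, even when both rows come from the same source.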