I am trying to find an efficient way of grouping duplicate contacts together based on any of 3 criteria: email; phone number & name; account number & name; and then creating a master_contact_id for each group. Here is an example to illustrate what I am trying to achieve:
Data looks like this:
ContactID  Name          Email              Phone&Name                Account&Name
12345      Bob Smith     [email protected]  234-243-2432Bob Smith     A1234Bob Smith
42023      Bob Smith     [email protected]  234-243-2432Bob Smith     B1234Bob Smith
50203      Bob S.        [email protected]  234-243-2432Bob S.       Z1234Bob S.
20394      Clara Sakshi  [email protected]  123-123-1234Clara Sakshi  Q1231Clara Sakshi
29930      Clara Sakshi  [email protected]  234-243-2432Clara Sakshi  A1234Clara Sakshi
92303      Clara Sakshi  [email protected]  999-999-1234Clara Sakshi  Q1231Clara Sakshi
Desired output:
Master ContactID  ContactID  Notes (not part of output)
1                 12345      related to 50203 by email match
1                 42023      related to 12345 by name and number match
1                 50203      related to 12345 by email match
2                 20394      related to 92303 by account number and name match
3                 29930      not related to any other contacts
2                 92303      related to 20394 by account number and name match
I have achieved the desired output by unpivoting the contacts table in SQL and then applying the graph walking technique described here: Grouping 'groups' with common element
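For reference, the unpivot step looked roughly like this (a simplified sketch; the table and column names are illustrative):

```sql
-- Every matching key becomes one (contact_id, match_key) row, so that
-- contacts sharing any one key end up sharing a match_key value.
SELECT contact_id, email        AS match_key FROM contacts
UNION ALL
SELECT contact_id, phone_name   AS match_key FROM contacts
UNION ALL
SELECT contact_id, account_name AS match_key FROM contacts;
```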
Unfortunately, the runtime of this solution is not viable: it took almost an hour to run on a sample of 1,000 records, the runtime increases exponentially as the data set grows, and I have over 250,000 contacts.
Any insights as to how to achieve this more efficiently (either in SQL or Python) would be greatly appreciated.
Please take note that I am essentially a beginner in SQL, and I have just started dabbling in Python in the hope of finding an alternative solution.
Thank you
Luc
You wish to dedup records, efficiently.
Let's start with Email. Create an email_report table with a PK of Email and a UNIQUE KEY of ContactID. You can produce it with a simple GROUP BY; take MIN(ContactID) so we prefer the first ContactID added to the dataset rather than subsequent dups. Any "unique constraint violated" messages that crop up here correspond to data-cleaning issues, and it's probably best to just discard that handful of records. Now each Email in the dataset maps to exactly one canonical ContactID, 1:1.
Repeat for Phone&Name, and again for Account&Name, giving us a base table plus three report tables. (We haven't set up FK relations, but we certainly could.)
You showed example output with, roughly, SELECT mail, phone, acct. We are now in a position to produce those same three columns via JOINs, expressed in the common term of ContactID. Then it's a business rule, which you haven't touched on, for what to do in the ambiguous case where the three ContactIDs don't match. Maybe 2-out-of-3 majority wins on the relevant rows? Maybe every distinct tuple spins up a new distinct MasterContactID? Perhaps we simply compute MIN( ... ) across the 3 IDs? Up to you. Given the indexing, computing such results across a quarter-million rows will go very fast. Read those rows into a Python script and apply the business logic you prefer.
Normalizing all text values beforehand probably wouldn't hurt. Lowercasing is straightforward. Maybe delete umlauts, or normalize each to one of the three corresponding vowels. Turn punctuation like "-" and "." into a space, then compress repeated spaces down to a single one. There's a whole range of phonetic algorithms that Python could apply, including Metaphone, and Postgres can even apply Soundex within a SELECT.
The SQL should resemble something like this (or use a CTE, though debugging would be less convenient). This is a sketch; the table and column names, contacts(contact_id, name, email, phone_name, account_name), are assumptions to adjust for your schema:
The same pattern covers phone and account; then report on them. LEAST() below stands in for whichever business rule you choose:
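```sql
-- Same pattern for the other two keys (names, again, are assumptions):
CREATE TABLE phone_report (
    phone_name TEXT    PRIMARY KEY,
    contact_id INTEGER UNIQUE
);
INSERT INTO phone_report (phone_name, contact_id)
SELECT lower(trim(phone_name)), MIN(contact_id)
FROM contacts
GROUP BY lower(trim(phone_name));

CREATE TABLE account_report (
    account_name TEXT    PRIMARY KEY,
    contact_id   INTEGER UNIQUE
);
INSERT INTO account_report (account_name, contact_id)
SELECT lower(trim(account_name)), MIN(contact_id)
FROM contacts
GROUP BY lower(trim(account_name));

-- Each contact alongside its three canonical IDs, ready for whatever
-- business rule you settle on; LEAST() across the three is shown as
-- one possibility, not the definitive answer.
SELECT c.contact_id,
       e.contact_id AS email_id,
       p.contact_id AS phone_id,
       a.contact_id AS account_id,
       LEAST(e.contact_id, p.contact_id, a.contact_id) AS master_candidate
FROM contacts c
JOIN email_report   e ON e.email        = lower(trim(c.email))
JOIN phone_report   p ON p.phone_name   = lower(trim(c.phone_name))
JOIN account_report a ON a.account_name = lower(trim(c.account_name));
```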