I am trying to find an efficient way of grouping duplicate contacts together based on any of 3 criteria: email; phone number & name; account number & name; and then creating a master_contact_id for each group. Here is an example to illustrate what I am trying to achieve:
Data looks like this:
ContactID  Name          Email              Phone&Name                Account&Name
12345      Bob Smith     [email protected]  234-243-2432Bob Smith     A1234Bob Smith
42023      Bob Smith     [email protected]  234-243-2432Bob Smith     B1234Bob Smith
50203      Bob S.        [email protected]  234-243-2432Bob S.       Z1234Bob S.
20394      Clara Sakshi  [email protected]  123-123-1234Clara Sakshi  Q1231Clara Sakshi
29930      Clara Sakshi  [email protected]  234-243-2432Clara Sakshi  A1234Clara Sakshi
92303      Clara Sakshi  [email protected]  999-999-1234Clara Sakshi  Q1231Clara Sakshi
Desired output:
Master ContactID  ContactID  Notes (not part of output)
1                 12345      related to 50203 by email match
1                 42023      related to 12345 by name and number match
1                 50203      related to 12345 by email match
2                 20394      related to 92303 by account number and name match
3                 29930      not related to any other contacts
2                 92303      related to 20394 by account number and name match
I have achieved the desired output by unpivoting the contacts table in SQL and then applying the graph walking technique described here: Grouping 'groups' with common element
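For reference, the unpivot step looked roughly like this (a simplified sketch; the table and column names are illustrative):

```sql
-- Every matching key becomes one (contact_id, match_key) row, so that
-- contacts sharing any one key end up sharing a match_key value.
SELECT contact_id, email        AS match_key FROM contacts
UNION ALL
SELECT contact_id, phone_name   AS match_key FROM contacts
UNION ALL
SELECT contact_id, account_name AS match_key FROM contacts;
```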
Unfortunately, the runtime of this solution is not viable: it took almost an hour to run on a sample of 1,000 records, the runtime increases exponentially as the data set grows, and I have over 250,000 contacts.
Any insights as to how to achieve this more efficiently (either in SQL or Python) would be greatly appreciated.
Please take note that I am essentially a beginner in SQL, and I have just started dabbling in Python in the hope of finding an alternative solution.
Thank you
Luc
You wish to dedup records, efficiently.
Let's start with Email. Create an email_report table with a PK of Email and a UNIQUE KEY of ContactID. You can produce it with a simple GROUP BY; take MIN(ContactID) so we prefer the first ContactID added to the dataset rather than subsequent dups. Any "unique constraint violated" messages that crop up here correspond to data-cleaning issues, and it's probably best to just discard that handful of records. Now each Email in the dataset maps to exactly one canonical ContactID, 1:1.
Repeat for Phone&Name, and again for Account&Name, giving us a base table plus three report tables. (We haven't set up FK relations, but we certainly could.)
You showed example output with, roughly, SELECT mail, phone, acct. We are now in a position to produce those same three columns via JOINs, expressed in the common term of ContactID. Then it's a business rule, which you haven't touched on, for what to do in the ambiguous case where the three ContactIDs don't match. Maybe 2-out-of-3 majority wins on the relevant rows? Maybe every distinct tuple spins up a new distinct MasterContactID? Perhaps we simply compute MIN( ... ) across the 3 IDs? Up to you. Given the indexing, computing such results across a quarter-million rows will go very fast. Read those rows into a Python script and apply the business logic you prefer.
Normalizing all text values beforehand probably wouldn't hurt. Lowercasing is straightforward. Maybe delete umlauts, or normalize each to one of the three corresponding vowels. Turn punctuation like "-" and "." into a space, then compress repeated spaces down to a single one. There's a whole range of phonetic algorithms that Python could apply, including Metaphone, and Postgres can even apply Soundex within a SELECT.
The SQL should resemble something like this (or use a CTE, though debugging would be less convenient). This is a sketch; the table and column names, contacts(contact_id, name, email, phone_name, account_name), are assumptions to adjust for your schema:
The same pattern covers phone and account; then report on them. LEAST() below stands in for whichever business rule you choose:
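```sql
-- Same pattern for the other two keys (names, again, are assumptions):
CREATE TABLE phone_report (
    phone_name TEXT    PRIMARY KEY,
    contact_id INTEGER UNIQUE
);
INSERT INTO phone_report (phone_name, contact_id)
SELECT lower(trim(phone_name)), MIN(contact_id)
FROM contacts
GROUP BY lower(trim(phone_name));

CREATE TABLE account_report (
    account_name TEXT    PRIMARY KEY,
    contact_id   INTEGER UNIQUE
);
INSERT INTO account_report (account_name, contact_id)
SELECT lower(trim(account_name)), MIN(contact_id)
FROM contacts
GROUP BY lower(trim(account_name));

-- Each contact alongside its three canonical IDs, ready for whatever
-- business rule you settle on; LEAST() across the three is shown as
-- one possibility, not the definitive answer.
SELECT c.contact_id,
       e.contact_id AS email_id,
       p.contact_id AS phone_id,
       a.contact_id AS account_id,
       LEAST(e.contact_id, p.contact_id, a.contact_id) AS master_candidate
FROM contacts c
JOIN email_report   e ON e.email        = lower(trim(c.email))
JOIN phone_report   p ON p.phone_name   = lower(trim(c.phone_name))
JOIN account_report a ON a.account_name = lower(trim(c.account_name));
```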