I have documents in CouchDB. The schema looks like this:
userId
email
personal_blog_url
telephone
I assume two users are actually the same person if any one of the following is identical:
- email, or
- personal_blog_url, or
- telephone
I have created 3 views, which basically map email/blog_url/telephone to userIds and then combine the userIds into a group under the same key, e.g.,
_view/by_email:
----------------------------------
key values
[email protected] [123, 345]
[email protected] [23, 45, 333]
_view/by_blog_url:
----------------------------------
key values
http://myblog.com [23, 45]
http://mysite.com/ss [2, 123, 345]
_view/by_telephone:
----------------------------------
key values
232-932-9088 [2, 123]
000-111-9999 [45, 1234]
999-999-0000 [1]
My questions:
- How can I merge the results from the 3 different views into a final user table/view which contains no duplicates?
- Is it good practice to do such deduplication in CouchDB at all?
- If not, what would be a good way to do deduplication in CouchDB?
P.S. In the final view, suppose that for all dupes we keep only the smallest userId.
Thanks.
Good question. Perhaps you could listen to _changes and search for the fields you want to be unique for the real user in the views you suggested (by_*).
Merge the views into one (emit different fields in one map):
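function (doc) {
  // Emit one row per field that is present, so a document missing one field
  // is still indexed by the others. The numeric prefix keeps the three key
  // spaces (email, blog URL, telephone) apart inside the single view.
  if (doc.email) emit([1, doc.email], [doc._id]);
  if (doc.personal_blog_url) emit([2, doc.personal_blog_url], [doc._id]);
  if (doc.telephone) emit([3, doc.telephone], [doc._id]);
}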
Merge the lists of ids in reduce.
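A reduce along these lines might look like the sketch below. The name mergeIds is only there so the function can be called outside CouchDB; in a design document it would be stored as an anonymous function. It assumes every value is an array of ids, which holds both for the rows emitted by the map above and for earlier outputs of the reduce itself. Keep in mind that CouchDB's reduce_limit setting may reject reduce functions whose output grows with the input, so this is only reasonable when each key has a handful of ids.

```javascript
// Hypothetical reduce: merge the id lists emitted under one key into a
// single list without duplicates.
function mergeIds(keys, values, rereduce) {
  // On the first pass, values are the [doc._id] arrays from the map.
  // On rereduce, values are arrays returned by earlier calls.
  // Both cases are arrays of arrays, so they are handled the same way.
  var merged = [];
  values.forEach(function (ids) {
    ids.forEach(function (id) {
      if (merged.indexOf(id) === -1) {
        merged.push(id);
      }
    });
  });
  return merged;
}
```

Querying the view with group=true then returns one merged id list per [field, value] key.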
When a document changes, query the view with keys=[[1, email], [2, personal_blog_url], ...] and merge the three lists. If the minimal id in the merged list is smaller than the changed doc's id, update the changed doc's realId field; otherwise update the documents in the list with the changed doc's id.
I suggest using a different document to store the { userId, realId } relation.
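The decision step described above can be sketched as two small functions. Both names (lookupKeys, resolveRealId) and the returned shape are made up for illustration, and the sketch assumes ids are numbers that compare with <, as in the question's example tables. It does not fetch or write anything; it only works out which { userId, realId } relations would need to be stored.

```javascript
// Hypothetical helper: build the view keys for a changed document, using
// the same numeric prefixes as the merged map function.
function lookupKeys(doc) {
  var keys = [];
  if (doc.email) keys.push([1, doc.email]);
  if (doc.personal_blog_url) keys.push([2, doc.personal_blog_url]);
  if (doc.telephone) keys.push([3, doc.telephone]);
  return keys;
}

// Hypothetical helper: given the changed doc's id and the id lists the
// view returned for its keys, decide the realId and which docs to update.
function resolveRealId(changedId, idLists) {
  // Collect every other id that shares at least one field with the doc.
  var others = [];
  idLists.forEach(function (ids) {
    ids.forEach(function (id) {
      if (id !== changedId && others.indexOf(id) === -1) {
        others.push(id);
      }
    });
  });

  // No duplicates found: the doc is its own real user.
  if (others.length === 0) {
    return { realId: changedId, update: [changedId] };
  }

  var minOther = others.reduce(function (a, b) {
    return a < b ? a : b;
  });

  // A smaller id already exists: only the changed doc points to it.
  if (minOther < changedId) {
    return { realId: minOther, update: [changedId] };
  }

  // The changed doc has the smallest id: the others should point to it.
  return { realId: changedId, update: others };
}
```

For example, with the tables from the question, user 123 shares an email with 345 and a telephone with 2, so resolveRealId(123, [[123, 345], [2, 123, 345], [2, 123]]) yields realId 2 and only document 123 needs updating. Note that this looks one step out only; chains of duplicates that connect through a third user would need the lookup repeated.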