How to use fuzzy join to find elements with similar strings, also on the condition that certain columns match


I have a dataset with a column of buyers (each has an ID number) and columns for their characteristics, for example gender and occupation. I also have a column giving the reason they unsubscribed, which is a string variable.

I suspect that some individuals have been entered more than once, with slightly different 'reason for leaving' text, and I need to locate these.

I am new to R, but I think I need the fuzzyjoin package to find similar 'reason for leaving' entries. Gender and occupation will be the same for a duplicate, so each match must be conditional on gender and occupation matching exactly, as well as on the reasons for leaving being similar under fuzzy matching.

Here is some example data:

Data1 = data.frame(ID = c("1234", "2345", "3456", "1234", "5678"),
                   Gender = c("M", "F", "F", "M", "M"),
                   Occupation = c("Doctor", "Teacher", "Lawyer", "Doctor", "Athlete"),
                   Reason = c("I have no time to do this anymore",
                              "I'd rather not say",
                              "For financial and family reasons",
                              "I don't have time to do it anymore",
                              "I didn't enjoy the product"))

It should look like this:

  ID   Gender Occupation Reason for leaving
1 1234 M      Doctor     I have no time to do this anymore
2 2345 F      Teacher    I'd rather not say
3 3456 F      Lawyer     For financial and family reasons
4 1234 M      Doctor     I don't have time to do it anymore
5 5678 M      Athlete    I didn't enjoy the product

As you can see, rows 1 and 4 are the same person, entered twice with slightly different reasons for leaving. I want to find these cases in my dataset, view them, choose which one to keep, and delete the other. To count as the same individual, two rows must have the same ID, gender and occupation, and a similar reason for leaving based on the fuzzy matching.
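To make the matching rule concrete, here is a minimal sketch in base R, not a definitive solution. It swaps the fuzzy join for a plain self-merge on the exact columns followed by an edit-distance filter using adist() from the utils package. The cutoff max_rel_dist and the helper column row are hypothetical names chosen for illustration, and 0.4 is an arbitrary threshold that would need tuning on real data.

```r
# Example data from the question (double quotes so apostrophes are safe)
Data1 <- data.frame(
  ID         = c("1234", "2345", "3456", "1234", "5678"),
  Gender     = c("M", "F", "F", "M", "M"),
  Occupation = c("Doctor", "Teacher", "Lawyer", "Doctor", "Athlete"),
  Reason     = c("I have no time to do this anymore",
                 "I'd rather not say",
                 "For financial and family reasons",
                 "I don't have time to do it anymore",
                 "I didn't enjoy the product"),
  stringsAsFactors = FALSE
)

# Assumption: largest edit distance, relative to the longer string,
# that still counts as "similar" (0 = identical)
max_rel_dist <- 0.4

# Remember the original row positions so matches can be traced back
Data1$row <- seq_len(nrow(Data1))

# Self-join on the columns that must match exactly
pairs <- merge(Data1, Data1,
               by = c("ID", "Gender", "Occupation"),
               suffixes = c(".x", ".y"))

# Keep each unordered pair once and drop rows paired with themselves
pairs <- pairs[pairs$row.x < pairs$row.y, ]

# Levenshtein distance between the two reasons of each pair
edit_dist <- vapply(seq_len(nrow(pairs)), function(i) {
  as.numeric(adist(pairs$Reason.x[i], pairs$Reason.y[i],
                   ignore.case = TRUE)[1, 1])
}, numeric(1))

# Scale by the longer string so long and short reasons are comparable
pairs$rel_dist <- edit_dist /
  pmax(nchar(pairs$Reason.x), nchar(pairs$Reason.y))

# Candidate duplicates to inspect by hand
candidates <- pairs[pairs$rel_dist <= max_rel_dist, ]
print(candidates[, c("ID", "row.x", "row.y", "Reason.x", "Reason.y", "rel_dist")])
```

After reviewing candidates, the unwanted rows could be dropped by position, for example Data1[-candidates$row.y, ] to keep the first entry of each pair. The same rule could probably also be written with the fuzzyjoin package, but the base R version avoids extra dependencies and keeps every step visible.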

I'd be very grateful for any suggestions!
