Compare two columns (with merged phone numbers) if any phone number from first column exists in the second column

63 Views Asked by At

I need to compare two columns which are in resulting data frame and those two columns are coming from a separate sources.

Now, I would like to compare them and have a resulting (tag) column based on their match.(YES or NO)

There are e_phone and Phones.

(ex. e_phone="881234567" and Phones="881234567; 665553331;")

They both can have multiple phone numbers separated by ";". Sometimes e_phone can have numbers like e_phone="0992767312; +550661234567". (which are more then 9 digit length)

  • I need to compare last nine digits of every of available phone number from e_phone in the Phones.

I already cleared Phones column because it comes from the sql, but I am new at the R so I really appreciate anyone's help.

At the end I would like to have add to the resulting data frame just a column called e_phone_match with values "YES" (if any number from e_phone exists in Phones) and "NO" if there isn't any match.

something like

Phones e_phone e_phone_match
881234567; 665553331; 881234567 YES
2

There are 2 best solutions below

4
Julian On

Using separate_rows():

    library(tidyr)
    library(dplyr)
    library(stringr)
    df <- tibble(
      Phones = c("728125443; 967720806;"),
      e_phone = c("524545312; +882728125443")
    )
    
    df |> 
      separate_rows(c(Phones), sep = ";", convert = TRUE) |> 
      filter(!is.na(Phones)) |> 
      separate_rows(c(e_phone), sep = ";", convert = TRUE) |> 
      mutate(across(c(Phones, e_phone), ~str_sub(as.character(.),-9)),
             match = Phones == e_phone)
      

Output:

# A tibble: 4 × 3
  Phones    e_phone   match
  <chr>     <chr>     <lgl>
1 728125443 524545312 FALSE
2 728125443 728125443 TRUE 
3 967720806 524545312 FALSE
4 967720806 728125443 FALSE
0
sebekkg On

Thank you Julian for you help it gave me idea on how to proceed.

here is how I managed it

# Split the e_phone column into a list of vectors
e_phone_split <- strsplit(as.character(res$e_phone), ";")

# Clean and process phone numbers
clean_phone_number <- function(phone) {
  if (!is.na(phone) && phone != "") {
    cleaned <- gsub("[^0-9]", "", phone)  # Remove non-numeric characters
    if (nchar(cleaned) >= 9 && nchar(cleaned) <= 12) {
      cleaned <- substr(cleaned, nchar(cleaned) - 8, nchar(cleaned))  # Take the last nine figures
    } else {
      cleaned <- ""  # Empty string for invalid numbers
    }
    return(cleaned)
  } else {
    return("")
  }
}

# Apply the cleaning function to each element of the list
e_phone_cleaned <- lapply(e_phone_split, function(phone_list) sapply(phone_list, clean_phone_number))

# Create a new column 'e_phone_cleaned' in the original data frame
res$e_phone_cleaned <- sapply(e_phone_cleaned, function(phone_list) paste(phone_list, collapse = ";"))

# Remove ";" from the Phones column for easier comparison
res$Phones <- gsub(";", "", as.character(res$Phones))

# Initialize an empty vector to store match results
match_results <- numeric(nrow(res))

# Loop through each row and compare cleaned phone numbers
for (i in 1:nrow(res)) {
  if (is.na(res$e_phone[i]) || is.na(res$e_phone_cleaned[i]) || (res$Phones[i] == "" && res$e_phone_cleaned[i] == "")) {
    match_results[i] <- 0  # Don't match for NA or both empty
  } else {
    match_results[i] <- grepl(res$e_phone_cleaned[i], res$Phones[i], ignore.case = TRUE)
  }
}

# Add the match results to the data frame
res$enq_match_nr <- as.numeric(match_results)

# Add a column e_phoneone_match based on enq_match_nr
res$e_phoneone_match <- ifelse(res$enq_match_nr == 1, "YES", "NO")