Removing specific text in R


I have a character vector in a data frame in R that contains inbound email text. Most rows begin with 'Dear x,' where x is the intended recipient and varies from row to row. There may also be typos, such as a lowercase 'dear'. Either way, the common feature is that each greeting starts with the word 'dear' (upper or lowercase) and ends with a comma.


df <- data.frame(emails = c("Dear dave, I have seen what you...", "Dear Mr Smith, I recieved your reply...", "dear stu, I note that you have not..."),
                 account = c(534, 434, 544)
)

df

                                   emails account
1      Dear dave, I have seen what you...     534
2 Dear Mr Smith, I recieved your reply...     434
3   dear stu, I note that you have not...     544

I am looking to trim off the greeting so that each email starts with the main body of text, like this:

                       emails account
1     I have seen what you...     534
2    I recieved your reply...     434
3 I note that you have not...     544

There are 3 best solutions below

akrun (best answer)

Using trimws in base R

df$emails <-  trimws(df$emails, whitespace = "[Dd]ear[^,]+,\\s+")

-output

df$emails
[1] "I have seen what you..."     "I recieved your reply..."    "I note that you have not..."
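
Note that trimws() documents its whitespace argument as a regular expression for a single whitespace character, so passing a whole greeting pattern is a somewhat off-label use. A minimal sketch, assuming R >= 3.6.0 (where the argument was introduced) and using a standalone vector named emails purely for illustration, restricts the trimming to the start of each string with which = "left":

```r
# Illustrative copy of the question's data (the vector name is an assumption)
emails <- c("Dear dave, I have seen what you...",
            "Dear Mr Smith, I recieved your reply...",
            "dear stu, I note that you have not...")

# which = "left" only strips a match at the start of each string,
# so text at the end of an email is never touched
trimmed <- trimws(emails, which = "left", whitespace = "[Dd]ear[^,]+,\\s+")
print(trimmed)
```

On the sample data this should give the same three strings as the call above; the only difference is that the right-hand side of each string is left alone.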

Tim Biegeleisen

We can use sub() here:

df$emails <- sub("^[Dd]ear(?: \\S+)+,\\s*", "", df$emails)
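
If an email body may itself contain commas, a variant worth considering (not the pattern from this answer) is a negated character class, which cannot run past the first comma. The sketch below assumes that any casing of "dear" should be stripped, hence ignore.case = TRUE; the vector name emails and the extra comma in the third string are illustrative only.

```r
# Sample data with a second comma in the third email (illustrative)
emails <- c("Dear dave, I have seen what you...",
            "Dear Mr Smith, I recieved your reply...",
            "dear stu, I note, that you have not...")

# ^       anchor at the start of the string
# dear    the greeting, matched case-insensitively via ignore.case
# [^,]*   everything up to, but not including, the first comma
# ,\\s*   the comma itself and any whitespace after it
cleaned <- sub("^dear[^,]*,\\s*", "", emails, ignore.case = TRUE)
print(cleaned)
```

Because [^,]* can never match a comma, the match always ends at the first comma, so the ", that you have not..." part of the third email is kept.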

Carl

In case you'd like a tidyverse / stringr option:

The ? after .* makes the match lazy, so it stops at the first comma followed by a space (see the third row, which contains a second comma).

library(tidyverse)

tribble(
  ~emails, ~account,
  "Dear dave, I have seen what you...", 534,
  "Dear Mr Smith, I recieved your reply...", 434,
  "dear stu, I note, that you have not...", 544
) |> 
  mutate(emails = str_remove(emails, "[Dd]ear.*?, "))
#> # A tibble: 3 × 2
#>   emails                       account
#>   <chr>                          <dbl>
#> 1 I have seen what you...          534
#> 2 I recieved your reply...         434
#> 3 I note, that you have not...     544

Created on 2022-12-26 with reprex v2.0.2
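
To see what the lazy quantifier in the pattern above changes, here is a small side-by-side sketch. It uses base R's sub() with perl = TRUE rather than stringr, as an assumption to keep the example dependency-free; the variable names are illustrative.

```r
x <- "dear stu, I note, that you have not..."

# Greedy: .* grabs as much as possible, so the match runs to the LAST ", "
greedy <- sub("[Dd]ear.*, ", "", x, perl = TRUE)

# Lazy: .*? grabs as little as possible, so the match ends at the FIRST ", "
lazy <- sub("[Dd]ear.*?, ", "", x, perl = TRUE)

print(greedy)  # "that you have not..."
print(lazy)    # "I note, that you have not..."
```

The greedy version also removes "I note, " from the body, which is why the lazy form is the safer choice when the email text may contain further commas.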