Cleaning text data that has extra whitespace between the letters of a word


Using R, I read text from PDFs, and some words came through with spaces inside them. I can't find a way to clean them up.

For example, I need to turn "Then I cre ated a c h a r t using Excel" into "Then I created a chart using Excel"

The problem is that text analysis tools first split text into tokens (one per row) wherever spaces occur, on the assumption that a space marks the start of a new word.

library(tidytext)
library(dplyr)

## Example dataframe of text strings with erroneous spaces within words
txt <- structure(list(id = 1:3, 
                      data = c("Then I cre ated a c h a r t using Excel", 
                               "other p r inc ip le is that when a person", 
                               "M R . C O O K S O N : Mr . Speaker , on behal f of")), 
                 class = "data.frame", 
                 row.names = c(NA, -3L))


## Unnest text data so each word becomes a separate row in the dataframe
result <- tidytext::unnest_tokens(tbl = txt,
                                  output = text,
                                  input = data,
                                  token = "words",
                                  to_lower = FALSE)

print(result)

   id    text
1   1    Then
2   1       I
3   1     cre
4   1    ated
5   1       a
6   1       c
7   1       h
8   1       a
9   1       r
10  1       t
11  1   using
12  1   Excel
...

Is there a way to collapse words that have extra whitespace inside them without collapsing the whitespace that separates words? All I can think of is some kind of function that would:

  • find occurrences of single characters with whitespace on either side (besides "a" or "I")

  • remove whitespace from either side of the character, but not if that space is between the character and a whole word (i.e., a string of two or more characters)

Maybe something like: if you find a letter besides "a" or "I" with a space on both sides, which is followed by a single letter that is not "a" or "I" followed by a space, then remove that first space?
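A minimal base R sketch of that rule, as a starting point rather than a full solution. The function name `collapse_spaced_letters` and its `keep` argument are hypothetical names introduced here. It assumes tokens are separated by spaces, merges each run of consecutive single-letter tokens, and only lets a run begin on a letter that is not "a", "A" or "I". It cannot repair multi-letter fragments such as "cre ated", which would need a dictionary lookup.

```r
# Hypothetical helper: merge runs of single-letter tokens into one word.
# A run may only START on a letter that is not in `keep`, so a leading
# article "a" or pronoun "I" is left alone; letters inside a run are merged.
collapse_spaced_letters <- function(x, keep = c("a", "A", "I")) {
  collapse_one <- function(s) {
    tokens <- strsplit(s, " +")[[1]]
    is_letter <- grepl("^[A-Za-z]$", tokens)
    out <- character(0)
    n <- length(tokens)
    i <- 1
    while (i <= n) {
      if (is_letter[i] && !(tokens[i] %in% keep)) {
        # Extend the run while the next token is also a single letter
        j <- i
        while (j < n && is_letter[j + 1]) j <- j + 1
        out <- c(out, paste(tokens[i:j], collapse = ""))
        i <- j + 1
      } else {
        out <- c(out, tokens[i])
        i <- i + 1
      }
    }
    paste(out, collapse = " ")
  }
  vapply(x, collapse_one, character(1), USE.NAMES = FALSE)
}

examples <- c("Then I cre ated a c h a r t using Excel",
              "M R . C O O K S O N : Mr . Speaker , on behal f of")
collapse_spaced_letters(examples)
# [1] "Then I cre ated a chart using Excel"
# [2] "MR . COOKSON : Mr . Speaker , on behal f of"
```

The spaced-out "c h a r t" and "C O O K S O N" are repaired, but "cre ated" and "behal f" are left untouched because one side of the gap is longer than one letter. A run that happens to end in a genuine "a" or "I" would also be swallowed, so the output still needs checking.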

I tried:

txt <- structure(list(id = 1:3, 
                      data = c("Then I cre ated a c h a r t using Excel", 
                               "other p r inc ip le is that when a person", 
                               "M R . C O O K S O N : Mr . Speaker , on behal f of")), 
                 class = "data.frame", 
                 row.names = c(NA, -3L))

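## Attempt 1: remove a space that follows a letter and precedes a single-letter token
df1 <- txt %>%
  mutate(revised = gsub("([A-Za-z])\\s(?=[A-Za-z]\\b)", "\\1", data, perl = TRUE))

## Attempt 2: remove a space only when it sits between two single-character tokens
df2 <- txt %>%
  mutate(revised = gsub("(?<=\\b\\w)\\s(?=\\w\\b)", "", data, perl = TRUE))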