Cleaning text data that has extra whitespaces between letters of a word

Asked by Nick Yarmey on 27 March 2024 at 21:02

Using R, I have read text from PDFs, and some words were read in with spaces inside the word. I can't find any way to clean this up.

For example, I need to turn "Then I cre ated a c h a r t using Excel" into "Then I created a chart using Excel"

The problem is that text analysis tools first split text into tokens (one per row) wherever a space occurs, on the assumption that a space marks the start of a new word.

library(tidytext)
library(dplyr)

## Example dataframe of text strings with erroneous spaces within words
txt <- structure(list(id = 1:3, 
                      data = c("Then I cre ated a c h a r t using Excel", 
                               "other p r inc ip le is that when a person", 
                               "M R . C O O K S O N : Mr . Speaker , on behal f of")), 
                 class = "data.frame", 
                 row.names = c(NA, -3L))


## Unnest text data so each word becomes a separate row in the dataframe
result <- tidytext::unnest_tokens(tbl = txt,
                                  output = text,
                                  input = data,
                                  token = "words",
                                  to_lower = FALSE)

print(result)

   id    text
1   1    Then
2   1       I
3   1     cre
4   1    ated
5   1       a
6   1       c
7   1       h
8   1       a
9   1       r
10  1       t
11  1   using
12  1   Excel
...

Is there a way to collapse the extra whitespace inside words without collapsing the whitespace that occurs between words? All I can think of is some kind of function that would:

1. Find occurrences of single characters with whitespace on either side (besides "a" or "I").
2. Remove the whitespace on either side of the character, but not if that space sits between the character and a whole word (i.e., a string of two or more characters).

Maybe something like: if a letter other than "a" or "I" has a space on both sides and is followed by another single letter (other than "a" or "I") and a space, then remove the space between the two letters?

I tried:

txt <- structure(list(id = 1:3, 
                      data = c("Then I cre ated a c h a r t using Excel", 
                               "other p r inc ip le is that when a person", 
                               "M R . C O O K S O N : Mr . Speaker , on behal f of")), 
                 class = "data.frame", 
                 row.names = c(NA, -3L))

df1 <- txt %>%
  mutate(revised = gsub("([A-Za-z])\\s(?=[A-Za-z]\\b)", "\\1", data, perl = TRUE))

df2 <- txt %>%
  mutate(revised = gsub("(?<=\\b\\w)\\s(?=\\w\\b)", "", data, perl = TRUE))
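One possible way to encode the "a"/"I" rule described above is a lookbehind with two alternatives. This is only a sketch: the function name collapse_spaced_letters is hypothetical, and the pattern assumes ASCII letters and that a standalone "a", "A", or "I" at the start of a run is a real word. A space is dropped when the next token is a single letter and the previous token is either a single letter other than a/A/I, or an a/A/I that itself follows another single letter (so the second "a" in "c h a r t" is still joined).

```r
# Sketch only: collapse_spaced_letters is a hypothetical helper, not part of any package.
# Assumes ASCII letters; a leading single "a", "A", or "I" is treated as a real word.
collapse_spaced_letters <- function(x) {
  pattern <- paste0(
    "(?<=\\b[b-zB-HJ-Z]",        # previous token: one letter, not a/A/I
    "|\\b[A-Za-z]\\s[aAI])",     # or a/A/I that follows another single letter
    "\\s",                       # the space to remove
    "(?=[A-Za-z]\\b)"            # next token: exactly one letter
  )
  gsub(pattern, "", x, perl = TRUE)
}

examples <- c("Then I cre ated a c h a r t using Excel",
              "other p r inc ip le is that when a person",
              "M R . C O O K S O N : Mr . Speaker , on behal f of")

revised <- collapse_spaced_letters(examples)
print(revised)
# [1] "Then I cre ated a chart using Excel"
# [2] "other pr inc ip le is that when a person"
# [3] "MR . COOKSON : Mr . Speaker , on behal f of"
```

This only handles runs of single letters. Multi-letter fragments such as "cre ated", "inc ip le", or "behal f" are left alone, since telling them apart from real word boundaries would probably need a dictionary lookup rather than a regex. The heuristic can also misfire on text where a genuine single-letter token is followed by "a" or "I" (for example "plan B a day" would become "plan Ba day").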
