Extract First Word of a Paragraph in R

I'm trying to remove apostrophes from a Corpus, but only when they are the first character in a paragraph. I have seen posts about finding the first word in a sentence, but not a paragraph.

I'm trying this because I'm analyzing text: I want to strip all punctuation but keep apostrophes and dashes that fall in the middle of words. To start, I did:

library(tm)
library(qdap)
#docs is any corpus
docs.test=tm_map(docs, PlainTextDocument)
docs.test=tm_map(docs.test, content_transformer(strip), char.keep=c("'","-"))
for (j in seq_along(docs.test)) {
  # drop apostrophes that follow a space, i.e. that begin a word
  docs.test[[j]] <- gsub(" '", " ", docs.test[[j]])
}

This successfully removed all of the apostrophes except those that start on new lines. To remove those on new lines, I have tried:

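for (j in seq_along(docs.test)) {
  # try each marker that might precede a paragraph-initial apostrophe
  docs.test[[j]] <- gsub("\r'", " ", docs.test[[j]])
  docs.test[[j]] <- gsub("\n'", " ", docs.test[[j]])
  docs.test[[j]] <- gsub("<p>'", " ", docs.test[[j]])
  docs.test[[j]] <- gsub("</p>'", " ", docs.test[[j]])
}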

In general, I think it would be useful to find a way to extract the first word of a paragraph. For my specific issue, I'm trying it just as a way to get at those apostrophes. I'm currently using the packages qdap and tm, but I'm open to using others.

Any ideas?

Thank you!

Answer by Ken Benoit

You didn't supply a test example, but here is a function that keeps intra-word apostrophes and hyphens. It comes from a different package (quanteda), but as the example at the end shows, the result is easily coerced to a regular list if you need one:

require(quanteda)

txt <- c(d1 = "\"This\" is quoted.",
         d2 = "quanteda's hypen-words.",
         d3 = "Example: 'single' quotes.",
         d4 = "Possessive plurals' usage.")

# newer quanteda releases spell these arguments remove_punct and split_hyphens
(toks <- tokens(txt, removePunct = TRUE, removeHyphens = FALSE))
## tokens from 4 documents.
## d1 :
## [1] "This"   "is"     "quoted"
##
## d2 :
## [1] "quanteda's"  "hypen-words"
## 
## d3 :
## [1] "Example" "single"  "quotes" 
##
## d4 :
## [1] "Possessive" "plurals"    "usage"  
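For the specific case in the question (apostrophes at the very start of a paragraph, plus the first word of each paragraph), base R regular expressions may already be enough. The following is only a minimal sketch, assuming paragraphs are separated by line breaks; `txt_par`, `strip_leading_apostrophes()` and `first_word()` are hypothetical names made up for illustration, not part of tm, qdap or quanteda.

```r
# Hypothetical input: one string holding three paragraphs separated by newlines.
txt_par <- "'Twas late.\n'Don't stop,' she said.\nWell-known words stay."

# (?m) switches on multiline mode, so ^ matches at the start of every line,
# not only at the start of the whole string. Requires perl = TRUE.
strip_leading_apostrophes <- function(x) {
  gsub("(?m)^'+", "", x, perl = TRUE)
}

# Return the first word of each paragraph, keeping apostrophes and hyphens
# only when they sit between alphanumeric characters.
# Note: regmatches() silently drops paragraphs that contain no word at all.
first_word <- function(paragraphs) {
  m <- regexpr("[[:alnum:]]+(['-][[:alnum:]]+)*", paragraphs)
  regmatches(paragraphs, m)
}

cleaned <- strip_leading_apostrophes(txt_par)

# Split on runs of carriage returns / newlines to get one element per paragraph.
paragraphs <- strsplit(txt_par, "[\r\n]+")[[1]]
first_word(paragraphs)
# [1] "Twas"       "Don't"      "Well-known"
```

On a tm corpus the same pattern could presumably be applied through `tm_map()` with `content_transformer()`, as long as each document's content is a single string with embedded line breaks; if the content is stored as one element per line, a plain `^'+` without `(?m)` would do.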

You can convert the tokens to a plain list this way, and of course get back to documents if you need to by sapply()ing paste(x, collapse = " ") over that list, etc.

as.list(toks)
## $d1
## [1] "This"   "is"     "quoted"
## 
## $d2
## [1] "quanteda's"  "hypen-words"
## 
## $d3
## [1] "Example" "single"  "quotes" 
## 
## $d4
## [1] "Possessive" "plurals"    "usage"
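The step back from a token list to one string per document, mentioned above, can be sketched as follows. To keep the snippet runnable without quanteda, `tok_list` is a hand-typed stand-in for `as.list(toks)`, copied from the output above; `docs_clean` is likewise just an illustrative name.

```r
# Stand-in for as.list(toks): a named list of character vectors.
tok_list <- list(
  d1 = c("This", "is", "quoted"),
  d2 = c("quanteda's", "hypen-words"),
  d3 = c("Example", "single", "quotes"),
  d4 = c("Possessive", "plurals", "usage")
)

# Paste each document's tokens back together with single spaces.
# sapply() simplifies the result to a named character vector.
docs_clean <- sapply(tok_list, paste, collapse = " ")
docs_clean[["d2"]]
# [1] "quanteda's hypen-words"
```

The original spacing and punctuation are not recoverable this way, so this is only useful when the cleaned text itself is what the downstream analysis needs.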