I'm trying to remove apostrophes from a Corpus, but only when they are the first character in a paragraph. I have seen posts about finding the first word in a sentence, but not a paragraph.
The reason I'm trying this is because I'm analyzing text. I want to strip all the punctuation, but leave apostrophes and dashes only in the middle of words. To start this, I did:
library(tm)
library(qdap)
#docs is any corpus
docs.test=tm_map(docs, PlainTextDocument)
docs.test=tm_map(docs.test, content_transformer(strip), char.keep=c("'","-"))
for(j in seq(docs.test))
{
docs[[j]] <- gsub(" \'", " ", docs[[j]])
}
This successfully removed all of the apostrophes except those that start on new lines. To remove those on new lines, I have tried:
for(j in seq(docs.test))
{
docs[[j]] <- gsub("\r\'", " ", docs[[j]])
docs[[j]] <- gsub("\n\'", " ", docs[[j]])
docs[[j]] <- gsub("<p>\'", " ", docs[[j]])
docs[[j]] <- gsub("</p>\'", " ", docs[[j]])
}
In general, I think it would be useful to find a way to extract the first word of a paragraph. For my specific issue, I'm trying it just as a way to get at those apostrophes. I'm currently using the packages qdap and tm, but open to using more.
Any ideas?
Thank you!
You didn't supply a test example, but here is a function that keeps intra-word apostrophes and hyphens. It's in a different package, but as the example at the end shows, is easily coerced to a regular list if you need it to be:
You can get back to a list this way, and of course back to documents if you need to be by
sapply()ing apaste(x, collapse = " "), etc.