Removing a tweet/row if it contains any non-English word


I want to remove an entire tweet (row) from a data frame if it contains any non-English word. My data frame looks like this:

     text
1  | morning why didnt i go to sleep earlier oh well im seEING DNP TODAY!!  
     JIP UHH <f0><U+009F><U+0092><U+0096><f0><U+009F><U+0092><U+0096>

2  | @natefrancis00 @SimplyAJ10 <f0><U+009F><U+0098><U+0086><f0><U+009F 
     <U+0086> if only Alan had a Twitter hahaha

3  | @pchirsch23 @The_0nceler @livetennis Whoa whoa let’s not take this too 
     far now
4  | @pchirsch23 @The_0nceler @livetennis Well Pat that’s just not true
5  | One word #Shame on you! #Ji allowing looters to become president

The expected dataframe should be like this:

 text
3  | @pchirsch23 @The_0nceler @livetennis Whoa whoa let’s not take this too 
     far now
4  | @pchirsch23 @The_0nceler @livetennis Well Pat that’s just not true
5  | One word #Shame on you! #Ji allowing looters to become president


Answer by Mankind_2000 (best answer)

You want to keep alphanumeric characters along with some punctuation such as @ and !.
If the unwanted content in your column consists mainly of escaped Unicode such as <f0> or <U+009F>, then this should do:

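For a data frame df with a text column, using grep:

# "<[^>]+>" matches escapes such as <f0> or <U+009F>; invert = TRUE keeps the tweets that do not match
new_str <- grep(pattern = "<[^>]+>", x = df$text, value = TRUE, invert = TRUE)
new_str[new_str != ""]  # drop empty strings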

To put the result back into the original text column, flag the rows that contain an escape and set them to NA:

# grepl returns one TRUE/FALSE per row, which avoids the empty-index pitfall of negative indexing
has_escape <- grepl("<[^>]+>", df$text)
df$text[has_escape] <- NA
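
If the goal is to drop the rows entirely rather than blank them, as in the expected output in the question, a similar match can serve as a row filter. This is only a minimal sketch: it assumes the escapes appear literally in the text as <f0> or <U+009F>, and the sample data frame and the name df_clean are made up for illustration.

```r
# Hypothetical sample data; the real df would come from your tweet import.
df <- data.frame(
  text = c(
    "seEING DNP TODAY!! JIP UHH <f0><U+009F><U+0092><U+0096>",
    "@pchirsch23 Well Pat that's just not true",
    "One word #Shame on you! #Ji allowing looters to become president"
  ),
  stringsAsFactors = FALSE
)

# TRUE for rows containing at least one <...> escape sequence.
has_escape <- grepl("<[^>]+>", df$text)

# Keep only the rows without escapes; drop = FALSE keeps the result a
# data frame even though there is a single column.
df_clean <- df[!has_escape, , drop = FALSE]
print(df_clean)
```

Note that this filters on escape sequences only. It does not check whether the remaining words are actually English.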

To clean the tweets instead of dropping them, you can use the gsub function. Refer to the post "cleaning tweet in R".
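
As a rough illustration of the gsub route, the sketch below strips the escape sequences and tidies the leftover whitespace. It assumes the escapes are stored literally as <f0>, <U+009F> and so on; the vector names tweets and cleaned are hypothetical.

```r
# Hypothetical input; assumes escapes are stored literally in the strings.
tweets <- c(
  "JIP UHH <f0><U+009F><U+0092><U+0096>",
  "@natefrancis00 <f0><U+009F><U+0098><U+0086> if only Alan had a Twitter hahaha"
)

# Remove every <...> escape sequence.
cleaned <- gsub("<[^>]+>", "", tweets)

# Collapse the runs of spaces left behind, then trim both ends.
cleaned <- trimws(gsub(" +", " ", cleaned))
print(cleaned)
```

This keeps every tweet and only removes the escapes, so it suits cases where the rest of the text is still worth analysing.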