I know that there are many questions out there about partial matches and I've read as many as I've been able to, but I have still not managed to extract what I need using R.
In a nutshell, my problem is that I have a data set with over a million rows of Spanish trigrams and I want to find only those that contain verbs. To make this easier, I added a column with the 500 most common verbs in Spanish so that I can try to match them to the trigrams.
I have a data set like this:
data <- tibble::tibble(trigrams = c("no veo que", "no me gusta", "si habla de", "la mesa de", "el caso que"), frequency = c(112, 345, 578, NA, NA), verb = c("hablar", "gustar", "leer", NA, NA))  # shorter columns padded with NA so all columns have the same length
The verbs in the third column ("verb") are infinitives and I would like to partially match them to the verb forms in the first column ("trigrams"). I think it would be ideal, in this case, to use a for loop to iterate through the 500 verbs that I want to partially match against my over one million trigrams.
So in this case, "gustar" should partially match "no me gusta", and nothing should match verbless trigrams like "el caso que".
I really do hope this makes sense. I have never worked with this amount of data before and I am too new to regular expressions to figure this out on my own.
I think this approach using stringr might help you. You might have to make some modifications in order to use it in a data frame. Basically, we have to convert each verb such as "hablar" into a pattern such as 'hablar*' and then do a str_extract().
Created on 2018-09-16 by the reprex package (v0.2.0).
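A minimal sketch of that idea, since the original code did not survive here. It uses the example trigrams and verbs from the question as plain vectors; the variable names (patterns, combined, verb_form, with_verbs) are my own, and collapsing the patterns into one alternation instead of looping over the verbs is an assumption on my part, not necessarily what the answer did.

```r
library(stringr)

# Example data from the question, kept as plain vectors
trigrams <- c("no veo que", "no me gusta", "si habla de", "la mesa de", "el caso que")
verbs <- c("hablar", "gustar", "leer")

# "hablar" -> "hablar*": the stem "habla" followed by zero or more "r"
patterns <- str_c(verbs, "*")

# Combine all verb patterns into one alternation so each trigram is scanned once
combined <- str_c(patterns, collapse = "|")

# First matching verb form in each trigram, or NA if no verb matches
verb_form <- str_extract(trigrams, combined)

# Keep only the trigrams that contain one of the verbs
with_verbs <- trigrams[!is.na(verb_form)]
```

On this small example, verb_form should be NA for the verbless trigrams, and with_verbs keeps "no me gusta" and "si habla de". Note that a pattern like 'hablar*' matches anywhere in the string and only catches forms that begin with the infinitive minus its final "r", so short verbs may produce false positives inside other words; wrapping the pattern in word boundaries (\\b) is one possible refinement.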