Is there a way to use relative frequency instead of absolute frequency with step_tokenfilter() in recipes?


I'm building a regression model in R using this great approach by Emil Hvitfeldt and Julia Silge (https://smltar.com/mlregression#fnref7), and I was wondering whether it is possible to use relative frequency instead of absolute frequency in the preprocessing step step_tokenfilter(). I looked into it but couldn't find such a function.

Here is my code so far, which instead uses tf-idf on the 1,000 most frequent tokens.

 library(tidymodels)
 library(textrecipes)

 data_rec <- recipe(year ~ sentence_lemma, data = data_train) %>%
   step_tokenize(sentence_lemma) %>%
   step_stopwords(sentence_lemma, custom_stopword_source = stopwords_list) %>%
   step_tokenfilter(sentence_lemma, max_tokens = 1e3) %>%
   step_tfidf(sentence_lemma) %>%
   step_normalize(all_predictors())

Thanks in advance for any help ;)


There is 1 answer below.

Answered by Malichot:

Looking carefully at the documentation and the various functions, I think I found a solution: use step_tf() (https://textrecipes.tidymodels.org/reference/step_tf.html) with the parameter weight_scheme = "term frequency". From the documentation:

"term frequency" will divide the count with the total number of words in the document to limit the effect of the document length as longer documents tends to have the word present more times but not necessarily at a higher percentage