Is there a way to use relative frequency instead of absolute frequency with step_tokenfilter() in recipes?


I'm building a regression model in R using this great approach by Emil Hvitfeldt and Julia Silge (https://smltar.com/mlregression#fnref7), and I was wondering whether it is possible to use relative frequency instead of absolute frequency in the preprocessing step step_tokenfilter(). I looked into it but couldn't find such a function.

Here is my code so far, which instead uses tf-idf on the 1,000 most frequent tokens.

 library(tidymodels)
 library(textrecipes)

 data_rec <- recipe(year ~ sentence_lemma, data = data_train) %>%
   step_tokenize(sentence_lemma) %>%
   step_stopwords(sentence_lemma, custom_stopword_source = stopwords_list) %>%
   step_tokenfilter(sentence_lemma, max_tokens = 1e3) %>%
   step_tfidf(sentence_lemma) %>%
   step_normalize(all_predictors())

Thanks in advance for any help ;)


There is 1 answer below.

Answered by Malichot:

Looking carefully at the documentation and the various functions, I think I found a solution: use step_tf() (https://textrecipes.tidymodels.org/reference/step_tf.html) with the parameter weight_scheme = "term frequency". From the documentation:

"term frequency" will divide the count with the total number of words in the document to limit the effect of the document length as longer documents tends to have the word present more times but not necessarily at a higher percentage