Incremental Inverse Document Frequency without storing past information


I compute tf-idf every day in my PySpark pipeline to evaluate how significant a keyword is in a specific document, and I use the result to generate a summary that feeds my machine learning model. The documents in the pipeline change daily, but many keywords persist across days. Storing the historical document frequency for each keyword is not practical in my setup. How can I approximate or incrementally compute the IDF score for a given keyword in this scenario?

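IDF calculation: idf(t) = log(D / |{d : t in d}|), where D is the total number of documents and |{d : t in d}| is the number of documents that contain the term t.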
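
To make the formula concrete, here is a minimal sketch in plain Python (not PySpark) that computes it for a single day's batch. The helper names `document_frequencies` and `idf`, the toy `docs` list, and the choice to return `None` for a term that appears in no document are all assumptions made for illustration; they are not part of my actual pipeline.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        # set() ensures a term is counted once per document, not once per occurrence
        df.update(set(tokens))
    return df


def idf(term, df, num_docs):
    """idf(t) = log(D / df(t)); returns None when the term is in no document."""
    count = df.get(term, 0)
    if count == 0:
        return None  # assumption: treat unseen terms as undefined rather than smoothing
    return math.log(num_docs / count)


# Toy batch: each document is a list of tokens (hypothetical data).
docs = [
    ["spark", "tfidf", "pipeline"],
    ["spark", "keyword"],
    ["keyword", "summary", "spark"],
]

df = document_frequencies(docs)
print(idf("spark", df, len(docs)))  # term appears in every document, so log(1) = 0.0
print(idf("tfidf", df, len(docs)))  # term appears in one of three documents, so log(3)
```

The sketch shows that the formula depends on only two aggregates, the document count D and the per-term document frequency, which is exactly the history I cannot afford to keep in full.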
