topic modeling and stm: findThoughts with trimmed quanteda corpus


I'm using the stm package for topic modeling. Everything works great, except for one step: when I try to validate my topics by reading the top documents for each topic, I run into difficulties with stm's findThoughts.

I'm using quanteda to pre-process my corpus, and RNewsflow to remove duplicates and near-duplicate texts with an overlap of 95%. Unfortunately, RNewsflow only takes a quanteda DTM as a valid argument (as does tm). As a result, the deduplicated DFM I use for the analysis no longer has the same number of documents as my original corpus.

Hence, I get the error:

Error in findThoughts(stmM_15_k32, texts = corp_chmedia, n = 2, topics = 6) : 
  Number of provided texts and number of documents modeled do not match

Is there an alternative to inspecting top documents in stm with this approach?

I tried removing duplicates before turning my dataframe into a corpus, but unfortunately that only works for exact duplicates. I'm working with newspaper articles, and many articles in my country are re-published across outlets with minimal changes, so it is important for me to use an overlap measure (RNewsflow uses Jaccard similarity).
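
One direction that might work is to subset the original texts to the documents that survive deduplication, matched by document name, and pass that subset to findThoughts. The sketch below is base R only and uses toy data. `align_texts`, `all_texts`, `kept_ids` and `dfm_dedup` are hypothetical names, and it assumes the DFM still carries the document names of the original corpus and that `as.character()` on the corpus returns a vector named by document (as in recent quanteda versions).

```r
# Sketch: keep only the texts whose document names survive in the
# deduplicated DFM, in the same order as the DFM rows.
align_texts <- function(all_texts, kept_ids) {
  missing_ids <- setdiff(kept_ids, names(all_texts))
  if (length(missing_ids) > 0) {
    stop("No text found for: ", paste(missing_ids, collapse = ", "))
  }
  all_texts[kept_ids]
}

# Toy stand-in data (hypothetical, not from my corpus)
all_texts <- c(
  a = "First article.",
  b = "Near-duplicate of the first article.",
  c = "Another article."
)
kept_ids <- c("a", "c")  # document names left after deduplication

texts_aligned <- align_texts(all_texts, kept_ids)
stopifnot(length(texts_aligned) == length(kept_ids))

# With the real objects this would look roughly like (not run;
# dfm_dedup is an assumed name for the deduplicated DFM):
# all_texts     <- as.character(corp_chmedia)
# kept_ids      <- quanteda::docnames(dfm_dedup)
# texts_aligned <- align_texts(all_texts, kept_ids)
# findThoughts(stmM_15_k32, texts = texts_aligned, n = 2, topics = 6)
```

If the conversion to stm's format drops further documents (for example, documents left empty after trimming), those would presumably have to be removed from `kept_ids` as well before the counts match.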
