When it comes to "short texts" in topic modelling and natural language processing, what exactly is the definition of a short text? I have not been able to find a definitive answer. Could anyone provide a clear definition of the length of a "short text" in these two areas?

I have searched many papers and have not seen anyone define "short text" clearly. I'm using Biterm for short texts, but how long can a text be and still count as a short text? I also looked at the thesis linked in a similar answer, but it only gives examples of short texts and does not define the term. I checked some other blogs, and someone said that anything under 160 characters is a short text, but I did not find any academic basis for this.

There is 1 best solution below
jAYANT YADAV On

As far as I know, there is no definitive answer on the length or definition of a short text. The models considered to work best for short text, including WNTM and the Biterm Topic Model (BTM), are motivated by the observation that classical methods such as LDA perform poorly on the short texts found on online social media. The papers use datasets with average document lengths of 12.4 and 8.5 for WNTM, and 3.9, 5.21, and 5.87 for BTM. I would recommend matching your document length to those used in the BTM experiments and then proceeding.
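
To check how your own corpus compares with those figures, a minimal sketch like the one below may help. It assumes length is counted in whitespace-separated tokens (the papers may tokenize and filter differently, for example by removing stop words), and the sample `docs` list and the helper names `doc_lengths` and `summarize` are made up for illustration. The reported averages are copied from the answer above, not independently verified.

```python
from statistics import mean, median

# Average document lengths as quoted in the answer above.
# Assumption: these are token counts per document after the papers' own preprocessing.
REPORTED_AVG_LENGTHS = {
    "WNTM": [12.4, 8.5],
    "BTM": [3.9, 5.21, 5.87],
}


def doc_lengths(docs):
    """Return the number of whitespace-separated tokens in each document."""
    return [len(doc.split()) for doc in docs]


def summarize(docs):
    """Return simple length statistics for a list of documents."""
    lengths = doc_lengths(docs)
    return {"mean": mean(lengths), "median": median(lengths), "max": max(lengths)}


# Hypothetical sample corpus; replace with your own documents.
docs = [
    "short text topic models",
    "biterm topic model for tweets",
    "how long is a short text",
]

stats = summarize(docs)

# Check whether the corpus mean falls inside the span of the reported averages.
all_reported = [v for values in REPORTED_AVG_LENGTHS.values() for v in values]
within_reported_span = min(all_reported) <= stats["mean"] <= max(all_reported)

print(stats)
print("Mean length within reported span:", within_reported_span)
```

If the mean of your corpus is far above the reported averages, that is a hint (not a rule) that a model designed for short texts may not be the natural choice.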