I am topic modeling two large text documents (around 500-750 KB each) and asking for ten topics, but I keep getting what looks like the same one or two topics repeated. Could this be caused by the small number of documents? Or should I change the alpha/beta (gensim calls beta `eta`) parameters?
Here is the code for the model part:
```python
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=10,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=2,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
```
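Since I only have two documents, one idea I've been considering is splitting each document into paragraph-level pseudo-documents so LDA has more than two rows to learn from. A minimal sketch (the `doc1`/`doc2` strings here are made-up stand-ins for my real texts):

```python
# Split each large document into paragraph-level pseudo-documents,
# so LDA sees many short documents instead of just two long ones.
def split_into_paragraphs(text):
    # Treat blank lines as paragraph boundaries; drop empty pieces.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# Made-up stand-ins for the two real documents.
doc1 = "First paragraph about police.\n\nSecond paragraph about the city."
doc2 = "A paragraph about voting.\n\nAnd one more about people."

pseudo_docs = split_into_paragraphs(doc1) + split_into_paragraphs(doc2)
print(len(pseudo_docs))  # 4 pseudo-documents instead of 2 documents
```

Each pseudo-document would then be tokenized and passed through `id2word.doc2bow` the same way the full documents were.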
Here are the topics:
```
[(0,
'0.005*"city" + 0.004*"police" + 0.003*"people" + 0.003*"thank" + '
'0.003*"know" + 0.003*"want" + 0.002*"go" + 0.002*"say" + 0.002*"time" + '
'0.002*"cop"'),
(1,
'0.001*"people" + 0.001*"cop" + 0.001*"city" + 0.001*"want" + 0.001*"go" + '
'0.001*"police" + 0.001*"thank" + 0.001*"time" + 0.001*"know" + 0.001*"say"'),
(2,
'0.001*"people" + 0.001*"police" + 0.001*"city" + 0.001*"thank" + '
'0.001*"want" + 0.001*"cop" + 0.001*"go" + 0.001*"know" + 0.001*"say" + '
'0.001*"make"'),
(3,
'0.002*"city" + 0.002*"people" + 0.001*"know" + 0.001*"want" + '
'0.001*"police" + 0.001*"go" + 0.001*"say" + 0.001*"vote" + 0.001*"time" + '
'0.001*"cop"'),
(4,
'0.001*"city" + 0.001*"police" + 0.001*"cop" + 0.001*"people" + 0.001*"go" + '
'0.001*"thank" + 0.001*"want" + 0.001*"vote" + 0.001*"make" + 0.001*"time"'),
(5,
'0.020*"city" + 0.014*"people" + 0.013*"police" + 0.011*"cop" + 0.010*"go" + '
'0.010*"thank" + 0.009*"want" + 0.009*"know" + 0.008*"say" + 0.006*"time"'),
(6,
'0.001*"city" + 0.001*"go" + 0.001*"know" + 0.001*"people" + 0.001*"police" '
'+ 0.001*"cop" + 0.001*"want" + 0.001*"vote" + 0.000*"say" + 0.000*"time"'),
(7,
'0.002*"city" + 0.001*"people" + 0.001*"police" + 0.001*"thank" + 0.001*"go" '
'+ 0.001*"want" + 0.001*"know" + 0.001*"cop" + 0.001*"vote" + 0.001*"say"'),
(8,
'0.003*"city" + 0.003*"people" + 0.003*"police" + 0.002*"thank" + 0.002*"go" '
'+ 0.002*"know" + 0.002*"vote" + 0.002*"want" + 0.002*"say" + 0.002*"time"'),
(9,
'0.017*"people" + 0.014*"city" + 0.012*"police" + 0.010*"go" + 0.010*"thank" '
'+ 0.010*"want" + 0.009*"know" + 0.009*"say" + 0.009*"vote" + 0.008*"time"')]
```
The visualization:
```python
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis
```
(See the attached screenshot of the pyLDAvis visualization.)
I have tried adjusting the parameters somewhat, but haven't seen any improvement. It's also hard to find what a normal range for the alpha and beta (eta) parameters is.