I have the following PySpark code where I build a Doc2Vec model and run UMAP on it. The last UMAP line only sometimes throws the error "Cannot assign slice from input of different size".
I can try to specify a random starting seed to find one that always converges for this specific document input, but I really want to improve the model code so it can take any similar document with different data and always converge without me having to manually find a starting seed that works.
What is it about the doc2vec model that makes it sometimes not work with the UMAP function that I can improve?
import gensim
import umap
# Build one TaggedDocument per entry, dropping None tokens
train_corpus = [
    gensim.models.doc2vec.TaggedDocument(
        [word for word in agg_corpus_dict[i]['doc'] if word is not None],
        [str(agg_corpus_dict[i]['id'])],
    )
    for i in range(len(agg_corpus_dict))
]
model = gensim.models.doc2vec.Doc2Vec(vector_size=100, min_count=20, epochs=20, workers=4)
progress_per_value = 1000
model.build_vocab(train_corpus, progress_per=progress_per_value)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
model.make_cum_table()
model.save('<model_dir>')
# Load fitted doc2vec model
doc2vec_model = gensim.models.doc2vec.Doc2Vec.load('<model_dir>')
reducer = umap.UMAP().fit(doc2vec_model.dv.vectors)
I've tried random starting seeds.
Editing your question to show the full error you receive, including all the lines of traceback showing involved lines-of-code/files, will help answerers determine what's going on.
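One way to grab the complete traceback as text is sketched below. It is only an illustration: run_and_capture and failing_step are hypothetical names, and failing_step merely stands in for the real umap.UMAP().fit(...) call by raising the message quoted in the question.

```python
import traceback


def run_and_capture(step, *args, **kwargs):
    """Call `step`; on failure, return the full traceback text instead of raising."""
    try:
        return step(*args, **kwargs), None
    except Exception:
        # format_exc() returns the same text Python would print for the error,
        # including every file/line/function frame.
        return None, traceback.format_exc()


def failing_step(data):
    # Hypothetical stand-in for the call that sometimes fails.
    raise ValueError("Cannot assign slice from input of different size")


result, tb_text = run_and_capture(failing_step, [1, 2, 3])
if tb_text is not None:
    print(tb_text)  # paste this whole text into the question
```

In the real script, the wrapped call would be the UMAP fit itself, so the printed text shows which library lines are involved.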
If the exact same code, using the exact same Doc2Vec model, sometimes succeeds & sometimes fails, that implies some instability in the UMAP code. (Still, seeing the whole error/traceback might offer clues.)
If, on the other hand, it fails reliably on some frozen Doc2Vec models, but not others, you should add extra output to determine what's different between the cases that succeed & the cases that fail. For example, add print(doc2vec_model.dv.vectors.shape) before the line that sometimes fails, & examine (or share in your question) the outputs from both successful & failing runs.
If that shows no obvious difference or coding error between working and non-working cases, I suppose there's a chance the UMAP code is sensitive in some way to the exact values inside the Doc2Vec vectors. I wouldn't normally expect that: in the usual case, all vectors have nonzero dimensions, and I'd expect an algorithm that works on one set of such dimensions to work on others.
But I suppose it might be possible, especially if you're running on a small or quirky amount of data, that some runs are leaving some vector dimensions in weird states, like lots of 0.0 values, and that's perhaps creating problems for the UMAP step, if it assumes otherwise. So if nothing else improves things, that'd be something else to check.
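A rough sketch of such a check follows. The helper name summarize_vectors and the toy data are made up for illustration; the function uses only the standard library and accepts any sequence of rows, so passing doc2vec_model.dv.vectors should work as well, though that is an assumption about your setup rather than something tested against your model.

```python
import math


def summarize_vectors(vectors):
    """Return simple health statistics for a 2-D collection of vectors.

    `vectors` can be any sequence of rows, e.g. a list of lists or a
    NumPy array of document vectors.
    """
    rows = [[float(x) for x in row] for row in vectors]
    return {
        "n_rows": len(rows),
        # More than one distinct length would mean ragged input.
        "row_lengths": sorted({len(row) for row in rows}),
        # NaN or +/-inf values anywhere in the data.
        "non_finite_values": sum(
            1 for row in rows for x in row if not math.isfinite(x)
        ),
        # Individual entries that are exactly 0.0.
        "zero_values": sum(1 for row in rows for x in row if x == 0.0),
        # Rows in which every entry is exactly 0.0.
        "all_zero_rows": sum(
            1 for row in rows if row and all(x == 0.0 for x in row)
        ),
    }


# Toy data standing in for real document vectors (invented for this example).
example = [
    [0.1, -0.2, 0.3],
    [0.0, 0.0, 0.0],           # an all-zero vector
    [0.5, float("nan"), 0.0],  # contains a NaN and a zero
]
report = summarize_vectors(example)
print(report)
```

Comparing this report between a run where UMAP succeeds and one where it fails should show whether the vectors themselves differ in any of these ways.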