I am working on a sentiment analysis project where I am analyzing a corpus of documents, and I am specifically not removing the word "not" as a stopword, so that I can use it to determine if a text agrees or disagrees with something. For instance, there is a difference between "not effective" and "effective" when discussing the COVID vaccine.
However, my phraser is not identifying any bigrams containing the word "not". I presume this is because the token is so frequent (particularly since I expanded contractions, so "isn't" -> "is not") that the scoring function scores every bigram with "not" too low. The standard phrase scoring function is:
`score(worda, wordb) = (bigram_count - min_count) * len_vocab / (worda_count * wordb_count)`
(where `min_count` is a hyperparameter)
So, since "not" occurs many thousands of times in the corpus, `worda_count` will be very large, which inflates the denominator and drops the score considerably.
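To make the effect concrete, here is a minimal sketch that re-implements the default scoring formula in plain Python, following the formula as documented for Gensim's `original_scorer`. All of the counts below are made-up illustrative numbers, not measurements from a real corpus, and the threshold of 10.0 is assumed to be the `Phrases` default.

```python
def original_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Plain-Python version of the default (Mikolov-style) bigram score."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


# Hypothetical corpus statistics, chosen only for illustration.
LEN_VOCAB = 50_000
MIN_COUNT = 5
THRESHOLD = 10.0  # assumed default threshold in gensim's Phrases

# A bigram whose first word is fairly rare ("new york").
new_york = original_score(
    worda_count=500, wordb_count=300, bigram_count=250,
    len_vocab=LEN_VOCAB, min_count=MIN_COUNT,
)

# A bigram whose first word is extremely common ("not effective").
not_effective = original_score(
    worda_count=40_000, wordb_count=600, bigram_count=200,
    len_vocab=LEN_VOCAB, min_count=MIN_COUNT,
)

print(f"new york:      {new_york:.2f}  passes threshold: {new_york > THRESHOLD}")
print(f"not effective: {not_effective:.2f}  passes threshold: {not_effective > THRESHOLD}")
```

With these invented counts, "not effective" co-occurs almost as often as "new york", yet its score lands far below the threshold purely because "not" itself is so frequent.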
Is there a way to get around this, so "not" bigrams are scored effectively?
I can think of a few options off the top of my head:
- Write my own scoring function that effectively has two scoring formulas: the standard one, and a different one that is used when the first word is "not".
- Include "not" in the list of `connector_words`. However, the `gensim.models.phrases.Phraser` documentation specifically indicates that connector words cannot be at the beginning or end of a phrase.
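A rough sketch of the first option, written as standalone Python rather than as a Gensim plug-in. As far as I can tell from the docs, a custom `scoring` callable passed to `Phrases` receives only counts (`worda_count`, `wordb_count`, `bigram_count`, and so on), not the tokens themselves, so it cannot tell whether the first word is "not"; that is why this sketch does its own counting. The function name `score_bigrams` and the alternative formula for "not" bigrams (the share of `wordb` occurrences that are preceded by "not") are hypothetical choices for illustration only.

```python
from collections import Counter


def score_bigrams(sentences, min_count=1, negator="not"):
    """Score every adjacent word pair, using a separate formula for negator bigrams.

    `sentences` is an iterable of token lists. Returns {(worda, wordb): score}.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    len_vocab = len(unigrams)
    scores = {}
    for (worda, wordb), bigram_count in bigrams.items():
        if bigram_count < min_count:
            continue
        if worda == negator:
            # Hypothetical alternative: ignore how common the negator is and
            # ask what fraction of wordb's occurrences directly follow it.
            scores[(worda, wordb)] = bigram_count / unigrams[wordb]
        else:
            # Standard formula, as in the default scorer.
            scores[(worda, wordb)] = (
                (bigram_count - min_count) / unigrams[worda] / unigrams[wordb] * len_vocab
            )
    return scores


sentences = [
    ["vaccine", "is", "not", "effective"],
    ["it", "is", "not", "effective"],
    ["it", "is", "effective"],
]
scores = score_bigrams(sentences, min_count=1)
print(scores[("not", "effective")])
```

The two formulas produce values on different scales (the negator score is a fraction between 0 and 1), so each branch would need its own threshold before deciding which pairs to join.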

As you've discovered, the `Phrases` functionality in Gensim is pretty crude: it only combines words based on a meaning-oblivious statistical analysis. It's more likely to be helpful in promoting certain noun-phrases ('new_york') or idioms than generic syntactical reversals-of-meaning (as with an added 'not'). So whether you'll want to use it at all, I'm not sure.
You could try the most simpleminded thing possible: preprocess to always attach 'not' to the following word. Maybe it'll help!
You could also try some expensive grammar-aware preprocessing - the sort that labels words with parts-of-speech, & further identifies which other words/word-ranges a particular 'not' modifies. That might allow you to conditionally connect the 'not' to later words, maybe even non-contiguous words, & perhaps that will provide a lift to downstream sentiment-analysis.
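The simpleminded option, always attaching 'not' to the following word, might look something like the sketch below. The helper name `attach_not` and the `_` joiner are just illustrative choices, and it assumes the text is already tokenized with contractions expanded.

```python
def attach_not(tokens, negator="not", joiner="_"):
    """Merge each `negator` token with the token that immediately follows it."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] == negator and i + 1 < len(tokens):
            # Glue the negator onto the next word and skip past both.
            merged.append(negator + joiner + tokens[i + 1])
            i += 2
        else:
            # Ordinary token, or a trailing negator with nothing to attach to.
            merged.append(tokens[i])
            i += 1
    return merged


print(attach_not(["the", "vaccine", "is", "not", "effective"]))
```

Running this over every document before any `Phrases` training (or instead of it) means 'not_effective' & 'effective' reach the sentiment model as distinct tokens, regardless of how the bigram would have scored.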