Say I have the following sentences ["hello", "foo bar baz"] and I want to get the 1-, 2- and 3-grams, but keep the 1- and 2-grams only if they are not contained in a 3-gram, i.e. for the two sentences above I would like the vocabulary to be [("hello"), ("foo bar baz")].
If I use CountVectorizer with ngram_range=(1,3) I would get the unigrams foo, bar and baz and their bigrams as well. I also can't just set ngram_range=(3,3), because then hello, which is too short to form a 3-gram, would be dropped.
Is there a way of doing that without a serious workaround?
Unfortunately, scikit-learn does not provide a straightforward way of generating unique n-grams. Here's a simple way using nltk to achieve what you're asking:

With this code, we first generate all n-grams within the specified range for each text. We then count the occurrences of each n-gram across all texts. Finally, we keep only the n-grams that occur once, with the aim of dropping the n-grams that are contained in longer ones. Note that an occurrence count alone does not guarantee this: in the example, foo and foo bar also occur exactly once, so n-grams contained in a longer kept n-gram still need to be filtered out explicitly.
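For reference, the containment filtering the question asks for can be sketched as below. This is a minimal sketch and not necessarily the code originally posted: the helper names (ngrams, is_contained, maximal_ngrams) are made up for illustration, tokens are assumed to be whitespace-separated, and plain Python is used in place of nltk so it runs without extra dependencies (nltk.util.ngrams could replace the ngrams helper).

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of `tokens` as tuples (hypothetical helper)."""
    return list(zip(*(tokens[i:] for i in range(n))))


def is_contained(short, long):
    """True if `short` appears as a contiguous run inside `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def maximal_ngrams(texts, min_n=1, max_n=3):
    # Collect every n-gram in the requested range across all texts.
    candidates = set()
    for text in texts:
        tokens = text.split()  # assumption: simple whitespace tokenization
        for n in range(min_n, max_n + 1):
            candidates.update(ngrams(tokens, n))
    # Keep an n-gram only if no longer candidate contains it.
    kept = [
        g for g in candidates
        if not any(len(h) > len(g) and is_contained(g, h) for h in candidates)
    ]
    # Sort by length, then alphabetically, for a stable result.
    return sorted(kept, key=lambda g: (len(g), g))


vocabulary = maximal_ngrams(["hello", "foo bar baz"])
print(vocabulary)  # prints [('hello',), ('foo', 'bar', 'baz')]
```

If counts are needed afterwards, joining each tuple with spaces should give strings that can be passed as the vocabulary argument of CountVectorizer together with ngram_range=(1, 3). The pairwise containment check is quadratic in the number of candidate n-grams, which is fine for a small corpus but may be slow for a large one.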
Output: