I am trying to remove punctuation and spaces (which include newlines), keep only tokens consisting of alphabetic characters, and return the token text. I first define the function:

import spacy

# assuming the small English model here; any installed spaCy model works
nlp = spacy.load("en_core_web_sm")

def spacy_tokenizer(doc):
    return [t.text for t in nlp(doc)
            if not t.is_punct and not t.is_space and t.is_alpha]
And then I vectorize:
vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)
train_feature_vects = vectorizer.fit_transform(train_data)
The process hangs, and the terminal prints the warning `The parameter 'token_pattern' will not be used since 'tokenizer' is not None`. What am I doing wrong?
For `TfidfVectorizer`, `CountVectorizer`, etc. in scikit-learn, when you supply your own `tokenizer` you should also set `token_pattern` to `None`. If no tokenizer is given, scikit-learn tokenizes with the `token_pattern` regex, whose default value is `r"(?u)\b\w\w+\b"`. Once you pass a `tokenizer`, that regex is ignored, which is exactly what the warning is telling you: the warning itself is harmless, and your `spacy_tokenizer` is still the function being used. Passing `token_pattern=None` makes that intent explicit and silences the warning.