I am trying to remove punctuation and spaces (which include newlines), keep only tokens consisting of alphabetic characters, and return the token text. I first define the function:

import spacy

# assuming the small English model here; any installed spaCy model works
nlp = spacy.load("en_core_web_sm")

def spacy_tokenizer(doc):
    return [t.text for t in nlp(doc)
            if not t.is_punct and not t.is_space and t.is_alpha]
And then I vectorize:
vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)
train_feature_vects = vectorizer.fit_transform(train_data)
The process hangs, and the terminal prints the warning `The parameter 'token_pattern' will not be used since 'tokenizer' is not None`. What am I doing wrong?
For `TfidfVectorizer`, `CountVectorizer`, etc. in scikit-learn, when you supply your own `tokenizer` you should also set `token_pattern` to `None`. If no tokenizer is given, scikit-learn tokenizes with the `token_pattern` regex, whose default value is `r"(?u)\b\w\w+\b"`. Once you pass a `tokenizer`, that regex is ignored, which is exactly what the warning is telling you: the warning itself is harmless, and your `spacy_tokenizer` is still the function being used. Passing `token_pattern=None` makes that intent explicit and silences the warning.