Custom tokenizer not working in CountVectorizer (sklearn)


I am trying to build a CountVectorizer with a custom tokenizer function, and I am running into a strange problem with it. In the code below, temp_tok is a list of 5 values that is later used as the vocabulary.

temp_tok = ["or", "Normal sinus rhythm", "sinus", "anuj","Normal sinus"]

def tokenize(text):
    return [temp_tok[0],temp_tok[1], "sinus", "Normal sinus"]

def tokenize2(text):
    return [i for i in temp_tok if i in text]

text = "Normal sinus rhythm"

Both functions produce the same output for this text:

tokenize(text)
output = ['or', 'Normal sinus rhythm', 'sinus', 'Normal sinus']

But when I build vectorizers with these tokenizers, I get unexpected output for tokenize2. The vocabulary is temp_tok for both. I experimented with ngram_range, but it does not help.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(vocabulary=temp_tok, tokenizer=tokenize)
vectorizer2 = CountVectorizer(vocabulary=temp_tok, tokenizer=tokenize2)

While vectorizer.transform([text]) gives the expected output, vectorizer2.transform([text]) gives a 1 only for "or" and "sinus":

vectorizer.transform(["Normal sinus rhythm"]).toarray()
array([[1, 1, 1, 0, 1]])

vectorizer.transform(["Normal sinus rhythm"]).toarray()
array([[1, 0, 1, 0, 0]])

I also tried passing a dictionary instead of the list temp_tok as the vocabulary to CountVectorizer, but it doesn't help. Is this a sklearn problem, or am I doing something wrong?

1 Answer

Best answer, by Anuj Chopra

CountVectorizer lowercases the text (lowercase=True by default) before passing it to the tokenizer. So tokenize2, which matches the mixed-case vocabulary entries against the now-lowercased text, does not work, while tokenize works because it ignores its input entirely. This can be seen by adding a print call in tokenize2:

def tokenize2(text):
    print(text)
    return [i for i in temp_tok if i in text]
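
With the default settings, this print shows the lowercased document, e.g. "normal sinus rhythm" for the input above, so only the vocabulary entries that are already lowercase ("or" and "sinus") can match.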

A good solution is to change the elements of temp_tok to lower case; otherwise, any technique that makes the casing of the vocabulary and the text consistent will work.
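
A minimal sketch of both options, assuming the setup from the question (the names vectorizer2_fixed, temp_tok_lower, tokenize2_lower, and vectorizer3 below are mine, not from the post):

from sklearn.feature_extraction.text import CountVectorizer

temp_tok = ["or", "Normal sinus rhythm", "sinus", "anuj", "Normal sinus"]

def tokenize2(text):
    return [i for i in temp_tok if i in text]

# Option 1: keep the mixed-case vocabulary and turn off sklearn's lowercasing,
# so the tokenizer sees the raw text.
vectorizer2_fixed = CountVectorizer(vocabulary=temp_tok, tokenizer=tokenize2,
                                    lowercase=False)
print(vectorizer2_fixed.transform(["Normal sinus rhythm"]).toarray())
# [[1 1 1 0 1]]

# Option 2: lowercase the vocabulary so it matches the lowercased text
# that CountVectorizer hands to the tokenizer by default.
temp_tok_lower = [t.lower() for t in temp_tok]

def tokenize2_lower(text):
    return [i for i in temp_tok_lower if i in text]

vectorizer3 = CountVectorizer(vocabulary=temp_tok_lower, tokenizer=tokenize2_lower)
print(vectorizer3.transform(["Normal sinus rhythm"]).toarray())
# [[1 1 1 0 1]]

Both should give the same counts as the first vectorizer in the question; option 1 is the smaller change if the vocabulary needs to keep its original casing.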