How to pass my stop_words list using TfidfVectorizer?


I am trying to use TfidfVectorizer with my own stop-words list and my own tokenizer function. Currently I am doing this:

import re
from nltk.stem.snowball import SnowballStemmer

def transformation_libelle(sentence, **args):
    stemmer = SnowballStemmer("french")
    # Drop a leading "ABC1 "-style prefix and treat underscores/hyphens as spaces
    sentence_clean = re.sub(r'^[A-Z]{3}\d ', '', sentence.replace('_', ' ').replace('-', ' '))
    # Stem and upper-case each token, skipping stop words (a global here) and purely numeric tokens
    return [stemmer.stem(token).upper()
            for token in re.split(r'\W+', sentence_clean)
            if token not in stop_words
            and not all(char.isdigit() or char == '.' for char in token)]

tfidf_vectorizer = TfidfVectorizer(max_df=0.5,
                                   min_df=0,
                                   use_idf=True,
                                   tokenizer=transformation_libelle,
                                   lowercase=False,
                                   ngram_range=(1, 3),
                                   stop_words=stop_words)

Here stop_words is my own list. How can I pass it through to my tokenizer function?

Thank you

1 Answer

Answered by Lefloch Had On:

I just added a stop_words parameter to my function's signature: transformation_libelle(sentence, stop_words=[], **args)