I am trying to use scikit-learn's TfidfVectorizer with my own stop words list and my own tokenizer function. Currently I am doing this:
import re
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

def transformation_libelle(sentence, **args):
    stemmer = SnowballStemmer("french")
    # Strip a leading code such as "ABC1 " and turn underscores/hyphens into spaces
    sentence_clean = re.sub(r'^[A-Z]{3}\d ', '', sentence.replace('_', ' ').replace('-', ' '))
    # Keep stemmed, uppercased tokens that are neither stop words nor purely numeric
    return [stemmer.stem(token).upper() for token in re.split(r'\W+', sentence_clean)
            if token not in stop_words
            and not all(char.isdigit() or char == '.' for char in token)]

tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=0, use_idf=True,
                                   tokenizer=transformation_libelle, lowercase=False,
                                   ngram_range=(1, 3), stop_words=stop_words)
Here stop_words is my own list. How can I pass it through to my tokenizer function?
Thank you
I solved it by adding a parameter to my function: transformation_libelle(sentence, stop_words=[], **args)
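Note that TfidfVectorizer calls the tokenizer with the document as its only argument, so the list still has to be bound to the new parameter somehow, for example with functools.partial. A minimal sketch, where my_stop_words stands in for your actual list (the example words are placeholders):

from functools import partial

my_stop_words = ["le", "la", "les", "de", "du"]  # placeholder; substitute your own list

tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=0, use_idf=True,
                                   tokenizer=partial(transformation_libelle,
                                                     stop_words=my_stop_words),
                                   lowercase=False, ngram_range=(1, 3))

Since the filtering now happens inside the tokenizer, the stop_words argument to TfidfVectorizer can be dropped; scikit-learn applies that filter after the tokenizer runs, so the original lowercase words would no longer match the stemmed, uppercased tokens anyway.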