TensorFlow tokenizer question: what exactly does num_words do?


When executing this code I get 11937, but shouldn't I get 10,000? If not, I have a few follow-up questions:

  1. What's the point of num_words?
  2. What does the number 11937 I got represent?
  3. How do I limit the size of my vocabulary?
from tensorflow.keras.preprocessing.text import Tokenizer

MAX_WORDS_COUNT = 10000   # intended vocabulary cap
WIN_SIZE = 1000
WIN_HOP = 100

tokenizer = Tokenizer(num_words=MAX_WORDS_COUNT,
                      filters='!"#$%&()*+,-–—./…:;<=>?@[\\]^_`{|}~«»\t\n\xa0\ufeff',
                      lower=True, split=' ', oov_token='unknown_word', char_level=False)

tokenizer.fit_on_texts(x_data)

# Count the entries in the fitted vocabulary
items = list(tokenizer.word_index.items())
print(len(items))  # prints 11937, not 10000

I expected 10,000 as the output because I believe num_words limits the size of the vocabulary.
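For what it's worth, a small sketch of the behavior in question (the corpus and token names here are made up for illustration): in Keras, word_index always stores every word seen during fit_on_texts, so its length is the full vocabulary size (11937 in the post above). num_words is only applied later, when texts are converted to sequences or matrices, where words whose index is >= num_words are mapped to the OOV token (or dropped if no oov_token was set).

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Toy corpus: "a" appears 4 times, "b" 3, "c" 2, "d" 1
texts = ["a b c d", "a b c", "a b", "a"]

# num_words=3 keeps only indices < 3: index 1 is the OOV token, index 2 is "a"
tok = Tokenizer(num_words=3, oov_token="unk")
tok.fit_on_texts(texts)

# word_index holds the FULL vocabulary, ignoring num_words
print(len(tok.word_index))  # 5 entries: unk, a, b, c, d

# num_words is applied here: b, c, d fall back to the OOV index
print(tok.texts_to_sequences(["a b c d"]))
```

So to get an effectively limited vocabulary, there is no need to shrink word_index itself; the cap is enforced by texts_to_sequences / texts_to_matrix. If you want a truncated view for inspection, you can take the first num_words entries sorted by index.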

If needed I can provide the full code from my colab notebook.
