TensorFlow tokenizer question: what exactly does num_words do?


When executing this code I get 11937, but shouldn't I get 10,000? If not, I have a few follow-up questions:

  1. What's the point of num_words?
  2. What does the number 11937 I got represent?
  3. How do I limit the size of my vocabulary?
from tensorflow.keras.preprocessing.text import Tokenizer

MAX_WORDS_COUNT = 10000   # intended vocabulary cap
WIN_SIZE = 1000
WIN_HOP = 100

tokenizer = Tokenizer(num_words=MAX_WORDS_COUNT,
                      filters='!"#$%&()*+,-–—./…:;<=>?@[\\]^_`{|}~«»\t\n\xa0\ufeff',
                      lower=True, split=' ', oov_token='unknown_word', char_level=False)

tokenizer.fit_on_texts(x_data)

# Count the entries in the fitted vocabulary
items = list(tokenizer.word_index.items())
print(len(items))  # prints 11937, not 10000

I expected 10,000 as the output because I believe num_words limits the size of the vocabulary.
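For what it's worth, a small sketch of the behavior in question (the corpus and token names here are made up for illustration): in Keras, word_index always stores every word seen during fit_on_texts, so its length is the full vocabulary size (11937 in the post above). num_words is only applied later, when texts are converted to sequences or matrices, where words whose index is >= num_words are mapped to the OOV token (or dropped if no oov_token was set).

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Toy corpus: "a" appears 4 times, "b" 3, "c" 2, "d" 1
texts = ["a b c d", "a b c", "a b", "a"]

# num_words=3 keeps only indices < 3: index 1 is the OOV token, index 2 is "a"
tok = Tokenizer(num_words=3, oov_token="unk")
tok.fit_on_texts(texts)

# word_index holds the FULL vocabulary, ignoring num_words
print(len(tok.word_index))  # 5 entries: unk, a, b, c, d

# num_words is applied here: b, c, d fall back to the OOV index
print(tok.texts_to_sequences(["a b c d"]))
```

So to get an effectively limited vocabulary, there is no need to shrink word_index itself; the cap is enforced by texts_to_sequences / texts_to_matrix. If you want a truncated view for inspection, you can take the first num_words entries sorted by index.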

If needed I can provide the full code from my colab notebook.
