Why do I get a long list of zeros when classifying text?


I have 500 comments in Russian from YouTube. I tokenized them using the youtokentome library.

import pandas as pd
import youtokentome as yttm

# write the raw comments to a plain-text file for BPE training
df['textOriginal'].to_csv('text.txt', index=False, header=False)

# train a BPE tokenizer on the comments
model_path = 'tokenizer.model'
yttm.BPE.train(data='text.txt', model=model_path, vocab_size=5000)

tokenizer = yttm.BPE(model=model_path)

# encode every comment into a list of subword IDs
df['tokens'] = df['textOriginal'].apply(lambda x: tokenizer.encode(x, output_type=yttm.OutputType.ID))

(screenshot: example of the token IDs produced for one comment)

Next, I convert the lists of token IDs into a single padded tensor.

import torch

# turn each list of token IDs into a tensor, then pad all sequences to the same length
tokens_tensor = df['tokens'].apply(lambda x: torch.tensor(x)).tolist()
tokens_tensor = torch.nn.utils.rnn.pad_sequence(tokens_tensor, batch_first=True)
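Just to check myself, the padded result should be one big matrix of token IDs with one row per comment (the exact width depends on my longest comment):

print(tokens_tensor.shape)  # e.g. torch.Size([500, max_len]), where max_len is the longest comment
print(tokens_tensor.dtype)  # torch.int64, which is what nn.Embedding expects as input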

Next, I want to classify the texts into 3 categories. To do this, I use nn.Embedding + nn.LSTM + nn.Linear.

But the return value of the model is unclear to me. I get a long list of zeros.

How do I get the predicted class for each comment?

The code of my model:

embedding_dim = 300
vocab_size = 5000
hidden_size = 512
output_dim = 3

import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, output_dim, dropout_rate=0.5):
        super(MyModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_dim)
        # note: dropout_rate is currently not used anywhere

    def forward(self, input_seq):
        embedded = self.embedding(input_seq)   # (batch, seq_len, embedding_dim)
        lstm_out, _ = self.lstm(embedded)      # (batch, seq_len, hidden_size)
        lstm_out = lstm_out[:, -1, :]          # keep only the last time step
        x = self.linear(lstm_out)              # (batch, output_dim)
        return F.log_softmax(x, dim=1)         # log-probabilities over the 3 classes
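For completeness, this is roughly how I run the padded tensor through the model (a minimal sketch, no training loop yet, so the weights are still random):

model = MyModel(vocab_size, embedding_dim, hidden_size, output_dim)
output = model(tokens_tensor)
print(output.shape)  # expected: torch.Size([500, 3]), one row of log-probabilities per comment
print(output[:3])    # these values are what look like a long list of zeros to me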
