I have 500 comments in Russian from YouTube. I tokenized them using the youtokentome library.
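import youtokentome as yttm
# Write one comment per line to a plain-text file for BPE training
df['textOriginal'].to_csv('text.txt', index=False, header=False)
model_path = 'tokenizer.model'
# Train a BPE model with a 5000-token vocabulary, then load it
yttm.BPE.train(data='text.txt', model=model_path, vocab_size=5000)
tokenizer = yttm.BPE(model=model_path)
# Encode every comment as a list of token IDs
df['tokens'] = df['textOriginal'].apply(lambda x: tokenizer.encode(x, output_type=yttm.OutputType.ID))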
Next, I convert the lists of token IDs into tensors and pad them to the same length.
tokens_tensor = df['tokens'].apply(lambda x: torch.tensor(x)).tolist()
tokens_tensor = torch.nn.utils.rnn.pad_sequence(tokens_tensor, batch_first=True)
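To make the padding step concrete, here is a minimal sketch of what `pad_sequence` produces. The token IDs are made-up stand-ins for `df['tokens']`, and `lengths` is an extra variable, not part of my code above, that records where each real sequence ends. As far as I understand, youtokentome reserves ID 0 for padding by default, which would match the default `padding_value=0`, but treat that as an assumption.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token ID lists standing in for df['tokens']
token_lists = [[5, 17, 42], [8, 3], [11, 9, 2, 7, 30]]

# One 1-D LongTensor per comment
sequences = [torch.tensor(ids, dtype=torch.long) for ids in token_lists]

# True lengths before padding (useful later to skip padded positions)
lengths = torch.tensor([len(s) for s in sequences])

# Shorter sequences are right-padded with 0 up to the longest one
padded = pad_sequence(sequences, batch_first=True, padding_value=0)

print(padded.shape)  # (batch_size, max_seq_len)
print(padded)
```

With `batch_first=True` the result has shape (number of comments, length of the longest comment), so every short comment ends in a run of zeros.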
Next, I want to classify the text into 3 categories. To do this, I use nn.Embedding + nn.LSTM + nn.Linear.
But the return value of the model is unclear to me. I get a long list of zeros.
How do I get the classification of objects?
Here is the code of my model:
embedding_dim = 300
vocab_size = 5000
hidden_size = 512
output_dim = 3
import torch
import torch.nn as nn
import torch.nn.functional as F
class MyModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, output_dim, dropout_rate=0.5):
        super(MyModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_dim)

    def forward(self, input_seq):
        embedded = self.embedding(input_seq)
        lstm_out, _ = self.lstm(embedded)
        lstm_out = lstm_out[:, -1, :]
        x = self.linear(lstm_out)
        return F.log_softmax(x, dim=1)
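For reference, here is a minimal sketch of how class labels could be read from a model like this. It is not a confirmed fix. `TinyClassifier`, the `lengths` argument, and the small layer sizes are hypothetical stand-ins for illustration. The output of `log_softmax` has shape (batch, 3) and holds log-probabilities, so `argmax(dim=1)` gives one class index per comment. The sketch also takes the LSTM output at the last real token instead of index -1, on the assumption that reading padded positions may be one reason the outputs look alike. An untrained model still returns arbitrary classes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class TinyClassifier(nn.Module):  # hypothetical, smaller stand-in for MyModel
    def __init__(self, vocab_size=50, embedding_dim=8, hidden_size=16, output_dim=3):
        super().__init__()
        # padding_idx=0 keeps the embedding of the pad token fixed at zeros
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_dim)

    def forward(self, input_seq, lengths):
        embedded = self.embedding(input_seq)
        lstm_out, _ = self.lstm(embedded)
        # Select the output at the last non-padded position of each sequence
        batch_idx = torch.arange(input_seq.size(0))
        last = lstm_out[batch_idx, lengths - 1]
        return F.log_softmax(self.linear(last), dim=1)

# Made-up padded batch (0 = padding) and the true lengths of its rows
batch = torch.tensor([[5, 17, 42, 0, 0],
                      [8, 3, 0, 0, 0],
                      [11, 9, 2, 7, 30]])
lengths = torch.tensor([3, 2, 5])

model = TinyClassifier()
model.eval()
with torch.no_grad():
    log_probs = model(batch, lengths)    # shape (batch, 3), log-probabilities
    predicted = log_probs.argmax(dim=1)  # one class index (0, 1 or 2) per row
    probs = log_probs.exp()              # probabilities, each row sums to 1

print(log_probs.shape, predicted.shape)
```

Since the model returns log-probabilities, the matching training loss would be `nn.NLLLoss`; `nn.CrossEntropyLoss` expects raw logits instead.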