SimpleTransformers 'Roberta' problem with token recognition


I have a problem when predicting the labels of a token sequence with a previously trained SimpleTransformers model.

When I predict the labels of a sequence, the model omits lots of tokens, depending on the value I assign to max_seq_length. If it is set to 128 (the default), the model omits about 40 tokens per sequence. When it is set to 512, the model only omits 3 tokens in total.
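
For reference, this is roughly how I am counting the tokens per sequence. It is only a sketch: it assumes the standard Hugging Face roberta-base tokenizer (which SimpleTransformers uses internally and which max_seq_length is applied to), and the sentence is just a placeholder:

from transformers import AutoTokenizer

# Same tokenizer that the roberta-base model uses internally
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

sentence = "Example sentence taken from the prediction data."  # placeholder
words = sentence.split()

# add_special_tokens=True also counts <s> and </s>, which is part of
# what max_seq_length is compared against
input_ids = tokenizer(sentence, add_special_tokens=True)["input_ids"]
print(len(words), "words ->", len(input_ids), "subword tokens")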

I'm trying to understand why the model omits tokens when none of my sequences is longer than 128 tokens, let alone 512. Any help would be much appreciated. This is the code I'm using:

import pandas as pd
import torch
from simpletransformers.ner import NERModel, NERArgs
from sklearn.metrics import accuracy_score
def main():
    # Training data in the SimpleTransformers NER format
    # (sentence_id, words, labels columns)
    df = pd.read_csv(".\\dataset.csv")

    args = NERArgs()
    args.num_train_epochs = 4
    args.learning_rate = 1e-4
    args.overwrite_output_dir = True
    args.train_batch_size = 8
    args.eval_batch_size = 8
    args.max_seq_length = 512

    model = NERModel(
        "roberta",
        "roberta-base",
        labels=[0, 1, 2, 3, 4, 5],  # numeric label set used in dataset.csv
        args=args,
        use_cuda=False,
    )
    model.train_model(df, accuracy_score=accuracy_score)

    # Save the whole wrapper with torch and reload it before predicting
    torch.save(model, "Auto-ID-model")
    model = torch.load("Auto-ID-model")

    # test: list of sentences to label (its definition is omitted here)
    pred, raw = model.predict(test)


if __name__ == "__main__":
    main()

model.predict is the part of the code that is giving me problems.
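
This is roughly how the omissions show up on my side. It is only a sketch: it assumes test is a list of plain sentence strings, which is the default input format for NERModel.predict, and it relies on predict returning one {word: label} dict per word the model kept:

pred, raw = model.predict(test)

for sentence, labelled in zip(test, pred):
    n_words = len(sentence.split())
    n_kept = len(labelled)  # one {word: label} dict per word that survived
    if n_kept < n_words:
        print(f"{n_words - n_kept} word(s) missing from: {sentence}")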
