How to handle tokens that don't have a label in an NLP task?


I'm working on training an NLP model to detect sensitive information in documents. There are 15 categories of sensitive information I'm attempting to predict. Adding another category for non-sensitive data seemed like a good idea so that the model could distinguish between sensitive and non-sensitive tokens.

However, this has led to a very skewed dataset, and the model has become very accurate at predicting the non-sensitive data. I'm unsure how to approach the task, because if I mask the non-sensitive data, I don't believe the model would learn to distinguish between sensitive and non-sensitive very well.

Right now there is so much non-sensitive data that I'm considering heavy undersampling to shrink the dataset, as sketched below.
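
Something like this is what I had in mind for the undersampling: keep every sequence that contains at least one sensitive label, and only a small fraction of the sequences that are purely padding/non-sensitive (train_X_gru / train_y_gru are the arrays I feed to the model below, and the 10% keep rate is just a placeholder):

import numpy as np

rng = np.random.default_rng(42)

# A sequence is "all non-sensitive" if every label is padding (0) or non-sensitive (1)
all_nonsensitive = np.all(train_y_gru <= 1, axis=1)

keep_rate = 0.1  # placeholder: keep ~10% of the purely non-sensitive sequences
keep_mask = ~all_nonsensitive | (rng.random(len(train_y_gru)) < keep_rate)

train_X_small = train_X_gru[keep_mask]
train_y_small = train_y_gru[keep_mask]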

After training the model on the current data, the metrics are great for the non-sensitive class; however, most of the other categories aren't predicted very well.

Model training results
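
For the per-category numbers, this is roughly how I break the results down with scikit-learn (a sketch; test_X_gru / test_y_gru are placeholder names for my held-out arrays, and padding positions are dropped so they don't inflate the scores):

import numpy as np
from sklearn.metrics import classification_report

# (samples, 38, 17) softmax outputs -> (samples, 38) predicted label ids
pred = np.argmax(model_gru.predict(test_X_gru), axis=-1)

y_true = test_y_gru.flatten()
y_pred = pred.flatten()

# Drop the padding positions (label 0) before scoring
keep = y_true != 0
print(classification_report(y_true[keep], y_pred[keep]))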

Data Overview

The data for this problem is a dataset of medical notes with fake sensitive information included in them. I don't believe the dataset is very large compared to other NLP datasets, but I don't have much experience with NLP problems. It is 26 MB when formatted and split into Excel files for Train, Validation, and Test.

When splitting the data, I set Training to 50% of the data, Validation to 25%, and Test to 25%. I did this to ensure the test set had enough examples of the different categories of sensitive info in it.
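
The split itself is nothing fancy, roughly two calls to scikit-learn's train_test_split (a sketch; sentences and labels are placeholder names for the unpadded data):

from sklearn.model_selection import train_test_split

# 50% train, then split the remaining 50% in half -> 25% validation, 25% test
train_X, rest_X, train_y, rest_y = train_test_split(sentences, labels, test_size=0.5, random_state=42)
val_X, test_X, val_y, test_y = train_test_split(rest_X, rest_y, test_size=0.5, random_state=42)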

Data Specifics

The data is being classified in the context of each sentence, so the data is formatted into 38-token sequences to be used as input. Each token has been assigned a label, including the padding. However, I have a mask on my model which ignores the padding and its labels. All of the labels are encoded as integers from 0 to 16: padding is 0, non-sensitive information is 1, and the 15 sensitive categories are 2 through 16.

An example sequence of input_ids and matching labels:

  • Tokens:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2501, 3058, 1024, 12875, 2575, 1011, 6185, 1011, 2260, 6063, 2030, 2705, 24174, 2594, 9228, 1018, 2081, 2527, 4418, 18168, 4817, 1010, 11721, 22955, 2581, 2475]
  • Labels: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 10, 10, 10, 10, 10, 10, 9, 1, 1, 1, 1, 1, 14, 14, 14, 14, 12, 12, 1, 5, 11, 11, 1]

I will also add that the sentences have been tokenized with the bert-base-uncased tokenizer.
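
Roughly, the tokenization and pre-padding step looks like this (a sketch; sentences is a placeholder for the raw sentences, and the label alignment is omitted):

from transformers import AutoTokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# token ids for each sentence, without [CLS]/[SEP] since the labels are per word piece
token_ids = [tokenizer.encode(s, add_special_tokens=False) for s in sentences]

# pre-pad with 0 up to the fixed length of 38 (matches the example sequence above)
input_ids = pad_sequences(token_ids, maxlen=38, padding="pre", value=0)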

The model

Because I believed the dataset was small, I've been using a GRU model rather than an LSTM or trying to train a Transformer.

Here is my code for the model:

from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dropout, Dense

# Dataset statistics
numb_classes = 17   # 15 sensitive categories + non-sensitive + padding
vocab_size = 30522  # vocab size of the bert-base-uncased tokenizer
out_dim = 100       # embedding dimension
input_len = 38      # fixed sequence length

callback = keras.callbacks.EarlyStopping(monitor='loss', patience=3)

model_gru = Sequential()
# mask_zero=True makes the downstream layers (and the loss) ignore the padding token 0
model_gru.add(Embedding(input_dim=vocab_size, output_dim=out_dim, input_length=input_len, mask_zero=True))
model_gru.add(GRU(64, return_sequences=True, activation='tanh'))
model_gru.add(Dropout(0.25))
model_gru.add(GRU(32, return_sequences=True, activation='tanh'))
model_gru.add(Dropout(0.25))
# per-token softmax over the 17 label classes
model_gru.add(Dense(numb_classes, activation='softmax'))

model_gru.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                  metrics=['sparse_categorical_accuracy'])
history_gru = model_gru.fit(train_X_gru, train_y_gru, epochs=10, batch_size=70,
                            validation_data=(val_X_gru, val_y_gru),
                            callbacks=[callback])

The model is very simple right now because it seems to be overfitting the data, and I've added dropout to help combat that; a couple of other regularization options I'm considering are sketched below.
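
If the dropout layers aren't enough, I'm also thinking about recurrent dropout and a small L2 penalty on the GRU layers, something like:

from tensorflow.keras import regularizers
from tensorflow.keras.layers import GRU

# recurrent_dropout applies dropout to the recurrent state as well;
# kernel_regularizer adds an L2 penalty on the input weights
GRU(64, return_sequences=True, activation='tanh',
    recurrent_dropout=0.2,
    kernel_regularizer=regularizers.l2(1e-4))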
