How do I tell DataCollatorForLanguageModeling to use my own labels instead of labels built from the input_ids?
Here's an MWE:
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling

data = {
    'sources': ["This is some text", "Another text athta ljdlsfjsdlf", "Also some bulshit type text who knows wtf?"],
    'targets': ["Some potential target.", "The answer is JoLo!", "Who killed margaret and what was the motive and poential causes!"]
}
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
config = AutoConfig.from_pretrained("openai-community/gpt2")
gpt2model = AutoModelForCausalLM.from_config(config)
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
>> "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained"
tokenized_data = tokenizer(data['sources'])
with tokenizer.as_target_tokenizer():
    tokenized_data['labels'] = tokenizer(data['targets'])
>> tokenized_data
{'input_ids': [[1212, 318, 617, 2420], [6610, 2420, 379, 4352, 64, 300, 73, 67, 7278, 69, 8457, 67, 1652], [7583, 617, 4807, 16211, 2099, 2420, 508, 4206, 266, 27110, 30]], 'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
'labels': {'input_ids': [[4366, 2785, 2496, 13], [464, 3280, 318, 5302, 27654, 0], [8241, 2923, 6145, 8984, 290, 644, 373, 262, 20289, 290, 745, 1843, 5640, 0]], 'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}}
tokenized_labels = tokenized_data.pop('labels')
outputs = data_collator(tokenized_data)
>> outputs
{'input_ids': tensor([[ 1212, 318, 617, 2420, 50257, 50257, 50257, 50257, 50257, 50257,
50257, 50257, 50257],
[ 6610, 2420, 379, 4352, 64, 300, 73, 67, 7278, 69,
8457, 67, 1652],
[ 7583, 617, 4807, 16211, 2099, 2420, 508, 4206, 266, 27110,
30, 50257, 50257]]), 'attention_mask': tensor([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]), 'labels': tensor([[ 1212, 318, 617, 2420, -100, -100, -100, -100, -100, -100,
-100, -100, -100],
[ 6610, 2420, 379, 4352, 64, 300, 73, 67, 7278, 69,
8457, 67, 1652],
[ 7583, 617, 4807, 16211, 2099, 2420, 508, 4206, 266, 27110,
30, -100, -100]])}
Now outputs['labels'] is just a copy of outputs['input_ids'] with the padding positions set to -100; DataCollatorForLanguageModeling builds it automatically when mlm=False (the actual shift for next-token prediction happens inside the model).
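For reference, my understanding (treat this as an assumption on my part, not the library's actual code) is that with mlm=False the collator effectively does something like:

# Rough manual equivalent of DataCollatorForLanguageModeling(mlm=False):
# pad the inputs, copy input_ids to labels, and mask the padding with -100.
batch = tokenizer.pad(tokenized_data, return_tensors="pt")
batch["labels"] = batch["input_ids"].clone()
batch["labels"][batch["attention_mask"] == 0] = -100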
The question is: since I do have proper labels for this data (tokenized_data['labels'], i.e. the tokenized_labels variable above), how do I use them with the Trainer class?
So dataset.map(...) will tokenize the whole dataset and return tokens for both the source texts and the targets.
Then the data collator will create labels by copying the input_ids (masking padding with -100) and feed that to the model.
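Concretely, the preprocessing step I have in mind looks roughly like this (the function name, column names, and the dataset variable are just placeholders for the dict above wrapped in a datasets.Dataset):

def preprocess(batch):
    # Tokenize sources as inputs and targets as the labels column.
    model_inputs = tokenizer(batch["sources"])
    labels = tokenizer(batch["targets"])
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess, batched=True, remove_columns=["sources", "targets"])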
How do I tell Trainer or the data collator to use my tokenized_labels instead of creating labels from the inputs?
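What I have been considering (not sure whether this is the intended way) is a small custom collator that keeps my precomputed labels and pads them with -100 myself, instead of letting DataCollatorForLanguageModeling overwrite them. The class name and padding logic below are my own sketch, not a transformers API:

import torch

class CollatorWithMyLabels:
    # Sketch: pad input_ids/attention_mask with the tokenizer,
    # pad my own labels with -100 so the loss ignores padding.
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, features):
        labels = [f.pop("labels") for f in features]
        batch = self.tokenizer.pad(features, return_tensors="pt")
        max_len = max(len(l) for l in labels)
        batch["labels"] = torch.tensor(
            [l + [-100] * (max_len - len(l)) for l in labels]
        )
        # Not sure yet how these labels should be aligned with the
        # input length for GPT-2's loss computation.
        return batch

But I don't know whether this fights the Trainer API, or whether there is a built-in collator that simply keeps an existing labels column.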