Why can we set an LLM's input and output to be the same when fine-tuning on a text generation task?


I'm trying to fine-tune a GPT-2 model for song-lyric generation, and I have a number of song lyrics on hand. However, I'm confused about how to fine-tune GPT-2 when the task doesn't have a standard input/output format: I want the fine-tuned model to generate anything in a lyric style, and I don't know what the expected output should be for the song-lyric dataset I have.

After searching related articles online, I found a really confusing solution in How to Fine-Tune GPT-2 for Text Generation that uses the following training step:

outputs = model(input_tensor, labels=input_tensor)  # labels are just the input again
loss = outputs[0]                                    # the language-modeling loss
loss.backward()

From my understanding, the first argument is the model's input text, while the second argument, labels, is usually the model's expected output. If we just set them to be the same, aren't we training a repeater that always echoes its input? If so, how can we expect the fine-tuned model to produce anything in a lyric style?

(Additional question from my trial and error): Intuitively, I thought I should split each song into two halves, use the first half as the input to GPT-2, and set the expected output to the second half. But after some experiments, my fine-tuned GPT-2 kept repeating words like "the" in downstream tasks. I'm curious why this approach failed.


Answer (by padeoe):

GPT-2 is designed to predict the next token in a sequence based on the preceding tokens. For instance, given the phrase "I love", it might predict "you".
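For a concrete picture, here is a minimal sketch of that next-token prediction, assuming the Hugging Face transformers API (which the snippet in the question appears to use); the model name and prompt are illustrative:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("I love", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, sequence_length, vocab_size)

# The last position's logits score every possible continuation of "I love".
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode(next_token_id))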

1. Why input_tensor Is Used for Both the Input and the Labels

The confusion often arises from seeing input_tensor used for both the input and the labels. This works because of the causal masking mechanism built into GPT-2.

Unlike BERT, where specific tokens are masked and the model predicts them, GPT-2's masking is about controlling which tokens are visible during each prediction step. For a sequence like "I love music", the model predicts:

  1. "I" -> "love"
  2. "I love" -> "music"

This is achieved through an internal causal attention mask: the model never sees future tokens, so every prediction is a genuine next-token prediction based only on the preceding context. Using input_tensor for both input and labels therefore doesn't make the model a mere repeater; it trains the model to predict each subsequent token from its prior context (the labels are shifted by one position internally).
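To see that concretely, here is a hedged sketch (same assumed Hugging Face API) that reproduces the loss returned by labels=input_tensor by hand; the internal shift makes it a next-token loss, not a repetition loss:

import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("I love music", return_tensors="pt").input_ids

out = model(input_ids, labels=input_ids)     # the pattern from the question

# Reproduce the same loss by hand: the prediction made at position t is scored
# against the token at position t + 1, never against the token at t itself.
shift_logits = out.logits[:, :-1, :]         # predictions from positions 0..n-2
shift_labels = input_ids[:, 1:]              # the tokens that actually come next
manual_loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_labels.reshape(-1),
)
print(out.loss.item(), manual_loss.item())   # the two values should match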

2. Splitting Songs in Half

Manually splitting song lyrics isn't ideal. With labels set to the input, GPT-2 already receives a next-token prediction target at every position of the full sequence, so nothing is gained by carving out an "output" half. Splitting each song means only the second half is supervised, which discards training signal and can bias the model towards the latter parts of songs, potentially limiting its learning (see the sketch below).
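As a rough illustration (same assumed API), the half-splitting scheme amounts to masking out the first half's labels with -100, which the Hugging Face loss ignores, so only the second half is ever supervised; the example text is illustrative:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer("Snowflakes fall, my heart calls", return_tensors="pt").input_ids

# Supervise only the second half (what splitting a song into input/output amounts to).
labels = input_ids.clone()
half = input_ids.size(1) // 2
labels[:, :half] = -100                              # positions labelled -100 are excluded from the loss

half_loss = model(input_ids, labels=labels).loss     # trained on roughly half the tokens
full_loss = model(input_ids, labels=input_ids).loss  # trained on every position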

3. A Better Approach

Consider using Descriptive Prompts:

  • Structure your dataset with prompts describing the song's style or theme, followed by the corresponding lyrics. For instance:

    Input: "A melancholic ballad about lost love in winter."

    Output: "Snowflakes fall, my heart calls, for the love lost in winter's thrall..."

This approach can guide the model more effectively toward generating lyrics in the desired style.
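Here is a minimal sketch of how such prompt-and-lyric pairs could be packed into single training sequences, again assuming the Hugging Face API; the "Prompt:"/"Lyrics:" markers and the example text are illustrative choices, not a fixed convention:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token    # GPT-2 has no pad token by default

examples = [
    {
        "prompt": "A melancholic ballad about lost love in winter.",
        "lyrics": "Snowflakes fall, my heart calls, for the love lost in winter's thrall...",
    },
]

# Concatenate each description and its lyrics into one string, ending with EOS.
texts = [
    f"Prompt: {ex['prompt']}\nLyrics: {ex['lyrics']}{tokenizer.eos_token}"
    for ex in examples
]

# Each packed string is then tokenized and trained with labels = input_ids,
# exactly as in the training step quoted in the question.
batch = tokenizer(texts, return_tensors="pt", padding=True)

At generation time, you would feed only the "Prompt: ...\nLyrics:" prefix and let the fine-tuned model continue with lyrics in the requested style.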