I came across the xTuring library while reading this article on the continual training of an LLM. The article claims to have continued the training of a pre-trained model in an unsupervised way with xTuring via its TextDataset. I'm struggling to understand why TextDataset takes both a list of inputs and a list of targets: logically, it should just take one big block of text for further next-token (causal language modeling) training, I think? Since xTuring uses HuggingFace datasets and PyTorch Lightning under the hood, I'm wondering whether this design is imposed by the design patterns of HuggingFace datasets.
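For reference, here is a minimal sketch of what I assume the article's unsupervised setup looks like: chunk one big block of text and set target equal to text, so the paired interface reduces to plain next-token training on the corpus. The chunk_corpus helper, the 512-character chunk size, and corpus.txt are my own illustration, not part of the xTuring API.

from xturing.datasets.text_dataset import TextDataset

def chunk_corpus(corpus, chunk_size=512):
    # Naive fixed-size character chunks; a real pipeline would split on token boundaries.
    return [corpus[i:i + chunk_size] for i in range(0, len(corpus), chunk_size)]

with open("corpus.txt", encoding="utf-8") as f:
    chunks = chunk_corpus(f.read())

# target duplicates text, matching the pattern in the documentation example below.
dataset = TextDataset({"text": chunks, "target": chunks})

The documentation snippet I'm referring to: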
TextDataset
Here is how you can create this type of dataset:
From a Python dictionary with the following keys:
text: list of strings representing the input text.
target: list of strings representing the target text.
from xturing.datasets.text_dataset import TextDataset

# Both keys map to parallel lists; here the targets simply repeat the inputs.
dataset = TextDataset({
    "text": ["first text", "second text"],
    "target": ["first text", "second text"],
})
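For completeness, I assume a dataset built this way would then be passed to the fine-tuning entry points shown in the xTuring README; the "gpt2" model key is my guess at a small example model, and I haven't verified this exact pairing:

from xturing.models import BaseModel

# Continues from the dataset created above.
model = BaseModel.create("gpt2")  # model key is an assumption on my part
model.finetune(dataset=dataset)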