# Formatting
```python
block_size = 128  # or any number suitable to your context

def group_texts(examples):
    # Concatenate all 'input_ids' in the batch into one long list
    concatenated_examples = sum(examples["input_ids"], [])
    total_length = len(concatenated_examples)
    # Split the concatenation into consecutive sequences of fixed length
    sequences = [
        concatenated_examples[i : i + block_size]
        for i in range(0, total_length, block_size)
    ]
    result = {
        "input_ids": sequences,
        # Shift the labels by one token for causal language modeling
        # (assumes `tokenizer` was defined earlier in the script)
        "labels": [sequence[1:] + [tokenizer.eos_token_id] for sequence in sequences],
    }
    return result

tokenized_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    batch_size=1000,  # or any number suitable to your context
)
```
I don't understand what `block_size` and `batch_size` refer to here. What do they do?
`batch_size` determines how many examples `Dataset.map` passes to `group_texts` in each call when `batched=True`. In your code, `batch_size=1000` means up to 1000 instances are handed to the function at a time; it does not by itself run anything in parallel (that is controlled separately by `num_proc`).
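As a quick illustration, here is a minimal sketch (using a made-up toy dataset and a tiny `batch_size`) of how batches are handed to the mapped function:

```python
from datasets import Dataset

# Toy dataset of 5 already-tokenized rows (token ids are made up for illustration)
ds = Dataset.from_dict({"input_ids": [[1, 2], [3, 4, 5], [6], [7, 8], [9, 10, 11]]})

def show_batch(examples):
    # With batched=True, examples["input_ids"] is a list of up to `batch_size` rows
    print("rows in this call:", len(examples["input_ids"]))
    return examples

# batch_size=2 -> show_batch is called with 2 rows at a time
# (3 calls in total here: 2 + 2 + 1 rows)
ds.map(show_batch, batched=True, batch_size=2)
```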
`block_size` determines the fixed length of each output sequence. The `concatenated_examples` list is split into consecutive, non-overlapping chunks of length `block_size` (the last chunk may be shorter if the total length is not an exact multiple of `block_size`).
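To make the chunking concrete, here is a small sketch with made-up token ids and `block_size=4`:

```python
block_size = 4  # tiny value just for illustration

# Pretend the batch concatenates to these 10 token ids
concatenated_examples = [101, 7, 8, 9, 10, 11, 12, 13, 14, 102]
total_length = len(concatenated_examples)

sequences = [
    concatenated_examples[i : i + block_size]
    for i in range(0, total_length, block_size)
]
print(sequences)
# [[101, 7, 8, 9], [10, 11, 12, 13], [14, 102]]  <- consecutive chunks, last one shorter
```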
The number of instances does not have to be divisible by `batch_size`; the final batch is simply smaller. Since you only have 328 instances, `batch_size=1000` means they are all processed in a single call, and you could just as well use a smaller value such as `batch_size=8`.