# Formatting
```python
block_size = 128  # or any number suitable to your context

def group_texts(examples):
    # Concatenate all 'input_ids' in the batch into one long list
    concatenated_examples = sum(examples["input_ids"], [])
    total_length = len(concatenated_examples)
    # Split the concatenation into consecutive sequences of fixed length
    sequences = [
        concatenated_examples[i : i + block_size]
        for i in range(0, total_length, block_size)
    ]
    result = {
        "input_ids": sequences,
        # Shift the labels by one token for causal language modeling
        # (assumes `tokenizer` was defined earlier in the script)
        "labels": [sequence[1:] + [tokenizer.eos_token_id] for sequence in sequences],
    }
    return result

tokenized_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    batch_size=1000,  # or any number suitable to your context
)
```
I don't understand what `block_size` and `batch_size` refer to here. What do they do?
`batch_size` determines how many examples `Dataset.map` passes to `group_texts` in each call when `batched=True`. In your code, `batch_size=1000` means up to 1000 instances are handed to the function at a time; it does not by itself run anything in parallel (that is controlled separately by `num_proc`).
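As a quick illustration, here is a minimal sketch (using a made-up toy dataset and a tiny `batch_size`) of how batches are handed to the mapped function:

```python
from datasets import Dataset

# Toy dataset of 5 already-tokenized rows (token ids are made up for illustration)
ds = Dataset.from_dict({"input_ids": [[1, 2], [3, 4, 5], [6], [7, 8], [9, 10, 11]]})

def show_batch(examples):
    # With batched=True, examples["input_ids"] is a list of up to `batch_size` rows
    print("rows in this call:", len(examples["input_ids"]))
    return examples

# batch_size=2 -> show_batch is called with 2 rows at a time
# (3 calls in total here: 2 + 2 + 1 rows)
ds.map(show_batch, batched=True, batch_size=2)
```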
`block_size` determines the fixed length of each output sequence. The `concatenated_examples` list is split into consecutive, non-overlapping chunks of length `block_size` (the last chunk may be shorter if the total length is not an exact multiple of `block_size`).
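To make the chunking concrete, here is a small sketch with made-up token ids and `block_size=4`:

```python
block_size = 4  # tiny value just for illustration

# Pretend the batch concatenates to these 10 token ids
concatenated_examples = [101, 7, 8, 9, 10, 11, 12, 13, 14, 102]
total_length = len(concatenated_examples)

sequences = [
    concatenated_examples[i : i + block_size]
    for i in range(0, total_length, block_size)
]
print(sequences)
# [[101, 7, 8, 9], [10, 11, 12, 13], [14, 102]]  <- consecutive chunks, last one shorter
```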
The number of instances does not have to be divisible by `batch_size`; the final batch is simply smaller. Since you only have 328 instances, `batch_size=1000` means they are all processed in a single call, and you could just as well use a smaller value such as `batch_size=8`.