I have a dataset with 113287 train rows. However, each 'caption' field is an array of multiple strings. I would like to flat-map this array so that each string becomes its own row.
The documentation for datasets states that the batch mapping feature may be used to achieve this:
"This means you can concatenate your examples, divide it up, and even add more examples!"
from datasets import load_dataset

dataset_name = "Jotschi/coco-karpathy-opus-de"
coco_dataset = load_dataset(dataset_name)

def chunk_examples(entry):
    captions = [caption for caption in entry["caption"][0]]
    return {"caption": captions}

print(coco_dataset)
chunked_dataset = coco_dataset.map(chunk_examples, batched=True, num_proc=4,
                                   remove_columns=["image_id", "caption", "image"])
print(chunked_dataset)
print(len(chunked_dataset['train']))
DatasetDict({
    train: Dataset({
        features: ['caption', 'image_id', 'image'],
        num_rows: 113287
    })
    validation: Dataset({
        features: ['caption', 'image_id', 'image'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['caption', 'image_id', 'image'],
        num_rows: 5000
    })
})
DatasetDict({
    train: Dataset({
        features: ['caption'],
        num_rows: 464
    })
    validation: Dataset({
        features: ['caption'],
        num_rows: 40
    })
    test: Dataset({
        features: ['caption'],
        num_rows: 40
    })
})
464
The problem I'm having is that the resulting dataset does not contain the expected number of rows.
The train split now reports num_rows: 464, which I suspect corresponds to the batches rather than to the individual captions. How can I normalize this back into a "regular" dataset? Is there something wrong with my mapping function?
- datasets==2.18.0
My mapping function was incorrect: I was only accessing the first entry of each batch via [0] instead of iterating over all of them. With a corrected function that flattens every caption list in the batch, the map yields one new row per caption.
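A minimal sketch of the corrected batched mapping function, reusing the dataset and column names from the question above:

from datasets import load_dataset

coco_dataset = load_dataset("Jotschi/coco-karpathy-opus-de")

def chunk_examples(batch):
    # With batched=True, batch["caption"] is a list of caption lists,
    # one per input row. Flatten all of them instead of only the first.
    captions = [caption
                for caption_list in batch["caption"]
                for caption in caption_list]
    return {"caption": captions}

chunked_dataset = coco_dataset.map(chunk_examples, batched=True, num_proc=4,
                                   remove_columns=["image_id", "caption", "image"])
print(chunked_dataset)

Because the returned list is longer than the input batch, map adds the extra rows, so each split ends up with one row per caption instead of one row per image.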