Reading partitioned parquet files with Apache Beam and Python SDK

Asked by Stefano Castoldi on 26 March 2024 at 15:47

I have Parquet files partitioned by iso_week, and I need to read all of the data into a single PCollection with Apache Beam and the Python SDK.

Partitioned Parquet Files Structure

data_to_read/
├─ iso_week=2023-W40/
│  ├─ 12343435.parquet
├─ iso_week=2023-W41/
│  ├─ 1231243254.parquet

I tried to use the glob pattern * as suggested in the documentation:

pipeline | "ReadData" >> beam.io.ReadFromParquet("data_to_read/*")

But I get an error saying that the path doesn't contain any Parquet files.

Is there a way to read partitioned parquet files in Apache Beam?
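
One thing that may be worth trying: in the layout above, data_to_read/* most likely matches only the iso_week=... directories, not the files inside them, which would explain the error. A pattern that goes one level deeper should reach the files themselves. The sketch below is untested against this dataset; partition_glob, preview_matches and read_partitioned are hypothetical helper names, and it assumes exactly one partition level on a local filesystem.

```python
import glob


def partition_glob(root: str) -> str:
    """Build a pattern matching Parquet files one directory below root.

    Assumes a single partition level such as root/iso_week=.../file.parquet.
    """
    return f"{root.rstrip('/')}/*/*.parquet"


def preview_matches(root: str) -> list:
    """List the local files the pattern would match (standard library only)."""
    return sorted(glob.glob(partition_glob(root)))


def read_partitioned(pipeline, root: str):
    """Apply ReadFromParquet with the deeper pattern; records arrive as dicts."""
    # Imported here so the two helpers above can be used without Beam installed.
    import apache_beam as beam

    return pipeline | "ReadData" >> beam.io.ReadFromParquet(partition_glob(root))
```

Inside a pipeline this would be called as read_partitioned(pipeline, "data_to_read"), and preview_matches("data_to_read") can be used first to check that the pattern finds the files. Note that the iso_week value lives in the directory name, so unless it is also stored as a column inside the files it probably will not appear in the records.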
