I'm new to Apache Arrow and I'm currently using it to store data without specifying any custom partitions. There are usually no more than 10 columns in my data. It's a pandas dataframe that's datetime indexed, with an additional string identifier column; the rest of the columns are floats.
If I were to use the filter() function on this dataset, where I exclusively filter on the string and datetime columns when loading data, should I be specifying those as partition columns?
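For reference, the kind of loading I have in mind is roughly the following (the path and column names are just placeholders for my actual data):

```python
from datetime import datetime

import pyarrow.dataset as ds

# Open the stored dataset and filter on the string id and the datetime column
# while loading, so only the matching rows are materialised.
dataset = ds.dataset("data/my_table", format="parquet")
table = dataset.to_table(
    filter=(ds.field("id") == "some_id")
    & (ds.field("timestamp") >= datetime(2021, 1, 1))
)
df = table.to_pandas()
```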
Additionally, since one of the intended index columns contains datetime objects in Python, do I need to do anything special to it for partitioning purposes?
Also, when would it be appropriate to let Arrow figure out how to partition the dataset on its own? Some of my datasets are rather small (a few MB), and perhaps in those cases it's better to leave them as a single file and let Arrow handle things?
Partitioning depends on your data and use case, but there are some general guidelines in the documentation: https://arrow.apache.org/docs/cpp/dataset.html#partitioning-performance-considerations
Anything under a few hundred MB probably doesn't need to be partitioned in most cases, and you specifically want to avoid a large number of small files due to the increased overhead.
So depending on the number of distinct datetime values, you probably want to partition by month or day (assuming you need to partition at all / that it makes sense for the analysis you want to do). If your string identifier only has a few distinct values, partitioning on it would be fine; if it's made up of random IDs, not so much.
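As a minimal sketch of what partitioning by a coarse date column could look like (the frame, path, and column names here are made up; adapt them to your data), you would derive low-cardinality columns from the datetime index and partition on those rather than on the raw timestamps, which would otherwise create one directory per unique value:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in for your datetime-indexed frame with a string id column
idx = pd.date_range("2021-01-01", periods=100, freq="D")
df = pd.DataFrame({"id": ["a", "b"] * 50, "value": range(100)}, index=idx)

# Derive coarse partition columns from the index
df = df.assign(year=df.index.year, month=df.index.month)

table = pa.Table.from_pandas(df)

# Writes a hive-style tree like data/my_table/year=2021/month=3/<file>.parquet
pq.write_to_dataset(table, root_path="data/my_table", partition_cols=["year", "month"])
```

Filters on the partition columns can then prune whole directories at load time, while filters on other columns still work but have to scan the files.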