I'm new to Apache Arrow and I'm currently using it to store data without specifying any custom partitions. There are usually no more than 10 columns in my data. It's a pandas dataframe that's datetime indexed, with an additional string identifier column; the rest of the columns are floats.
If I were to use the filter() function on this dataset, where I exclusively filter on the string and datetime columns when loading data, should I be specifying those as partition columns?
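For reference, the kind of loading I have in mind is roughly the following (the path and column names are just placeholders for my actual data):

```python
from datetime import datetime

import pyarrow.dataset as ds

# Open the stored dataset and filter on the string id and the datetime column
# while loading, so only the matching rows are materialised.
dataset = ds.dataset("data/my_table", format="parquet")
table = dataset.to_table(
    filter=(ds.field("id") == "some_id")
    & (ds.field("timestamp") >= datetime(2021, 1, 1))
)
df = table.to_pandas()
```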
Additionally, since one of the intended index columns contains datetime objects in Python, do I need to do anything special to it for partitioning purposes?
Also, when would it be appropriate to let Arrow figure out how to partition the dataset on its own? Some of my datasets are rather small (a few MB), and perhaps in those cases it's better to leave them as a single file and let Arrow handle things?
Partitioning depends on your data and use case, but there are some general guidelines in the documentation: https://arrow.apache.org/docs/cpp/dataset.html#partitioning-performance-considerations
Anything under a few hundred MB probably doesn't need to be partitioned in most cases, and you specifically want to avoid a large number of small files due to the increased overhead.
So depending on the number of distinct datetime values, you probably want to partition by month or day (assuming you need to partition at all / that it makes sense for the analysis you want to do). If your string identifier only has a few distinct values, partitioning on it would be fine; if it's made up of random IDs, not so much.
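As a minimal sketch of what partitioning by a coarse date column could look like (the frame, path, and column names here are made up; adapt them to your data), you would derive low-cardinality columns from the datetime index and partition on those rather than on the raw timestamps, which would otherwise create one directory per unique value:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in for your datetime-indexed frame with a string id column
idx = pd.date_range("2021-01-01", periods=100, freq="D")
df = pd.DataFrame({"id": ["a", "b"] * 50, "value": range(100)}, index=idx)

# Derive coarse partition columns from the index
df = df.assign(year=df.index.year, month=df.index.month)

table = pa.Table.from_pandas(df)

# Writes a hive-style tree like data/my_table/year=2021/month=3/<file>.parquet
pq.write_to_dataset(table, root_path="data/my_table", partition_cols=["year", "month"])
```

Filters on the partition columns can then prune whole directories at load time, while filters on other columns still work but have to scan the files.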