I want to Hive-partition my dataset, but I don't quite know how to ensure the file counts in the splits are sane. I know I should roughly aim for files that are 128MB in size
How do I safely scale and control the row counts inside files of my Hive-partitioned dataset?
For this answer, I'll assume you have correctly understood the reasons why you should and should not do Hive-style partitioning and won't be covering the backing theory.
In this case, it's important to ensure we not only correctly calculate the number of files needed inside our splits but also repartition our dataset based on these calculations. Failure to do repartitions before write-out on Hive-style partition datasets may result in your job attempting to write out millions of tiny files which will kill your performance.
In our case, the strategy we will use will be to create files that are at most
Nrows per file, which will bound the size of each file. We can't easily limit the exact size of each file inside the splits, but we can use row counts as a good approximation.The methodology we will use to accomplish this will be to create a synthetic column that describes which 'batch' a row will belong to, repartition the final dataset on both the Hive split column and this synthetic column, and use this result on write.
In order to ensure our synthetic column indicates the proper batch a row belongs to, we need to determine the number of rows inside each hive split, and 'sprinkle' the rows inside this split into the proper number of files.
The strategy in total will look something like this:
Let's start by considering the following dataset:
Let's imagine the column we want to Hive split on is
col_1, and we want 5 rows per file per value ofcol_1.Now that we know what file index each row belongs in, we now need to repartition before write out.
Now, on write-out, make sure to ensure your write options includes
partition_cols=["col_1"], and voila!I'd highly recommend reading this post as well to ensure you understand exactly why the partitioning is necessary before write-out