I want to control the size of the files written to HDFS from a Spark Dataframe in a Java application; the file format is ORC.
My datasets vary greatly in size:
- 1st: 200 partitions, each 30MB (ORC)
- 2nd: 200 partitions, each 0.6MB (ORC)
- 3rd: 200 partitions, each 0.2MB (ORC)
I need to ensure that the minimum file size written is 120MB; if the whole dataset is smaller than that, it should be written as a single partition. Roughly, the rule I want is sketched below.
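To make the requirement concrete, this is the mapping from total dataset size to file count that I am after (a sketch only; desiredNumFiles and TARGET_FILE_BYTES are names I made up for illustration):

static final long TARGET_FILE_BYTES = 120L * 1024 * 1024;

static int desiredNumFiles(long totalSizeBytes) {
    // ceiling division: at most ~120MB per file, but never fewer than one file
    return (int) Math.max(1L, (totalSizeBytes + TARGET_FILE_BYTES - 1) / TARGET_FILE_BYTES);
}

Plugging in my datasets: 1st: 200 x 30MB = 6000MB -> 50 files, 2nd: 200 x 0.6MB = 120MB -> 1 file, 3rd: 200 x 0.2MB = 40MB -> 1 file.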
I tried following approach:
dataset.repartition(calcNumPartitions(dataset)).write().mode("overwrite").orc(path);
where:
static int calcNumPartitions(Dataset<Row> dataset) {
    // Catalyst's estimate of the dataset size in bytes, taken from the optimized plan's statistics
    scala.math.BigInt datasetSizeBytes = dataset.queryExecution().optimizedPlan().stats().sizeInBytes();
    // target roughly 120MB per output partition
    return (int) Math.ceil(datasetSizeBytes.longValue() / (120.0 * 1024 * 1024));
}
For the 1st dataset this gave me 23 partitions, with files of ~190MB each.
What is a better solution for this problem?
I also tried to control the file size from the Dataset write method itself; roughly what I mean by that is sketched below.
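A minimal sketch of the write-side control I am referring to (the only writer option I know of here is maxRecordsPerFile, which limits rows per file rather than bytes, so the value below is just a placeholder and does not directly give me a 120MB target):

dataset.write()
    .mode("overwrite")
    .option("maxRecordsPerFile", 1_000_000) // caps the number of rows per file, not the byte size
    .orc(path);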