I want to control the size of the files written to HDFS from a Spark Dataframe in a Java application; the file format is ORC.
My datasets vary greatly in size:
- 1st: 200 partitions, each 30MB (ORC)
- 2nd: 200 partitions, each 0.6MB (ORC)
- 3rd: 200 partitions, each 0.2MB (ORC)
I need to ensure that the minimum file size written is 120MB; if the whole dataset is smaller than that, it should be written as a single partition. Roughly, the rule I want is sketched below.
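To make the requirement concrete, this is the mapping from total dataset size to file count that I am after (a sketch only; desiredNumFiles and TARGET_FILE_BYTES are names I made up for illustration):

static final long TARGET_FILE_BYTES = 120L * 1024 * 1024;

static int desiredNumFiles(long totalSizeBytes) {
    // ceiling division: at most ~120MB per file, but never fewer than one file
    return (int) Math.max(1L, (totalSizeBytes + TARGET_FILE_BYTES - 1) / TARGET_FILE_BYTES);
}

Plugging in my datasets: 1st: 200 x 30MB = 6000MB -> 50 files, 2nd: 200 x 0.6MB = 120MB -> 1 file, 3rd: 200 x 0.2MB = 40MB -> 1 file.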
I tried following approach:
dataset.repartition(calcNumPartitions(dataset)).write().mode("overwrite").orc(path);
where:
static int calcNumPartitions(Dataset<Row> dataset) {
    // Catalyst's estimate of the dataset size in bytes, taken from the optimized plan's statistics
    scala.math.BigInt datasetSizeBytes = dataset.queryExecution().optimizedPlan().stats().sizeInBytes();
    // target roughly 120MB per output partition
    return (int) Math.ceil(datasetSizeBytes.longValue() / (120.0 * 1024 * 1024));
}
For the 1st dataset this gave me 23 partitions, with files of ~190MB each.
What is a better solution for this problem?
I also tried to control the file size from the Dataset write method itself; roughly what I mean by that is sketched below.
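A minimal sketch of the write-side control I am referring to (the only writer option I know of here is maxRecordsPerFile, which limits rows per file rather than bytes, so the value below is just a placeholder and does not directly give me a 120MB target):

dataset.write()
    .mode("overwrite")
    .option("maxRecordsPerFile", 1_000_000) // caps the number of rows per file, not the byte size
    .orc(path);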