The numbers are unique within each file and across all files. Here is one popular approach (external sorting) to merge n files into one sorted file when the machine cannot hold all the data in memory:
- Sort each smaller file
- Read the first element from each sorted file
- Build a min-heap from those elements
- While the heap is not empty:
  - pop the minimum element and write it to the final file
  - insert into the heap the next element from the smaller file whose element was just written (a sketch of this merge follows the list)
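For reference, a minimal Python sketch of that heap-based merge, assuming each sorted chunk file holds one integer per line; the function name and paths are illustrative, not part of the original setup:

```python
import heapq

def merge_sorted_files(sorted_paths, out_path):
    # k-way merge of already-sorted chunk files, one integer per line.
    files = [open(p) for p in sorted_paths]
    heap = []
    # Seed the heap with the first element of every file.
    for i, f in enumerate(files):
        line = f.readline()
        if line:
            heapq.heappush(heap, (int(line), i))
    with open(out_path, "w") as out:
        while heap:
            # Write the current minimum, then refill from the file it came from.
            value, i = heapq.heappop(heap)
            out.write(f"{value}\n")
            line = files[i].readline()
            if line:
                heapq.heappush(heap, (int(line), i))
    for f in files:
        f.close()
```

(The standard library's `heapq.merge` performs essentially the same k-way merge lazily, so it could replace the manual heap loop.)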
While I understand how this algorithm helps with memory usage, there is a lot of I/O overhead involved.
Another possible approach uses bucket sort, especially if the numbers are uniformly distributed:
- Create n buckets, where n = (max_num - min_num) / number_of_elements_in_1_bucket
- For each smaller file:
  - read the file
  - create a batch-update dictionary {bucket_number: [list of elements]}
  - append the data from this dictionary to the respective bucket files
  - the dictionary space is limited, so only 100 elements at a time can be processed per smaller file; the above two steps are repeated until the current file's elements are exhausted
- For each bucket file created:
  - sort the data in the bucket
- Append all bucket files together into a single sorted file (a sketch of this approach follows the list)
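Here is a rough Python sketch of that bucketing pass plus the final per-bucket sort, again assuming one integer per line; the batch size of 100 comes from the description above, while `bucket_range`, the paths, and the helper names are illustrative assumptions:

```python
import os

def bucket_index(x, min_num, bucket_range):
    # Map a value to its bucket; bucket_range is the width of each bucket's value range.
    return (x - min_num) // bucket_range

def flush(batch, bucket_dir):
    # Append each bucket's pending elements to its bucket file.
    for b, elems in batch.items():
        with open(os.path.join(bucket_dir, f"bucket_{b}.txt"), "a") as out:
            out.write("".join(f"{x}\n" for x in elems))

def partition_file(path, min_num, bucket_range, bucket_dir, batch_size=100):
    # Stream one small file into per-bucket files, flushing every batch_size elements.
    batch, count = {}, 0
    with open(path) as f:
        for line in f:
            x = int(line)
            batch.setdefault(bucket_index(x, min_num, bucket_range), []).append(x)
            count += 1
            if count == batch_size:      # dictionary space is limited
                flush(batch, bucket_dir)
                batch, count = {}, 0
    flush(batch, bucket_dir)             # flush whatever is left over

def sort_and_concatenate(bucket_dir, num_buckets, out_path):
    # Sort each bucket in memory and append it to the final file, lowest bucket first.
    with open(out_path, "w") as out:
        for b in range(num_buckets):
            p = os.path.join(bucket_dir, f"bucket_{b}.txt")
            if not os.path.exists(p):
                continue
            with open(p) as f:
                values = sorted(int(line) for line in f)
            out.write("".join(f"{x}\n" for x in values))
```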
Now, I believe the latter approach can be slightly faster, or at least equivalent in time complexity, but it uses more storage space for temporary files. Are these two approaches comparable in time complexity, or is the first one always better even with its repeated I/O calls? Am I missing some time-consuming step in the second approach that makes it less optimal?
My understanding is that it is the other way around. If the numbers have a predictable distribution over a predictable range, bucket sort is faster (in the limit, O(n) vs O(n log n) for comparison-based sorts). Better yet, the buckets can be sorted independently of each other, which makes them perfect for a distributed sort.
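For instance, because each bucket covers a disjoint value range, the bucket files can be sorted in parallel (or on separate machines) and simply concatenated in bucket order; a hypothetical `multiprocessing` sketch, with names chosen for illustration:

```python
from multiprocessing import Pool

def sort_one_bucket(path):
    # Each bucket covers a disjoint value range, so it can be sorted in isolation.
    with open(path) as f:
        values = sorted(int(line) for line in f)
    sorted_path = path + ".sorted"
    with open(sorted_path, "w") as out:
        out.write("".join(f"{x}\n" for x in values))
    return sorted_path

def sort_buckets_in_parallel(bucket_paths, workers=4):
    # Results come back in bucket order, so concatenating them yields the final sorted file.
    with Pool(workers) as pool:
        return pool.map(sort_one_bucket, bucket_paths)
```

(Under the spawn start method the call would need to sit behind an `if __name__ == "__main__":` guard.)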