The numbers are unique within each file and across all files. Here is one popular approach (external sorting) to merge n files into one sorted file when the machine cannot hold all the data in memory:
- Sort each smaller file
- Read the first element from each sorted file
- Build a min-heap from those elements
- While the heap is not empty:
  - pop the minimum element and write it to the final file
  - insert into the heap the next element from the smaller file whose element was just written (a sketch of this merge follows the list)
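For reference, a minimal Python sketch of that heap-based merge, assuming each sorted chunk file holds one integer per line; the function name and paths are illustrative, not part of the original setup:

```python
import heapq

def merge_sorted_files(sorted_paths, out_path):
    # k-way merge of already-sorted chunk files, one integer per line.
    files = [open(p) for p in sorted_paths]
    heap = []
    # Seed the heap with the first element of every file.
    for i, f in enumerate(files):
        line = f.readline()
        if line:
            heapq.heappush(heap, (int(line), i))
    with open(out_path, "w") as out:
        while heap:
            # Write the current minimum, then refill from the file it came from.
            value, i = heapq.heappop(heap)
            out.write(f"{value}\n")
            line = files[i].readline()
            if line:
                heapq.heappush(heap, (int(line), i))
    for f in files:
        f.close()
```

(The standard library's `heapq.merge` performs essentially the same k-way merge lazily, so it could replace the manual heap loop.)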
While I understand how this algorithm helps with memory usage, there is a lot of I/O overhead involved.
Another possible approach uses bucket sort, especially if the numbers are uniformly distributed:
- Create n buckets, where n = (max_num - min_num) / number_of_elements_in_1_bucket
- For each smaller file:
  - read the file
  - create a batch-update dictionary {bucket_number: [list of elements]}
  - append the data from this dictionary to the respective bucket files
  - the dictionary space is limited, so only 100 elements at a time can be processed per smaller file; the above two steps are repeated until the current file's elements are exhausted
- For each bucket file created:
  - sort the data in the bucket
- Append all bucket files together into a single sorted file (a sketch of this approach follows the list)
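Here is a rough Python sketch of that bucketing pass plus the final per-bucket sort, again assuming one integer per line; the batch size of 100 comes from the description above, while `bucket_range`, the paths, and the helper names are illustrative assumptions:

```python
import os

def bucket_index(x, min_num, bucket_range):
    # Map a value to its bucket; bucket_range is the width of each bucket's value range.
    return (x - min_num) // bucket_range

def flush(batch, bucket_dir):
    # Append each bucket's pending elements to its bucket file.
    for b, elems in batch.items():
        with open(os.path.join(bucket_dir, f"bucket_{b}.txt"), "a") as out:
            out.write("".join(f"{x}\n" for x in elems))

def partition_file(path, min_num, bucket_range, bucket_dir, batch_size=100):
    # Stream one small file into per-bucket files, flushing every batch_size elements.
    batch, count = {}, 0
    with open(path) as f:
        for line in f:
            x = int(line)
            batch.setdefault(bucket_index(x, min_num, bucket_range), []).append(x)
            count += 1
            if count == batch_size:      # dictionary space is limited
                flush(batch, bucket_dir)
                batch, count = {}, 0
    flush(batch, bucket_dir)             # flush whatever is left over

def sort_and_concatenate(bucket_dir, num_buckets, out_path):
    # Sort each bucket in memory and append it to the final file, lowest bucket first.
    with open(out_path, "w") as out:
        for b in range(num_buckets):
            p = os.path.join(bucket_dir, f"bucket_{b}.txt")
            if not os.path.exists(p):
                continue
            with open(p) as f:
                values = sorted(int(line) for line in f)
            out.write("".join(f"{x}\n" for x in values))
```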
Now, I believe the latter approach can be slightly faster, or at least equivalent in time complexity, but it uses more storage space for temporary files. Are these two approaches comparable in time complexity, or is the first one always better even with its repeated I/O calls? Am I missing some time-consuming step in the second approach that makes it less optimal?
My understanding is that it is the other way around. If the numbers have a predictable distribution over a predictable range, bucket sort is faster (in the limit, O(n) vs O(n log n) for comparison-based sorts). Better yet, the buckets can be sorted independently of each other, which makes them perfect for a distributed sort.
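For instance, because each bucket covers a disjoint value range, the bucket files can be sorted in parallel (or on separate machines) and simply concatenated in bucket order; a hypothetical `multiprocessing` sketch, with names chosen for illustration:

```python
from multiprocessing import Pool

def sort_one_bucket(path):
    # Each bucket covers a disjoint value range, so it can be sorted in isolation.
    with open(path) as f:
        values = sorted(int(line) for line in f)
    sorted_path = path + ".sorted"
    with open(sorted_path, "w") as out:
        out.write("".join(f"{x}\n" for x in values))
    return sorted_path

def sort_buckets_in_parallel(bucket_paths, workers=4):
    # Results come back in bucket order, so concatenating them yields the final sorted file.
    with Pool(workers) as pool:
        return pool.map(sort_one_bucket, bucket_paths)
```

(Under the spawn start method the call would need to sit behind an `if __name__ == "__main__":` guard.)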