We have a 200 GB file of text lines separated by "\n". Our server runs Linux and has gcc, 8 GB of RAM, and unlimited hard disk space. The requirement is to implement in C++ the most efficient way (from our point of view) to sort the lines of this file lexicographically.
I have implemented external sorting of text lines lexicographically, and it works correctly for input files up to 2 GB. But when I test with a file larger than 2 GB, the output is incorrect. The program will be tested with files larger than 200 GB.
Below is my code, which worked well for files smaller than 2 GB. The program takes 3 command-line arguments: the input file name, the output file name, and the memory limit in bytes (it will be tested with different memory limits, not only 8 GB). The program should correctly handle simple boundary cases, e.g. small files, empty files, and files much bigger than 200 GB. It should work with non-ASCII characters, but we can assume that no zero byte is present in the file. We can also assume that the memory limit is much larger than the length of the longest line.
Here's the code. Please help me point out why it cannot produce correct results for files bigger than 2 GB. I would appreciate a modified version of my code below that makes it work in all cases.
#include <stdio.h>
#include <stdlib.h>
#include <fstream>
#include <sstream>
#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
#include <iterator>
//using namespace std;
#define FILESIZE 200000000000 // size of the file on disk
#define TOTAL_MEM 7000000000 // max items the memory buffer can hold
typedef std::vector<std::string> DataContainer;
// Function to sort lines of text according to their length
/*bool compare(std::string &a, std::string &b)
{
    if (a.size() < b.size())
        return true;
    else if (a.size() > b.size())
        return false;
    else
        return a < b;
    //return strcasecmp( a.c_str(), b.c_str() ) <= 0;
}
long estimateSizeOfBlocks(long sizeoffile, int maxtmpfiles, long maxMemory)
{
}
*/
std::string ToString(int val) {
    std::stringstream stream;
    stream << val;
    return stream.str();
}
void ExternalSort(std::string infilepath, std::string outfilepath)
{
    std::string line, new_line;
    std::string tfile = "temp-file-";
    int i, j, k;
    DataContainer v;
    std::ifstream infile;
    if(!infile.is_open())
    {
        std::cout << "Unable to open file\n";
    }
    infile.open(infilepath, std::ifstream::in);
    // Check if input file is empty
    if (infile.peek() == std::ifstream::traits_type::eof())
    {
        std::cout << "File is empty";
    }
    //std::vector<std::string> x(std::istream_iterator<std::string>(infile), {});
    // Get size of the file
    infile.seekg (0, infile.end);
    int input_file_size = infile.tellg();
    infile.seekg (0, infile.beg);
    std::cout << "The size of the file chosen is (in bytes): " << input_file_size << "\n";
    // Estimate required number of runs to read all data in the input file
    int runs_count;
    if(input_file_size % TOTAL_MEM > 0)
        runs_count = input_file_size/TOTAL_MEM + 1; // if file size is fixed at 200GB, we just need to define FILESIZE: runs_count = FILESIZE/TOTAL_MEM
    else
        runs_count = input_file_size/TOTAL_MEM;
    // Iterate through the elements in the file
    for(i = 0; i < runs_count; i++)
    {
        // Step 1: Read M-element chunk at a time from the file
        for (j = 0; j < (TOTAL_MEM < input_file_size ? TOTAL_MEM : input_file_size); j++)
        {
            while(std::getline(infile, line))
            {
                // If line is empty, ignore it
                if(line.empty())
                    continue;
                new_line = line + "\n";
                // Line contains string of length > 0 then save it in vector
                if(new_line.size() > 0)
                    v.push_back(new_line);
            }
        }
        // Step 2: Sort M elements
        sort(v.begin(), v.end()); //sort(v.begin(), v.end(), compare);
        // Step 3: Create temporary files and write sorted data into those files.
        std::ofstream tf;
        tf.open(tfile + ToString(i) + ".txt", std::ofstream::out | std::ofstream::app);
        std::ostream_iterator<std::string> output_iterator(tf, "\n");
        std::copy(v.begin(), v.end(), output_iterator);
        /* for(vector<std::string>::const_iterator it = v.begin(); it != v.end(); ++it)
            tf << *it << "\n";
        */
        tf.close();
    }
    infile.close();
    // Step 4: Now open each file and merge them, then write back to disk
    DataContainer vn;
    for(i = 0; i < runs_count; i++)
    {
        std::ifstream sortedFiles[runs_count];
        std::ostringstream filename;
        filename << tfile << i << ".txt";
        sortedFiles[i].open(filename.str(), std::ifstream::in);
        //sortedFiles[i].open(tfile + ToString(i) + ".txt", std::ifstream::in);
        std::string st, new_st;
        while(std::getline(sortedFiles[i], st))
        {
            // If line is empty, ignore it
            if(st.empty())
                continue;
            new_st = st + "\n";
            if(new_st.size() > 0)
                vn.push_back(new_st);
        }
        sortedFiles[i].close();
    }
    // Step 5: Merge total sorted data
    sort(vn.begin(), vn.end());
    std::ofstream outfile;
    outfile.open(outfilepath, std::ofstream::out | std::ofstream::app);
    std::ostream_iterator<std::string> output_iterator(outfile, "\n");
    std::copy(vn.begin(), vn.end(), output_iterator);
    outfile.close(); // Close final sorted file
}
int main(int argc, char** argv) // e.g. argv[1] = "E:\\data.txt" argv[2] = "E:\\sorted_data.txt"
{
    std::cout << "You have entered " << argc << " arguments:\n";
    for (int i = 0; i < argc; ++i)
        std::cout << argv[i] << "\n";
    ExternalSort(argv[1], argv[2]);
    return 0;
}
See the edit below for a full implementation of external sort - no warranty.
Assuming you have two lexicographically ascending sorted files (each number is a line):

    1 3 5 7
    10 2 4 6 8

the expected sorted result is:

    1 10 2 3 4 5 6 7 8

If you simply concatenate the files, as it seems you did, you will get:

    1 3 5 7 10 2 4 6 8

which is not what you expected. You should instead read from both files and, at each step, write the lexicographically-smallest of the two current lines. This is easiest to see with files containing the same number of lines and alternating values (for instance, 1 3 5 and 2 4 6). If the number of lines differs, once you reach the end of one file, write the remainder of the other file.
Once you have an algorithm that merges two files, run it on pairs of files until you end up with only one file. Assuming you have sorted files A, B, C & D:

    Merge A & B into AB; merge C & D into CD.
    Delete A, B, C & D.
    Merge AB & CD into ABCD.
    Delete AB & CD.
Here’s a two-way merger written in a hurry – no warranty:
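A two-way merger along those lines can be sketched as follows (a minimal version; the name `mergeTwoFiles` is my own, and it assumes "\n"-separated lines with no zero bytes, as stated in the question):

```cpp
#include <fstream>
#include <string>

// Merge two lexicographically sorted text files into one output file.
// Only the current line of each input is held in memory, so the inputs
// can be arbitrarily large.
void mergeTwoFiles(const std::string& inA, const std::string& inB,
                   const std::string& outPath)
{
    std::ifstream a(inA), b(inB);
    std::ofstream out(outPath);
    std::string la, lb;
    bool hasA = static_cast<bool>(std::getline(a, la));
    bool hasB = static_cast<bool>(std::getline(b, lb));
    while (hasA && hasB) {
        if (la <= lb) {            // write the smaller line, advance that file
            out << la << '\n';
            hasA = static_cast<bool>(std::getline(a, la));
        } else {
            out << lb << '\n';
            hasB = static_cast<bool>(std::getline(b, lb));
        }
    }
    while (hasA) {                 // drain whichever file still has lines
        out << la << '\n';
        hasA = static_cast<bool>(std::getline(a, la));
    }
    while (hasB) {
        out << lb << '\n';
        hasB = static_cast<bool>(std::getline(b, lb));
    }
}
```

Because only two line buffers are alive at any time, memory use is bounded by the two longest lines, regardless of file size.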
EDIT:
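A full external sort following the run-then-pairwise-merge scheme above can be sketched like this (all names – `externalSort`, `makeSortedRuns`, `mergeTwo`, and the `run-*.txt` / `merge-*.txt` temporaries – are my own choices, not from the original post; requires C++11):

```cpp
#include <algorithm>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Merge two lexicographically sorted files into outPath, line by line.
static void mergeTwo(const std::string& aPath, const std::string& bPath,
                     const std::string& outPath)
{
    std::ifstream a(aPath), b(bPath);
    std::ofstream out(outPath);
    std::string la, lb;
    bool ha = static_cast<bool>(std::getline(a, la));
    bool hb = static_cast<bool>(std::getline(b, lb));
    while (ha || hb) {
        if (!hb || (ha && la <= lb)) {
            out << la << '\n';
            ha = static_cast<bool>(std::getline(a, la));
        } else {
            out << lb << '\n';
            hb = static_cast<bool>(std::getline(b, lb));
        }
    }
}

// Pass 1: read at most memLimit bytes of lines at a time, sort the chunk in
// RAM, and write it out as a sorted "run" file. Streaming with getline means
// the total file size is never needed at all.
static std::vector<std::string> makeSortedRuns(const std::string& inPath,
                                               std::size_t memLimit)
{
    std::ifstream in(inPath);
    std::vector<std::string> runs, buf;
    std::size_t used = 0;          // bytes buffered so far (size_t, not int)
    std::string line;
    auto flush = [&] {
        if (buf.empty()) return;
        std::sort(buf.begin(), buf.end());
        std::string name = "run-" + std::to_string(runs.size()) + ".txt";
        std::ofstream out(name);
        for (const std::string& s : buf) out << s << '\n';
        runs.push_back(name);
        buf.clear();
        used = 0;
    };
    while (std::getline(in, line)) {
        used += line.size() + 1;
        buf.push_back(std::move(line));
        if (used >= memLimit) flush();
    }
    flush();
    return runs;
}

// Pass 2: merge runs pairwise (A+B -> AB, C+D -> CD, ...) until one file
// remains; returns its name. Consumed intermediate files are deleted.
std::string externalSort(const std::string& inPath, std::size_t memLimit)
{
    std::vector<std::string> runs = makeSortedRuns(inPath, memLimit);
    if (runs.empty()) {                        // empty input -> empty output
        std::ofstream("run-0.txt");
        runs.push_back("run-0.txt");
    }
    for (int pass = 0; runs.size() > 1; ++pass) {
        std::vector<std::string> next;
        for (std::size_t i = 0; i + 1 < runs.size(); i += 2) {
            std::string name = "merge-" + std::to_string(pass) + "-"
                             + std::to_string(i / 2) + ".txt";
            mergeTwo(runs[i], runs[i + 1], name);
            std::remove(runs[i].c_str());
            std::remove(runs[i + 1].c_str());
            next.push_back(name);
        }
        if (runs.size() % 2 != 0)              // odd file carries to next pass
            next.push_back(runs.back());
        runs = std::move(next);
    }
    return runs.front();
}
```

Two remarks on this sketch. First, pairwise merging rewrites all data roughly log2(number of runs) times; a production version would merge k runs at once with a `std::priority_queue` to reduce the number of passes. Second, sizes here are tracked in `std::size_t` and the file size is never queried; the posted code's `int input_file_size = infile.tellg();` truncates for files over 2 GB, which is consistent with the failures the question describes.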