- We have 2 files: `data.txt` and `keys.txt`. `data.txt` is some proper Unicode text with `N` lines. `keys.txt` is a list of newline-separated integers, `N` lines.
- Output a file `sorted.txt` where the lines in `data.txt` are sorted according to `keys.txt`, without writing an intermediate file such as the output of `paste -d',' keys.txt data.txt`.

I need to use this for large files (hundreds of GB) on machines with 16-32 GB of memory.
My first attempt was to do it in Python, which is a bit slow. It's simple enough, so we discussed doing it in C++. But I'd prefer if it uses readily available tools so there's no installation needed. This could well be impossible to do efficiently with GNU or Unix tools, but I don't know enough there to make a claim.
You should be able to do this without buffering to a file. For performance, I guess calibrating `sort --buffer-size` would be the first move, and perhaps using `parallel` to sort in chunks the second.
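To make the idea concrete, here is a minimal sketch of the streaming approach, assuming GNU coreutils. The function name `sort_by_keys` and the default buffer size of `8G` are placeholders of mine, not tested recommendations. The pasted `key,line` stream goes straight into `sort` through a pipe, so the pasted file is never written to disk. Note that GNU `sort` still writes its own temporary files once the input outgrows its buffer, which is unavoidable when the data is larger than memory.

```shell
#!/usr/bin/env bash
# Sketch: order the lines of a data file by the integer keys in a parallel key file.
# sort_by_keys is a hypothetical helper name; the 8G default is a guess to tune.
sort_by_keys() {
  local keys=$1 data=$2 out=$3 buf=${4:-8G}
  # paste streams "key,line" pairs into sort; no pasted file is written to disk.
  # -t, -k1,1n : compare only the first comma-separated field, numerically
  # -s         : stable sort, so equal keys keep their original relative order
  # -S         : in-memory buffer; beyond it GNU sort spills to its own temp files
  # LC_ALL=C   : byte-wise handling, so the text passes through unmodified
  # cut -f2-   : drop the key; commas inside the original line are preserved
  paste -d',' "$keys" "$data" \
    | LC_ALL=C sort -t, -k1,1n -s -S "$buf" \
    | cut -d, -f2- > "$out"
}
```

It would be called as `sort_by_keys keys.txt data.txt sorted.txt 12G`. GNU `sort` also has `--parallel=N` to set the number of sort threads and `-T DIR` to put its temporary files on a fast disk with enough free space; both are probably worth trying before splitting the input into chunks by hand.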