I have a large file in which each row has CHR and POS values (positional coordinates).
I process this file with a tool, but its output contains only a subset of these positions, along with additional metadata for every sample.
My goal is to expand the processed file so that it includes every position from the original file, filling in the metadata from the closest position in the processed file. The closest position is the one whose CHR matches and whose POS is nearest to the POS in the original file.
Things to note:
- Both files will be sorted numerically.
- If an original position is equally close to two positions in the processed file, the output should use the later position's metadata. For example, CHR:1, POS:600 is equally close to POS:500 and POS:700 in the processed file, so the metadata is taken from the later one (POS:700).
- If the processed file does not include the last position of the original file, the output should use the metadata from the last position of the processed file.
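
To make the rules above concrete, here is a minimal Python sketch of the nearest-position rule for a single chromosome. It assumes the processed positions fit in a sorted list, and `nearest_index` is a hypothetical helper name used only for illustration.

```python
from bisect import bisect_left

def nearest_index(processed_pos, pos):
    """Return the index of the processed position closest to pos.

    processed_pos must be sorted ascending. Ties go to the later position.
    """
    i = bisect_left(processed_pos, pos)   # first processed position >= pos
    if i == 0:
        return 0                          # nothing earlier: use the first one
    if i == len(processed_pos):
        return i - 1                      # nothing later: use the last one
    before, after = processed_pos[i - 1], processed_pos[i]
    # "<=" implements the tie rule: equal distance picks the later position
    return i if after - pos <= pos - before else i - 1

processed_pos = [100, 400, 500, 700, 1000]
result = [processed_pos[nearest_index(processed_pos, p)]
          for p in range(100, 1001, 100)]
print(result)
```

With the sample positions from this question, 600 maps to 700 (the tie case) and 800 maps to 700.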
Original File:
CHR POS
1 100
1 200
1 300
1 400
1 500
1 600
1 700
1 800
1 900
1 1000
Processed File:
CHR POS sample1 sample2 sample3 sample4 sample5 sample6
1 100 0 1 2 1 0 2
1 400 0 1 2 1 1 2
1 500 2 0 1 0 2 1
1 700 0 1 2 1 0 2
1 1000 0 1 2 1 2 2
Intended Output File:
CHR POS sample1 sample2 sample3 sample4 sample5 sample6
1 100 0 1 2 1 0 2
1 200 0 1 2 1 0 2
1 300 0 1 2 1 1 2
1 400 0 1 2 1 1 2
1 500 2 0 1 0 2 1
1 600 0 1 2 1 0 2
1 700 0 1 2 1 0 2
1 800 0 1 2 1 0 2
1 900 0 1 2 1 2 2
1 1000 0 1 2 1 2 2
Since I have more than a million rows and thousands of samples, I would like a memory-efficient way to do this.
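
One approach that might keep memory low is a single streaming pass over both files, holding only two processed rows at a time. The sketch below is written under several assumptions that are not confirmed above: CHR values are integers, both files are sorted by CHR and then POS, both are whitespace-delimited with one header line, and every chromosome in the original file has at least one row in the processed file. The names `expand` and `processed_rows` are hypothetical.

```python
import io

def processed_rows(lines):
    """Yield (chr, pos, metadata_text) for each processed data row."""
    for line in lines:
        if line.strip():
            chrom, pos, meta = line.strip().split(None, 2)
            yield int(chrom), int(pos), meta

def expand(original, processed, out):
    """Stream both sorted inputs and write one output row per original row."""
    out.write(next(processed))            # reuse the processed header
    next(original)                        # skip the original header
    rows = processed_rows(processed)
    prev, nxt = None, next(rows, None)    # only two processed rows in memory
    for line in original:
        if not line.strip():
            continue
        chrom, pos = (int(x) for x in line.split()[:2])
        # Advance until nxt is the first processed row at or after (chrom, pos)
        while nxt is not None and (nxt[0], nxt[1]) < (chrom, pos):
            prev, nxt = nxt, next(rows, None)
        # Only rows on the same chromosome are candidates
        before = prev if prev is not None and prev[0] == chrom else None
        after = nxt if nxt is not None and nxt[0] == chrom else None
        if before is None and after is None:
            raise ValueError(f"no processed rows for CHR {chrom}")
        if before is None:
            best = after
        elif after is None:
            best = before                 # past the last processed position
        else:
            # "<=" sends ties to the later position
            best = after if after[1] - pos <= pos - before[1] else before
        out.write(f"{chrom} {pos} {best[2]}\n")

# Demo on part of the sample data from this question
ORIGINAL = "CHR POS\n1 500\n1 600\n1 700\n1 800\n"
PROCESSED = "CHR POS sample1 sample2\n1 500 2 0\n1 700 0 1\n"
out = io.StringIO()
expand(io.StringIO(ORIGINAL), io.StringIO(PROCESSED), out)
print(out.getvalue())
```

For real data, the same function should accept open file handles in place of the `io.StringIO` objects, so neither file is ever loaded in full.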