Merging two files and expanding metadata efficiently

Asked by binf-er on 09 November 2023 at 22:38

I'm working with a large file in which each row holds a CHR and a POS value (positional coordinates).

I process this file with a tool, but its output contains only a subset of these positional coordinates, along with additional metadata for every sample.

My goal is to expand the processed file so that it includes every positional coordinate from the original file, filling in the metadata from the closest position in the processed file. The closest position is the one whose CHR matches and whose POS is nearest to the original POS.

Things to note:

Both files will be sorted numerically.
If an original position is equally close to two positions in the processed file, the output should use the later position's metadata. For example, CHR:1, POS:600 is equally close to POS:500 and POS:700 in the processed file, so the metadata comes from the later one (POS:700).
If the processed file does not include the last position of the original file, it should output the metadata from the last position of the processed file.
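
To make the rules above concrete, here is a minimal sketch of the selection logic for one original position. The function name `choose_position` and its arguments are hypothetical: `before` stands for the nearest processed POS below the original one on the same CHR, `after` for the nearest processed POS at or above it, and either can be `None` when no such row exists.

```python
from typing import Optional


def choose_position(pos: int, before: Optional[int], after: Optional[int]) -> int:
    """Pick the processed POS whose metadata should be copied to `pos`.

    `before` is the closest processed POS below `pos`, `after` the closest
    processed POS at or above it (both on the same CHR). Hypothetical helper.
    """
    if before is None and after is None:
        raise ValueError("no processed position available on this chromosome")
    if after is None:
        # Past the last processed position: fall back to the last one.
        return before
    if before is None:
        # Ahead of the first processed position: only `after` is available.
        return after
    # Strict "<" so that an exact tie falls through to the later position.
    return before if pos - before < after - pos else after
```

With the example data, `choose_position(600, 500, 700)` gives 700 (the tie goes to the later position) and `choose_position(200, 100, 400)` gives 100.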

Original File:

CHR    POS
1      100
1      200
1      300
1      400
1      500
1      600
1      700
1      800
1      900
1      1000

Processed File:

CHR    POS    sample1    sample2    sample3    sample4    sample5    sample6
1      100    0          1          2          1          0          2
1      400    0          1          2          1          1          2
1      500    2          0          1          0          2          1
1      700    0          1          2          1          0          2
1      1000   0          1          2          1          2          2

Intended Output File:

CHR    POS    sample1    sample2    sample3    sample4    sample5    sample6
1      100    0          1          2          1          0          2
1      200    0          1          2          1          0          2
1      300    0          1          2          1          1          2
1      400    0          1          2          1          1          2
1      500    2          0          1          0          2          1
1      600    0          1          2          1          0          2
1      700    0          1          2          1          0          2
1      800    0          1          2          1          0          2
1      900    0          1          2          1          2          2
1      1000   0          1          2          1          2          2

Since I have more than a million rows and thousands of samples, I would like a memory-efficient way to do this.
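
One possible direction, given that both files are sorted, is a single streaming pass that keeps only two processed rows in memory at any time. The sketch below is an illustration under assumptions, not a tested solution for real data: the names `read_rows` and `expand` are hypothetical, the files are assumed to be whitespace-delimited with one header line each, chromosomes are assumed to appear in the same order in both files, and every chromosome in the original file is assumed to have at least one row in the processed file.

```python
from typing import Iterator, List, Optional, TextIO, Tuple

Row = Tuple[str, int, List[str]]  # (CHR, POS, remaining columns)


def read_rows(handle: TextIO) -> Iterator[Row]:
    """Lazily yield parsed rows, skipping blank lines."""
    for line in handle:
        fields = line.split()
        if fields:
            yield fields[0], int(fields[1]), fields[2:]


def expand(original: TextIO, processed: TextIO, out: TextIO) -> None:
    """Write one tab-separated row per original position, using the nearest processed row."""
    out.write(next(processed))  # reuse the processed file's header line
    next(original)              # skip the original file's header line
    rows = read_rows(processed)
    prev: Optional[Row] = None           # last processed row already passed
    nxt: Optional[Row] = next(rows, None)  # first processed row not yet passed
    seen = set()                         # chromosomes reached in the original file
    for chrom, pos, _ in read_rows(original):
        seen.add(chrom)
        # Advance past rows on finished chromosomes and rows below `pos`.
        # A chromosome not yet in `seen` lies ahead, so we stop there.
        while nxt is not None and nxt[0] in seen and (nxt[0] != chrom or nxt[1] < pos):
            prev, nxt = nxt, next(rows, None)
        before = prev if prev is not None and prev[0] == chrom else None
        after = nxt if nxt is not None and nxt[0] == chrom else None
        if before is None and after is None:
            raise ValueError(f"no processed rows for chromosome {chrom}")
        # Strict "<" sends ties to the later position.
        if after is None or (before is not None and pos - before[1] < after[1] - pos):
            best = before
        else:
            best = after
        out.write("\t".join([chrom, str(pos), *best[2]]) + "\n")
```

It could be called with three open file handles, for example `expand(open("original.txt"), open("processed.txt"), open("expanded.txt", "w"))`, where the file names are placeholders. Memory use stays roughly constant in the number of rows, since each input line is read once and discarded after it is written or passed.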
