Single zcat multiple extracts with IDs arrays


I have many multi-GB gzip archives that I cannot decompress for disk-space reasons. Each archive has a specific identification number (e.g. test365.gz) and a structure like this:

         1    1    2 1
##########                 Name:     ZINC000077407198
@<TRIPOS>MOLECULE
 ZINC000077407198      none
@<TRIPOS>ATOM
      1 C1          5.7064    -2.3998   -12.0246 C.3        1  LIG1  -0.1500
@<TRIPOS>BOND
     1    1    2 1
##########                 Name:     ZINC000099999999
@<TRIPOS>MOLECULE
 ZINC000099999999      none
@<TRIPOS>ATOM
      1 C1         -2.0084    -5.2055   -12.9609 C.3        1  LIG1  -0.1500
@<TRIPOS>BOND
     1    1    2 1
##########                 Name:     ZINC000077402345
@<TRIPOS>MOLECULE
 ZINC000077402345     none
@<TRIPOS>ATOM
      1 C1          6.5657    -1.5531   -15.3414 C.3        1  LIG1  -0.1500
@<TRIPOS>BOND
     1    1    2 1
##########                 Name:     ZINC000077407198
@<TRIPOS>MOLECULE
 ZINC000077407198      none
@<TRIPOS>ATOM
      1 C1          3.6696    -1.8305   -14.6766 C.3        1  LIG1  -0.1500
@<TRIPOS>BOND
     1    1    2 1
##########                 Name:     ZINC000012345678
@<TRIPOS>MOLECULE
 ZINC000012345678      none
@<TRIPOS>ATOM
      1 C1          4.5368    -0.8182   -17.4314 C.3        1  LIG1  -0.1500
@<TRIPOS>BOND
     1    1    2 1
##########                 Name:     ZINC000077407100
@<TRIPOS>MOLECULE
 ZINC000077407100      none
@<TRIPOS>ATOM
      1 C1          1.4756    -2.2562   -14.0852 C.3        1  LIG1  -0.1500
@<TRIPOS>BOND
     1    1    2 1
##########                 Name:     ZINC000077407198
@<TRIPOS>MOLECULE
 ZINC000077407198      none
@<TRIPOS>ATOM
      1 C1          6.1712    -0.8991   -16.4096 C.3        1  LIG1  -0.1500
@<TRIPOS>BOND
     1    1    2 1
##########                 Name:     ZINC000077407198
@<TRIPOS>MOLECULE
 ZINC000077407198      none
@<TRIPOS>ATOM

The number of lines between the ########## block delimiters is variable.

I have a list of ZINC identifiers together with their target archive:

test365/    ZINC000077407198
test227/    ZINC000009100000
test365/    ZINC000077407100
... 

Currently I do:

zcat test365.gz | sed -n '/##########                 Name:     ZINC000077407100/,/##########                 Name:/p' > ZINC000077407100.out

and I get:

##########                 Name:     ZINC000077407100
@<TRIPOS>MOLECULE
 ZINC000077407100      none
@<TRIPOS>ATOM
      1 C1          1.4756    -2.2562   -14.0852 C.3        1  LIG1  -0.1500
@<TRIPOS>BOND
     1    1    2 1
##########                 Name:     ZINC000077407198

This works fine. If there are N blocks for ZINC000077407100, I extract all N of them in one zcat, and I do not mind the trailing line starting with ##########.

The problem is that I need to read each archive once per identifier / ZINC_NUMBER I want the information for, and that takes a lot of time since I have thousands to extract.

So I would like to pass an array or list of identifiers / ZINC_NUMBER and, during a single zcat read, write the output to several different files according to the identifier in the array / list.

In other words, I would like to do a single zcat read and extract data for a whole set of identifiers, not just one.

Thanks for your help!

2 Answers

dash-o (Best Answer)

Per the OP, the requirement is to process a large volume of data (millions of rows, multiple GB of data) and retrieve data for hundreds of items. This is technically possible with modern bash, but it is unlikely to perform well; a dedicated scripting engine would do much better here.

A possible bash/awk solution is presented below. It scans each referenced archive once and extracts all the selected tags in a single pass. Note that the tag list itself is re-read once per archive, but its size is presumably reasonable.

#! /bin/bash -uex
TAGS=data.txt

# Unique list of archives referenced in the tag file (first column).
file_list=$(awk '{ print $1 }' < "$TAGS" | sort -u)

for f in $file_list; do
    gz_name=${f%/}.gz    # "test365/" -> "test365.gz"
    zcat "$gz_name" | awk -v F="$f" '
        # First input (the tag file): remember the IDs wanted for this archive.
        !DATA && $1 == F { tags[$2] = 1 }
        # Second input (the archive): on each block header, switch the
        # output file; an empty OUT means the block is not selected.
        DATA && $1 == "##########" && $2 == "Name:" {
            OUT = tags[$3] ? $3 ".out" : ""
        }
        OUT { print > OUT }
    ' "$TAGS" DATA=1 -
done
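The two-input awk trick above can be seen in isolation with toy data (the file names tags.txt and archive.txt and the IDs AAA/BBB/CCC are made up for this sketch). The first input is the tag list; DATA=1 between the file arguments then flips awk from "learn the wanted IDs" mode into "extract blocks" mode for the second input:

```shell
#!/bin/sh
# Toy tag list and decompressed "archive" (illustrative names only).
printf '%s\n' 'test1/ AAA' 'test1/ CCC' > tags.txt
printf '%s\n' \
  '##########                 Name:     AAA' 'atom line a' \
  '##########                 Name:     BBB' 'atom line b' \
  '##########                 Name:     CCC' 'atom line c' > archive.txt

awk -v F='test1/' '
  !DATA && $1 == F { tags[$2] = 1 }            # pass 1: remember wanted IDs
  DATA && $1 == "##########" && $2 == "Name:" {
      OUT = ($3 in tags) ? $3 ".out" : ""      # switch output file per block
  }
  OUT { print > OUT }
' tags.txt DATA=1 - < archive.txt
```

After running this, AAA.out and CCC.out each hold their block, while no BBB.out is created, mirroring what the full script does per archive.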

Needless to say, it is possible to write the above five-line awk job in Python, Perl, JavaScript, or your favorite text-processing tool. Tested with the sample data set.

Socowi

It seems each entry starting with ########## always has six lines after the header. In that case, it would be far easier and more efficient to use grep -A7 than sed -n '/##.../,/##.../p'. I suppose you only printed the subsequent header because it was easier that way (at least with sed). Therefore, I excluded the subsequent header in this answer (grep -A6 instead of grep -A7).

grep can be given a list of patterns to search for via the -f option. The list of patterns can be generated from your file: first group by the archive name (e.g. test365), then print all the patterns for that archive. Here we use awk to do so. A null byte separates the pattern section of each archive.

To prevent false positives (and maybe speed up the search a bit) we only search for complete lines instead of substrings. To speed things up we set LC_ALL=C. You may also find that zgrep is faster than zcat | grep.
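As a toy illustration of the flags used below (-F for fixed strings, -x for whole-line matches, -f for patterns read from a file, -A for trailing context lines), with made-up file names:

```shell
#!/bin/sh
# Two two-line records; we want only the one whose header is "HDR A".
printf '%s\n' 'HDR A' 'body a' 'HDR B' 'body b' > records.txt
printf '%s\n' 'HDR A' > patterns.txt

# -Fx matches the pattern against whole lines literally;
# -A1 also prints the one line following each match.
LC_ALL=C grep -A1 -Fxf patterns.txt records.txt
```

This prints the matching header plus its following line, and nothing from the HDR B record.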

The following script decompresses each archive at most once.

awk -v prefix='##########                 Name:     ' '
  {a[$1]=a[$1] "\n" prefix $2}
  END {for (k in a) print k a[k] "\0"}
' /path/to/your/list.txt |
while IFS=$'\n' read -r -d '' archive patterns; do
  LC_ALL=C zgrep -A6 -Fxf <(printf %s "$patterns") "${archive/\//.gz}"
  # TODO do something with the output for this archive
done

In the above script I converted test365/ from your list to test365.gz automatically. I don't know your directory structure. If you need something different, adapt the last argument of zgrep. $archive iterates over the first column of your (grouped) list (that is, each archive is listed only once).
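The filename conversion is a bash pattern substitution: "${archive/\//.gz}" replaces the first "/" in the value with ".gz". A minimal sketch:

```shell
#!/bin/bash
# "${var/pattern/replacement}" substitutes the first match of pattern.
archive='test365/'
echo "${archive/\//.gz}"   # test365.gz
```

If your archives live in another directory, adapt this expression (or build the path explicitly) instead.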

From your example code it seems like you want to generate an individual file for each pattern. To do so, replace the loop body above with:

zgrep ... > /tmp/zincfound
while IFS= read -r pattern; do
    grep -A6 -Fx "$pattern" /tmp/zincfound > "${pattern##* }.out" 
done <<< "$patterns"
rm /tmp/zincfound
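The per-pattern output name "${pattern##* }.out" comes from stripping the longest prefix ending in a space, which leaves just the ZINC id. A minimal sketch:

```shell
#!/bin/sh
# "${var##pattern}" removes the longest matching prefix ("* " here,
# i.e. everything up to and including the last space).
pattern='##########                 Name:     ZINC000077407100'
echo "${pattern##* }.out"   # ZINC000077407100.out
```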