Using bash to query a large tab delimited file


I have a list of names and IDs (50 entries)

cat input.txt

name    ID
Mike    2000
Mike    20003
Mike    20002

And there is a huge zipped file (13GB)

zcat clients.gz

name    ID  comment
Mike    2000    foo
Mike    20002   bar
Josh    2000    cake
Josh    20002   _

My expected output is

NR  name    ID  comment
1    Mike   2000    foo
3    Mike   20002   bar

Each $1"\t"$2 pair in clients.gz is a unique identifier. Some entries from input.txt might be missing from clients.gz, so I would like to add the NR column to my output to find out which ones are missing. I would like to use zgrep; awk takes a very long time (because it has to read the entire stream that zcat decompresses, I assume?)

I know that zgrep 'Mike\t2000' does not work (grep does not interpret \t as a tab). The NR issue I can fix with awk's FNR, I imagine.
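For what it's worth, zgrep can match a tab if the pattern contains a literal one; a minimal sketch using a throwaway demo.gz (a hypothetical stand-in for the real clients.gz):

```shell
# Build a tiny gzipped stand-in (hypothetical data, not the real clients.gz)
printf 'Mike\t2000\tfoo\nMike\t20002\tbar\n' | gzip > demo.gz

# In a basic regex '\t' is just the letter t, so this matches nothing:
zgrep 'Mike\t2000' demo.gz || echo "no match"

# bash's $'...' quoting converts \t into a real tab before zgrep sees it:
zgrep $'Mike\t2000\t' demo.gz
```

With GNU grep, `zgrep -P 'Mike\t2000'` (Perl-compatible regexes) also interprets `\t`. Note the trailing tab in the pattern above, which stops `2000` from also matching `20002`.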

So far I have:

awk -v q="'" '
NR > 1 {
    print "zcat clients.gz | zgrep -w $" q $0 q
}' input.txt |
bash > subset.txt

There are 3 answers below.

Accepted answer, by Ed Morton:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ key = $1 FS $2 }
NR == FNR { map[key] = (NR>1 ? NR-1 : "NR"); next }
key in map { print map[key], $0 }

$ zcat clients.gz | awk -f tst.awk input.txt -
NR      name    ID      comment
1       Mike    2000    foo
3       Mike    20002   bar
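Since the NR column exists only to reveal which input.txt entries have no match, the same two-file awk pattern can list the missing keys directly; a sketch along the same lines, shown here with hypothetical demo files recreating the example data:

```shell
# Hypothetical stand-ins for the real input.txt and clients.gz
printf 'name\tID\nMike\t2000\nMike\t20003\nMike\t20002\n' > input.txt
printf 'name\tID\tcomment\nMike\t2000\tfoo\nMike\t20002\tbar\n' | gzip > clients.gz

# Pass 1 (stdin, "-"): remember every name+ID key seen in clients.gz.
# Pass 2 (input.txt): print the data lines whose key was never seen.
zcat clients.gz |
awk 'BEGIN { FS=OFS="\t" }
     NR == FNR { seen[$1 FS $2]; next }
     FNR > 1 && !(($1 FS $2) in seen)' - input.txt
```

With the example data this prints the single missing entry, `Mike<TAB>20003`.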
Answer by Cyrus:

With GNU awk and bash:

awk 'BEGIN{FS=OFS="\t"} 
     # process input.txt
     NR==FNR{
       a[$1,$2]=$1 FS $2
       line[$1,$2]=NR-1
       next
     }
     # process <(zcat clients.gz)
     {
       $4=a[$1,$2]
       if(FNR==1)
         line[$1,$2]="NR"
       if($4!="")
         print line[$1,$2],$1,$2,$3
     }' input.txt <(zcat clients.gz)

Output:

NR      name    ID      comment
1       Mike    2000    foo
3       Mike    20002   bar

As one line:

awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[$1,$2]=$1 FS $2; line[$1,$2]=NR-1; next} {$4=a[$1,$2]; if(FNR==1) line[$1,$2]="NR"; if($4!="")print line[$1,$2],$1,$2,$3}' input.txt <(zcat clients.gz)

See: Joining two files based on two key columns awk and 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
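One subtlety in the `a[$1,$2]` subscripts above: the comma joins the fields with awk's `SUBSEP` character (by default `\034`), not with `FS`, which is why the answer stores the printable `$1 FS $2` as a value rather than reusing the key itself. A minimal illustration:

```shell
# A comma inside an array subscript concatenates the parts with SUBSEP
awk 'BEGIN {
    a["Mike","2000"]                        # creates the key "Mike" SUBSEP "2000"
    for (k in a) print (k == "Mike" SUBSEP "2000")   # prints 1 (true)
}'
```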

Answer by tshiono:

[EDIT]
I had misunderstood where the prepended line numbers come from; corrected.

Would you try the following:

declare -A num          # associates each pattern with its line number
mapfile -t ary < <(tail -n +2 input.txt)
pat=$(IFS='|'; echo "${ary[*]}")
for ((i=0; i<${#ary[@]}; i++)); do num[${ary[i]}]=$((i+1)); done
printf "%s\t%s\t%s\t%s\n" "NR" "name" "ID" "comment"
zgrep -E -w "$pat" clients.gz | while IFS= read -r line; do
    printf "%d\t%s\n" "${num[$(cut -f 1-2 <<<"$line")]}" "$line"
done

Output:

NR  name    ID  comment
1   Mike    2000    foo
3   Mike    20002   bar
  • The second and third lines generate the search pattern Mike\t2000|Mike\t20003|Mike\t20002 (tab-separated fields) from input.txt.
  • The for ((i=0; i<${#ary[@]}; i++)); do .. loop maps each pattern to its line number.
  • The expression "${num[$(cut -f 1-2 <<<"$line")]}" looks up the line number using the 1st and 2nd fields of each matched line.
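As an aside, the subshell in `pat=$(IFS='|'; echo "${ary[*]}")` is what lets IFS be changed just for the join: `"${ary[*]}"` concatenates the elements with the first character of IFS, and the caller's IFS is untouched afterwards. A small illustration with the example values (tabs written with bash's `$'\t'` quoting):

```shell
# Hypothetical array, as mapfile would build it from input.txt
ary=($'Mike\t2000' $'Mike\t20003' $'Mike\t20002')

# Join with '|' inside a subshell so the outer IFS is not modified
pat=$(IFS='|'; echo "${ary[*]}")
printf '%s\n' "$pat"
```

This prints the alternation `Mike<TAB>2000|Mike<TAB>20003|Mike<TAB>20002` that is then handed to `zgrep -E -w`.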

If the performance is still not satisfactory, please consider ripgrep, which is much faster than grep or zgrep and can search gzipped files directly via rg -z.