Using bash to query a large tab delimited file


I have a list of names and IDs (50 entries)

cat input.txt

name    ID
Mike    2000
Mike    20003
Mike    20002

And there is a huge zipped file (13GB)

zcat clients.gz

name    ID  comment
Mike    2000    foo
Mike    20002   bar
Josh    2000    cake
Josh    20002   _

My expected output is

NR  name    ID  comment
1    Mike   2000    foo
3    Mike   20002   bar

Each $1"\t"$2 pair in clients.gz is a unique identifier. Some entries from input.txt might be missing from clients.gz, so I would like to add the NR column to my output to find out which ones are missing. I would like to use zgrep; awk takes a very long time (because it has to read the entire stream that zcat decompresses, I assume?)

I know that zgrep 'Mike\t2000' does not work (grep does not interpret \t as a tab). The NR issue I can fix with awk's FNR, I imagine.
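For what it's worth, zgrep can match a tab if the pattern contains a literal one; a minimal sketch using a throwaway demo.gz (a hypothetical stand-in for the real clients.gz):

```shell
# Build a tiny gzipped stand-in (hypothetical data, not the real clients.gz)
printf 'Mike\t2000\tfoo\nMike\t20002\tbar\n' | gzip > demo.gz

# In a basic regex '\t' is just the letter t, so this matches nothing:
zgrep 'Mike\t2000' demo.gz || echo "no match"

# bash's $'...' quoting converts \t into a real tab before zgrep sees it:
zgrep $'Mike\t2000\t' demo.gz
```

With GNU grep, `zgrep -P 'Mike\t2000'` (Perl-compatible regexes) also interprets `\t`. Note the trailing tab in the pattern above, which stops `2000` from also matching `20002`.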

So far I have:

awk -v q="'" '
NR > 1 {
    print "zcat clients.gz | zgrep -w $" q $0 q
}' input.txt |
bash > subset.txt

There are 3 answers below.

Accepted answer, by Ed Morton:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ key = $1 FS $2 }
NR == FNR { map[key] = (NR>1 ? NR-1 : "NR"); next }
key in map { print map[key], $0 }

$ zcat clients.gz | awk -f tst.awk input.txt -
NR      name    ID      comment
1       Mike    2000    foo
3       Mike    20002   bar
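Since the NR column exists only to reveal which input.txt entries have no match, the same two-file awk pattern can list the missing keys directly; a sketch along the same lines, shown here with hypothetical demo files recreating the example data:

```shell
# Hypothetical stand-ins for the real input.txt and clients.gz
printf 'name\tID\nMike\t2000\nMike\t20003\nMike\t20002\n' > input.txt
printf 'name\tID\tcomment\nMike\t2000\tfoo\nMike\t20002\tbar\n' | gzip > clients.gz

# Pass 1 (stdin, "-"): remember every name+ID key seen in clients.gz.
# Pass 2 (input.txt): print the data lines whose key was never seen.
zcat clients.gz |
awk 'BEGIN { FS=OFS="\t" }
     NR == FNR { seen[$1 FS $2]; next }
     FNR > 1 && !(($1 FS $2) in seen)' - input.txt
```

With the example data this prints the single missing entry, `Mike<TAB>20003`.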
Answer by Cyrus:

With GNU awk and bash:

awk 'BEGIN{FS=OFS="\t"} 
     # process input.txt
     NR==FNR{
       a[$1,$2]=$1 FS $2
       line[$1,$2]=NR-1
       next
     }
     # process <(zcat clients.gz)
     {
       $4=a[$1,$2]
       if(FNR==1)
         line[$1,$2]="NR"
       if($4!="")
         print line[$1,$2],$1,$2,$3
     }' input.txt <(zcat clients.gz)

Output:

NR      name    ID      comment
1       Mike    2000    foo
3       Mike    20002   bar

As one line:

awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[$1,$2]=$1 FS $2; line[$1,$2]=NR-1; next} {$4=a[$1,$2]; if(FNR==1) line[$1,$2]="NR"; if($4!="")print line[$1,$2],$1,$2,$3}' input.txt <(zcat clients.gz)

See: Joining two files based on two key columns awk and 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
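One subtlety in the `a[$1,$2]` subscripts above: the comma joins the fields with awk's `SUBSEP` character (by default `\034`), not with `FS`, which is why the answer stores the printable `$1 FS $2` as a value rather than reusing the key itself. A minimal illustration:

```shell
# A comma inside an array subscript concatenates the parts with SUBSEP
awk 'BEGIN {
    a["Mike","2000"]                        # creates the key "Mike" SUBSEP "2000"
    for (k in a) print (k == "Mike" SUBSEP "2000")   # prints 1 (true)
}'
```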

Answer by tshiono:

[EDIT]
I had misunderstood where the prepended line numbers come from; corrected.

Would you try the following:

declare -A num          # associates each pattern with its line number
mapfile -t ary < <(tail -n +2 input.txt)
pat=$(IFS='|'; echo "${ary[*]}")
for ((i=0; i<${#ary[@]}; i++)); do num[${ary[i]}]=$((i+1)); done
printf "%s\t%s\t%s\t%s\n" "NR" "name" "ID" "comment"
zgrep -E -w "$pat" clients.gz | while IFS= read -r line; do
    printf "%d\t%s\n" "${num[$(cut -f 1-2 <<<"$line")]}" "$line"
done

Output:

NR  name    ID  comment
1   Mike    2000    foo
3   Mike    20002   bar
  • The second and third lines generate the search pattern Mike\t2000|Mike\t20003|Mike\t20002 (tab-separated fields) from input.txt.
  • The for ((i=0; i<${#ary[@]}; i++)); do .. loop maps each pattern to its line number.
  • The expression "${num[$(cut -f 1-2 <<<"$line")]}" looks up the line number using the 1st and 2nd fields of each matched line.
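As an aside, the subshell in `pat=$(IFS='|'; echo "${ary[*]}")` is what lets IFS be changed just for the join: `"${ary[*]}"` concatenates the elements with the first character of IFS, and the caller's IFS is untouched afterwards. A small illustration with the example values (tabs written with bash's `$'\t'` quoting):

```shell
# Hypothetical array, as mapfile would build it from input.txt
ary=($'Mike\t2000' $'Mike\t20003' $'Mike\t20002')

# Join with '|' inside a subshell so the outer IFS is not modified
pat=$(IFS='|'; echo "${ary[*]}")
printf '%s\n' "$pat"
```

This prints the alternation `Mike<TAB>2000|Mike<TAB>20003|Mike<TAB>20002` that is then handed to `zgrep -E -w`.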

If the performance is still not satisfactory, please consider ripgrep, which is much faster than grep or zgrep and can search gzipped files directly via rg -z.