Awk command to set variable name while matching a regular expression

I have a names.dmp file which contains taxonomy ids and scientific names among other details.

I want to fetch the scientific name of a particular tax-id, for which I am running this command:

cat names.dmp | grep "scientific name" | awk '$1~/^10090$/{print $0}' | cut -d "|" -f1,2

which gives me the output:

10090 | Mus musculus

But I need this to be dynamic, i.e., set a variable id=10090 and use that variable inside the regular expression. The match on "id" must be exact: there are entries such as 210090 and 100904 that also show up in the output, and I don't want them.

I am quite inexperienced when it comes to awk, so any help is appreciated.

EDIT:

Here is the example input:

10089   |       Mus formosanus Kuroda, 1925     |               |       authority       |
10089   |       Mus formosanus  |               |       synonym |
10089   |       ricefield mouse |               |       common name     |
10089   |       Ryukyu mouse    |               |       genbank common name     |
10090   |       house mouse     |               |       genbank common name     |
10090   |       LK3 transgenic mice     |               |       includes        |
10090   |       mouse   |       mouse <Mus musculus>    |       common name     |
10090   |       Mus musculus Linnaeus, 1758     |               |       authority       |
10090   |       Mus musculus    |               |       scientific name |
10090   |       Mus sp. 129SV   |               |       includes        |
10090   |       nude mice       |               |       includes        |
10090   |       transgenic mice |               |       includes        |
10091   |       Mus castaneus   |               |       synonym |
10091   |       Mus musculus castaneus  |               |       scientific name |
10091   |       Mus musculus castaneus Waterhouse, 1843 |               |       authority       |
10091   |       southeastern Asian house mouse  |               |       genbank common name     |
10092   |       Mus domesticus  |               |       synonym |
10092   |       Mus musculus domesticus Schwarz & Scharz 1943   |               |       authority       |
10092   |       Mus musculus domesticus |               |       scientific name |
10092   |       Mus musculus praetextus |               |       synonym |
100902  |       Fusarium oxysporum f. sp. conglutinans  |               |       scientific name |
100903  |       Fusarium oxysporum f. sp. fragariae     |               |       scientific name |
100905  |       Cloning vector pACN     |               |       scientific name |
100906  |       Nitrosomonas sp. ENI-11 |               |       scientific name |
100907  |       Chilean sea bass        |               |       common name     |

And the output I need is:

10090 | Mus musculus

There are 4 best solutions below

6
Renaud Pacalet On BEST ANSWER

When you use awk, you frequently don't need anything else:

$ awk -F'[[:space:]]*\\|[[:space:]]*' -v id="10090" '
  /scientific name/ && $1 == id {print $1 " | " $2}' file
10090 | Mus musculus
  1. -F'[[:space:]]*\\|[[:space:]]*': set the input field separator to a | with optional whitespace on either side.
  2. -v id="10090": declare the awk variable id and assign it 10090 (change this as needed).
  3. If the input record contains the string scientific name and the first field equals id, print the first two fields separated by " | ".
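
To make the id dynamic, as the question asks, the same command can be wrapped in a small shell function. This is only a sketch: sci_name and the temporary sample file are illustrative names that are not part of the answer, and the sample rows are a few lines copied from the question's input.

```shell
# Hypothetical sample file holding a few rows from the question's input.
sample=$(mktemp)
cat > "$sample" <<'EOF'
10090   |       house mouse     |               |       genbank common name     |
10090   |       Mus musculus    |               |       scientific name |
10091   |       Mus musculus castaneus  |               |       scientific name |
100902  |       Fusarium oxysporum f. sp. conglutinans  |               |       scientific name |
EOF

# sci_name ID FILE: print "ID | scientific name" for an exact tax-id match.
# (sci_name is an illustrative helper name, not something from the answer.)
sci_name() {
  awk -F'[[:space:]]*\\|[[:space:]]*' -v id="$1" '
    /scientific name/ && $1 == id { print $1 " | " $2 }' "$2"
}

sci_name 10090 "$sample"   # prints: 10090 | Mus musculus
sci_name 1009 "$sample"    # prints nothing: $1 == id is not a partial match
```

Because the comparison is $1 == id rather than a regular expression, ids such as 100902 are not picked up when you ask for 10090.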

As noted in the comments, this does not preserve the input field separators. If you want to preserve them, you can use the four-argument form of split in GNU awk, instead of the input field separator, to save the fields in one array and the separators in another:

$ awk -v id="10090" '/scientific name/ {
    split($0,f,/[[:space:]]*\|[[:space:]]*/,s)
    if(f[1] == id) print f[1] s[1] f[2]}' file
10090   |       Mus musculus

Finally, if your awk is not GNU awk but you want to preserve the field separators, you can use match and substr instead of split:

$ awk -F'[[:space:]]*\\|[[:space:]]*' -v id="10090" '
  /scientific name/ && $1==id {
    a=match($0,/\|/); b=match(substr($0,a+1),/[[:space:]]*\|/)
    print substr($0,1,a+b-1)}' file
10090   |       Mus musculus

We simply use match to find the index of the first | (a), then the offset, counted from just after that |, at which the whitespace before the second | begins (b), and print only what comes before that point (substr).
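
To see what a and b look like on a concrete record, here is a small sketch. It assumes the file uses a single tab on each side of every | (which is my understanding of the names.dmp layout, not something stated in the question); prefix_demo is a made-up helper name.

```shell
# One record, assumed here to be tab-pipe-tab separated.
line=$(printf '10090\t|\tMus musculus\t|\t\t|\tscientific name\t|')

# prefix_demo LINE: print a, b and the prefix that substr keeps.
# The prefix is wrapped in [] so that its end is visible.
prefix_demo() {
  printf '%s\n' "$1" | awk '{
    a = match($0, /\|/)                              # index of the first |
    # offset, within the rest of the line, where the
    # whitespace before the second | starts
    b = match(substr($0, a + 1), /[[:space:]]*\|/)
    printf "a=%d b=%d [%s]\n", a, b, substr($0, 1, a + b - 1)
  }'
}

prefix_demo "$line"   # a=7 b=14, then the first 20 characters in brackets
```

With this layout the first | is character 7, the whitespace before the second | starts 14 characters later, and a+b-1 = 20 is exactly the length of "10090", tab, "|", tab, "Mus musculus".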

2
jared_mamrot On

One option would be:

id=10090
awk -v id="$id" '/scientific name/ && $1 == id' names.dmp | cut -d "|" -f1,2

You can also preserve whitespace in awk (see e.g. the question "How to preserve the original whitespace between fields in awk?") and fold the cut command into your awk command, but as you describe yourself as 'inexperienced', this is probably the best solution.

0
Paolo On

A possible solution:

$ id=10090
$ awk -v id="$id" 'BEGIN{FS="| +";OFS="    |   "} /scientific name/ && $1==id {print $1,$3" "$4}' file
10090    |   Mus musculus

5
user1934428 On

While you can set awk variables from the outside, and this is usually the best solution, your specific case is so simple that interpolation by the shell works as well:

awk '$1~/^'$id'$/{print $0}'

Since you know that your id is always a string of digits, you don't even have to double-quote $id here.
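
If you would rather not rely on that assumption, a guard in front of the interpolation is cheap. The following is a sketch, not part of the command above: match_id is a made-up function name, and the digits-only check is my addition, since anything interpolated this way becomes part of the awk program text.

```shell
# match_id ID: filter stdin to lines whose first field is exactly ID.
# (match_id is an illustrative name; the digit check is an added safeguard.)
match_id() {
  case $1 in
    ''|*[!0-9]*) echo "match_id: id must be all digits" >&2; return 1 ;;
  esac
  # The shell splices the id between the two single-quoted program parts.
  awk '$1 ~ /^'"$1"'$/ { print $0 }'
}

printf '%s\n' '10090 | a' '210090 | b' '100904 | c' | match_id 10090
# prints: 10090 | a
```

The ^ and $ anchors are what keep 210090 and 100904 out of the result.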