Awk command to set variable name while matching a regular expression

Question

Asked by chan-98 on 26 July 2023 at 11:32

I have a names.dmp file which contains taxonomy ids and scientific names among other details.

I want to fetch the scientific name of a particular tax-id, for which I am running this command:

cat names.dmp | grep "scientific name" | awk '$1~/^10090$/{print $0}' | cut -d "|" -f1,2

which gives me the output:

10090 | Mus musculus

But I need this to be dynamic, i.e., set a variable id=10090 and use this variable inside the regular expression. I need an exact match on the value of "id", because there are entries such as 210090 and 100904 that I currently get in the output and do not want.

I am quite inexperienced when it comes to awk, so any help is appreciated.

EDIT:

Here is the example input:

10089   |       Mus formosanus Kuroda, 1925     |               |       authority       |
10089   |       Mus formosanus  |               |       synonym |
10089   |       ricefield mouse |               |       common name     |
10089   |       Ryukyu mouse    |               |       genbank common name     |
10090   |       house mouse     |               |       genbank common name     |
10090   |       LK3 transgenic mice     |               |       includes        |
10090   |       mouse   |       mouse <Mus musculus>    |       common name     |
10090   |       Mus musculus Linnaeus, 1758     |               |       authority       |
10090   |       Mus musculus    |               |       scientific name |
10090   |       Mus sp. 129SV   |               |       includes        |
10090   |       nude mice       |               |       includes        |
10090   |       transgenic mice |               |       includes        |
10091   |       Mus castaneus   |               |       synonym |
10091   |       Mus musculus castaneus  |               |       scientific name |
10091   |       Mus musculus castaneus Waterhouse, 1843 |               |       authority       |
10091   |       southeastern Asian house mouse  |               |       genbank common name     |
10092   |       Mus domesticus  |               |       synonym |
10092   |       Mus musculus domesticus Schwarz & Scharz 1943   |               |       authority       |
10092   |       Mus musculus domesticus |               |       scientific name |
10092   |       Mus musculus praetextus |               |       synonym |
100902  |       Fusarium oxysporum f. sp. conglutinans  |               |       scientific name |
100903  |       Fusarium oxysporum f. sp. fragariae     |               |       scientific name |
100905  |       Cloning vector pACN     |               |       scientific name |
100906  |       Nitrosomonas sp. ENI-11 |               |       scientific name |
100907  |       Chilean sea bass        |               |       common name     |

And the output I need is:

10090 | Mus musculus

There are 4 best solutions below

jared_mamrot On 26 July 2023 at 12:09

One option would be:

id=10090
awk -v id="$id" '/scientific name/ && $1 == id' names.dmp | cut -d "|" -f1,2

You can also preserve whitespace in awk (see e.g. the question "How to preserve the original whitespace between fields in awk?") and fold the cut command into your awk command, but as you describe yourself as 'inexperienced', this is probably the best solution.
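To see why the `$1 == id` comparison leaves out the unwanted entries mentioned in the question, here is a small sketch. It is only an illustration: `sample_rows` is a hypothetical helper standing in for names.dmp, and the rows for 100904 and 210090 are made-up placeholders (the question only says such ids exist).

```shell
#!/bin/sh
# Hypothetical helper that stands in for names.dmp.
# The 100904 and 210090 rows are invented placeholders for illustration.
sample_rows() {
cat <<'EOF'
10090   |       Mus musculus    |               |       scientific name |
100904  |       placeholder A   |               |       scientific name |
210090  |       placeholder B   |               |       scientific name |
EOF
}

id=10090

# Unanchored regex match: any first field that merely contains the id matches.
loose=$(sample_rows | awk -v id="$id" '/scientific name/ && $1 ~ id {print $1}')

# Equality test: only the first field that is exactly the id matches.
exact=$(sample_rows | awk -v id="$id" '/scientific name/ && $1 == id {print $1}')

# Anchored dynamic regex built by string concatenation: also an exact match.
anchored=$(sample_rows | awk -v id="$id" '/scientific name/ && $1 ~ ("^" id "$") {print $1}')

printf 'loose:\n%s\nexact:\n%s\nanchored:\n%s\n' "$loose" "$exact" "$anchored"
```

The unanchored match returns all three ids, while the equality test and the anchored regex return only 10090. The anchored form is the closest to the original `$1~/^10090$/` if a regular expression is specifically wanted.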

Paolo On 26 July 2023 at 12:30

A possible solution:

$ id=10090
$ awk -v id="$id" 'BEGIN{FS="[[:space:]]*[|][[:space:]]*";OFS="    |   "} /scientific name/ && $1==id {print $1,$2}' file
10090    |   Mus musculus

user1934428 On 26 July 2023 at 12:38

While you can set awk variables from the outside, and this is usually the best solution, your specific case is so simple that interpolation by the shell works as well:

awk '$1~/^'$id'$/{print $0}'

Since you know that your id is always a string of digits, you don't even have to double-quote here.
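Because the value is pasted straight into the awk program text, the "string of digits" assumption can be checked before interpolating. The following is a minimal sketch of that idea, not part of the answer itself: `sample_rows` is a hypothetical stand-in for names.dmp (rows copied from the example input), and the `case` guard is an added assumption about how one might validate the id.

```shell
#!/bin/sh
# Hypothetical helper standing in for names.dmp (rows from the example input).
sample_rows() {
cat <<'EOF'
10090   |       house mouse     |               |       genbank common name     |
10090   |       Mus musculus    |               |       scientific name |
10091   |       Mus musculus castaneus  |               |       scientific name |
100902  |       Fusarium oxysporum f. sp. conglutinans  |               |       scientific name |
EOF
}

id=10090

# Refuse anything that is not a non-empty string of digits, since the value
# becomes part of the awk program itself.
case $id in
    ''|*[!0-9]*) echo "id must be digits only: $id" >&2; exit 1 ;;
esac

# Same pipeline as in the question, with the shell splicing $id into the regex.
# Double-quoting "$id" is not required for digits, but it is harmless.
result=$(sample_rows | grep "scientific name" | awk '$1~/^'"$id"'$/{print $0}' | cut -d "|" -f1,2)
printf '%s\n' "$result"
```

With the guard in place, a value containing anything other than digits is rejected before it can alter the awk program.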

Renaud Pacalet (accepted answer) On 26 July 2023 at 12:25

When you use awk, you frequently don't need anything else:

$ awk -F'[[:space:]]*\\|[[:space:]]*' -v id="10090" '
  /scientific name/ && $1 == id {print $1 " | " $2}' file
10090 | Mus musculus

-F'[[:space:]]*\\|[[:space:]]*': set the input field separator as space-surrounded |.
-v id="10090": declare awk variable id and assign it 10090 (change this if needed).
If the input record matches the string scientific name and the first field equals id, print the first two fields separated by " | ".
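To make the effect of that field separator concrete, the sketch below prints the number of fields and each field in brackets for a single record copied from the example input. It is only an illustration; the variable names `line` and `fields` are chosen here and are not part of the answer.

```shell
#!/bin/sh
# One record copied from the example input in the question.
line='10090   |       Mus musculus    |               |       scientific name |'

# Split the record with the same separator as the answer, then print the
# field count followed by every field wrapped in brackets.
fields=$(printf '%s\n' "$line" |
    awk -F'[[:space:]]*\\|[[:space:]]*' '{
        printf "NF=%d", NF
        for (i = 1; i <= NF; i++) printf " [%s]", $i
        print ""
    }')

printf '%s\n' "$fields"
# prints: NF=5 [10090] [Mus musculus] [] [scientific name] []
```

The surrounding whitespace is absorbed into the separator, so `$1` is exactly `10090` and `$2` is exactly `Mus musculus`. The empty third field and the trailing empty field come from the empty column and the final `|` in the record.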

As noted in the comments, this does not preserve the input field separators. If you want to preserve them, you can use the split function of GNU awk (with its fourth argument) instead of the input field separator, to save the fields in one array and the separators in another:

$ awk -v id="10090" '/scientific name/ {
    split($0,f,/[[:space:]]*\|[[:space:]]*/,s)
    if(f[1] == id) print f[1] s[1] f[2]}' file
10090   |       Mus musculus

Finally, if your awk is not GNU awk but you want to preserve the field separators, you can use match and substr instead of split:

$ awk -F'[[:space:]]*\\|[[:space:]]*' -v id="10090" '
  /scientific name/ && $1==id {
    a=match($0,/\|/); b=match(substr($0,a+1),/[[:space:]]*\|/)
    print substr($0,1,a+b-1)}' file
10090   |       Mus musculus

We simply use match to find the index of the first | (a), then the index of the first space before the second | (b, counted from the character after the first |), and print only everything before that space (substr).
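The index arithmetic can be traced on one record from the example input. This is a sketch for illustration only; the names `line`, `rest` and `trace` are chosen here, and the record is assumed to use spaces exactly as shown in the question.

```shell
#!/bin/sh
# One record copied from the example input in the question.
line='10090   |       Mus musculus    |               |       scientific name |'

trace=$(printf '%s\n' "$line" | awk '{
    a = match($0, /\|/)                 # position of the first |
    rest = substr($0, a + 1)            # everything after the first |
    b = match(rest, /[[:space:]]*\|/)   # first space before the second |, within rest
    # Position a+b in $0 is that space, so keep the a+b-1 characters before it.
    printf "a=%d b=%d kept=[%s]\n", a, b, substr($0, 1, a + b - 1)
}')

printf '%s\n' "$trace"
# prints: a=9 b=20 kept=[10090   |       Mus musculus]
```

Since b is measured inside `rest`, it has to be shifted by a to get a position in the full record, which is why the final substr uses `a+b-1` as its length.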
