Extracting vcf columns substring with awk

229 Views Asked by At

I have vcf file like this:

##bcftools_annotateVersion=1.3.1+htslib-1.3.1
##bcftools_annotateCommand=annotate 
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG005
chr1    817186  rs3094315   G   A   50  PASS    platforms=2;platformnames=Illumina,CG;datasets=3;datasetnames=HiSeq250x250,CGnormal,HiSeqMatePair;callsets=5;callsetnames=HiSeq250x250Sentieon,CGnormal,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes;datasetsmissingcall=IonExome,SolidSE75bp;callable=CS_HiSeq250x250Sentieon_callable,CS_CGnormal_callable,CS_HiSeq250x250freebayes_callable;AN=2;AF=1;AC=2 GT:PS:DP:ADALL:AD:GQ    1/1:.:809:0,363:78,428:237
chr1    817341  rs3131972   A   G   50  PASS    platforms=3;platformnames=Illumina,CG,Solid;datasets=4;datasetnames=HiSeq250x250,CGnormal,HiSeqMatePair,SolidSE75bp;callsets=6;callsetnames=HiSeq250x250Sentieon,CGnormal,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes,SolidSE75GATKHC;datasetsmissingcall=IonExome;callable=CS_HiSeq250x250Sentieon_callable,CS_CGnormal_callable,CS_HiSeq250x250freebayes_callable;AN=2;AF=1;AC=2   GT:PS:DP:ADALL:AD:GQ    1/1:.:732:1,330:99,391:302

I need to extract ID column and AN from INFO column to get:

ID         INFO
rs3094315   2
rs3131972   2

I'm trying something like this awk '/^[^#]/ { print $3, gsub(/^[^AN=])/,"",$8)}' file.vcf, but still not getting the desired result.

5

There are 5 best solutions below

0
user438383 On BEST ANSWER

Usual reminder, there are dedicated tools for this kind of thing and there's no reason to use something like awk or sed, especially if you are a beginner and you don't really understand what the command is doing, as there are many pitfalls of parsing vcf files. There's every reason to think an awk script will silenty fail on a new/previous version of the vcf format in some weird and wonderful.

Simple bcftools solution:

bcftools query -f'%ID [%AN]\n' in.vcf > out.txt

bcftools format extracts fields from the vcf. Within the -f formqt flag, for INFO fields, use the format %FIELD and for FORMAT fields, use the format [%FIELD].

0
dawg On

You can try this awk:

awk 'BEGIN{OFS="\t"}
/^##/{next}
/^#/{print $3,$8; next}
{
    split($8,a,";")
    for(i=1;i<=length(a);i++) if (a[i]~/^AN=/) {sub(/^AN=/,"",a[i]); break}
    printf "%s%s%s\n", $3, OFS, a[i]
}
' file

With the example, prints:

ID  INFO
rs3094315   2
rs3131972   2
0
Renaud Pacalet On

GNU sed is maybe a better choice for this:

$ sed -En '/^#/!s/^(\w+\t){2}(\w+).*;AN=([^;]+).*/\2\t\3/p' file.vcf
rs3094315    2
rs3131972    3
0
RARE Kpop Manifesto On
 mawk NF=NF OFS='\t' FS='^##.+|^[^ \t]+[ \t]+[^ \t]+[ \t]+|[ \t]' \
               '.+(FILTER[ \t]+|;AN=)|([ \t]+[^ \t]+[ \t]+[^ \t]+|;.+)$' 

    ID  INFO    
    rs3094315   2   
    rs3131972   2
0
ufopilot On
$ awk -F' *|AN=|;AF=' 'NR>2{print(NR==3 ? $3"\t\t"$8 : $3"\t"$9)}' file
ID              INFO
rs3094315       2
rs3131972       2