I have vcf file like this:
##bcftools_annotateVersion=1.3.1+htslib-1.3.1
##bcftools_annotateCommand=annotate
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG005
chr1 817186 rs3094315 G A 50 PASS platforms=2;platformnames=Illumina,CG;datasets=3;datasetnames=HiSeq250x250,CGnormal,HiSeqMatePair;callsets=5;callsetnames=HiSeq250x250Sentieon,CGnormal,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes;datasetsmissingcall=IonExome,SolidSE75bp;callable=CS_HiSeq250x250Sentieon_callable,CS_CGnormal_callable,CS_HiSeq250x250freebayes_callable;AN=2;AF=1;AC=2 GT:PS:DP:ADALL:AD:GQ 1/1:.:809:0,363:78,428:237
chr1 817341 rs3131972 A G 50 PASS platforms=3;platformnames=Illumina,CG,Solid;datasets=4;datasetnames=HiSeq250x250,CGnormal,HiSeqMatePair,SolidSE75bp;callsets=6;callsetnames=HiSeq250x250Sentieon,CGnormal,HiSeq250x250freebayes,HiSeqMatePairSentieon,HiSeqMatePairfreebayes,SolidSE75GATKHC;datasetsmissingcall=IonExome;callable=CS_HiSeq250x250Sentieon_callable,CS_CGnormal_callable,CS_HiSeq250x250freebayes_callable;AN=2;AF=1;AC=2 GT:PS:DP:ADALL:AD:GQ 1/1:.:732:1,330:99,391:302
I need to extract ID column and AN from INFO column to get:
ID INFO
rs3094315 2
rs3131972 2
I'm trying something like this awk '/^[^#]/ { print $3, gsub(/^[^AN=])/,"",$8)}' file.vcf, but still not getting the desired result.
Usual reminder, there are dedicated tools for this kind of thing and there's no reason to use something like awk or sed, especially if you are a beginner and you don't really understand what the command is doing, as there are many pitfalls of parsing vcf files. There's every reason to think an awk script will silenty fail on a new/previous version of the vcf format in some weird and wonderful.
Simple bcftools solution:
bcftools formatextracts fields from the vcf. Within the-fformqt flag, forINFOfields, use the format%FIELDand forFORMATfields, use the format[%FIELD].