I have genotypes for over 20k individuals in a VCF file produced by imputation. Here is an example of what the file looks like, with only 7 samples:
#CHROM POS ID REF ALT QUAL FILTER FORMAT INFO 0_0_473294.CEL 0_0_347293_v2.CEL 0_0_9588393_RS.CEL 0_0_999444_rp.CEL 0_0_26:9494949.CEL 0_0_237485_RS_rp.CEL 0_0_27:484848.CEL
16 11781 rs549521730 G C . PASS IMPUTED GP
So the individuals' genotypes start at column 10. Now I need to change the sample IDs in this VCF file so that it looks like this:
#CHROM POS ID REF ALT QUAL FILTER FORMAT INFO 473294 347293 9588393 999444 9494949 237485 484848
16 11781 rs549521730 G C . PASS IMPUTED GP
Therefore, I need only the serial numbers, without the flanking stuff like .CEL, _RS, 26:, and so on.
Do you know of a tool, like bcftools, that can rename the samples in a VCF file? Or is it possible to do it in bash? Thank you!
If I'm reading your question correctly, it looks like you just want to change the column names?
It looks like the sample column names come in a lot of different formats. How you go about converting those to just the number you want will depend on the specifics, but it will probably involve a regex. I'm not sure your example has enough info to answer that part.
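Going only by the seven names in the example, one possible regex is sketched below. It assumes every name starts with 0_0_, optionally followed by a numeric prefix ending in a colon (like 26:), then the serial number, then anything else. The clean_id helper is a made-up name for illustration, and names in the full 20k file that follow a different pattern would need their own rule.

```shell
# Hypothetical helper: reads sample names on stdin, prints the serial number.
# Assumed shape (from the example only): 0_0_[NN:]<serial>[_suffix].CEL
clean_id() {
  sed -E 's/^0_0_([0-9]+:)?([0-9]+).*$/\2/'
}

# Try it on a few of the names from the example header.
for name in 0_0_473294.CEL 0_0_347293_v2.CEL 0_0_26:9494949.CEL 0_0_237485_RS_rp.CEL; do
  printf '%s -> %s\n' "$name" "$(printf '%s\n' "$name" | clean_id)"
done
```

A name that does not fit the assumed shape passes through unchanged, so comparing the old and new names side by side is a cheap way to spot the ones that were missed.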
I'd recommend something like making a single-line header text file (header.txt), making a new VCF file from it (output.vcf), and appending all but the header line of the input VCF file (input.vcf) to the new file.
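Those three steps could look roughly like the sketch below. It is only an illustration under a few assumptions: the file is tab-delimited, the sample names have the shape seen in the example, and the toy input.vcf built at the top (cut to 3 samples, in a scratch directory) stands in for the real file. Any "##" meta lines are copied ahead of the new header so they stay at the top.

```shell
# Sketch only: assumes a tab-delimited VCF and names shaped like the example.
workdir=$(mktemp -d)   # scratch directory so nothing real is overwritten
cd "$workdir" || exit 1

# Toy input.vcf: the header and record from the question, cut to 3 samples.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tFORMAT\tINFO\t0_0_473294.CEL\t0_0_26:9494949.CEL\t0_0_237485_RS_rp.CEL\n' > input.vcf
printf '16\t11781\trs549521730\tG\tC\t.\tPASS\tIMPUTED\tGP\n' >> input.vcf

# 1. header.txt: the #CHROM line with columns 10+ reduced to the serial number.
awk 'BEGIN { FS = OFS = "\t" }
     /^#CHROM/ {
       for (i = 10; i <= NF; i++) {
         sub(/^0_0_([0-9]+:)?/, "", $i)   # drop "0_0_" and an optional "26:"-style prefix
         sub(/[^0-9].*$/, "", $i)         # drop everything after the leading digits
       }
       print
       exit
     }' input.vcf > header.txt

# 2. output.vcf: any "##" meta lines first (none in the toy file), then the new header.
grep '^##' input.vcf > output.vcf || true
cat header.txt >> output.vcf

# 3. Append every non-header line of the input unchanged.
grep -v '^#' input.vcf >> output.vcf

cat output.vcf
```

Before using the result, it is worth checking that the shortened names are still unique, for example with tr '\t' '\n' < header.txt | sort | uniq -d, since two different .CEL names could reduce to the same serial number. If bcftools is installed and the file is valid VCF, bcftools reheader with its -s/--samples option (a file of new names, one per line in header order) is the dedicated tool for the same job.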