I am currently using Snakemake for a bioinformatics project. Given a human reference genome (hg19) and a BAM file, I want to be able to specify that there will be multiple output files with the same name but different extensions. Here is my code:
rule gridss_preprocess:
    input:
        ref=config['ref'],
        bam=config['bamdir'] + "{sample}.dedup.downsampled.bam",
        bai=config['bamdir'] + "{sample}.dedup.downsampled.bam.bai"
    output:
        expand(config['bamdir'] + "{sample}.dedup.downsampled.bam{ext}", ext=config['workreq'], sample="{sample}")
Currently config['workreq'] is a list of extensions that each start with ".".
For example, I want to be able to use expand to indicate the following files:
S1.dedup.downsampled.bam.cigar_metrics
S1.dedup.downsampled.bam.computesamtags.changes.tsv
S1.dedup.downsampled.bam.coverage.blacklist.bed
S1.dedup.downsampled.bam.idsv_metrics
I want to be able to do this for multiple sample files, S_. Currently I am not getting an error when I try to do a dry run. However, I am not sure if this will run properly.
Am I doing this right?
expand() defines a list of files. If you're using two parameters, the cartesian product will be used. Thus, your rule will define as output ALL files with your extension list for ALL samples. Since you define a wildcard in your input, I think that what you want is all files with your extension for ONE sample. This rule will then be executed as many times as there are samples.
You're mixing up wildcards and placeholders for the expand() function. You can define a wildcard inside an expand() by doubling the brackets, i.e. writing {{sample}} instead of {sample}. With only ext passed as a keyword argument, the expand function will expand into the list
{sample}.dedup.downsampled.bam.cigar_metrics
{sample}.dedup.downsampled.bam.computesamtags.changes.tsv
{sample}.dedup.downsampled.bam.coverage.blacklist.bed
{sample}.dedup.downsampled.bam.idsv_metrics
and thus define the wildcard sample to match the files in the input.
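To make the doubled-bracket behaviour concrete, here is a minimal plain-Python sketch. It is not Snakemake's actual implementation: expand_sketch is a hypothetical, simplified stand-in that formats a pattern with every combination of the keyword values, and the bamdir value, the workreq list and the sample names S1 and S2 are assumed example values standing in for the real config.

```python
from itertools import product


def expand_sketch(pattern, **values):
    """Simplified, hypothetical stand-in for Snakemake's expand().

    Formats `pattern` once for every combination (cartesian product)
    of the keyword values. Doubled braces such as {{sample}} are left
    behind as a literal {sample}, i.e. as a wildcard.
    """
    keys = list(values)
    # Treat a single string as a one-element list.
    value_lists = [[v] if isinstance(v, str) else list(v) for v in values.values()]
    return [
        pattern.format(**dict(zip(keys, combo)))
        for combo in product(*value_lists)
    ]


# Assumed example values standing in for config['bamdir'] and config['workreq'].
bamdir = "bams/"
workreq = [
    ".cigar_metrics",
    ".computesamtags.changes.tsv",
    ".coverage.blacklist.bed",
    ".idsv_metrics",
]

# Only ext is expanded; {{sample}} survives as the wildcard {sample}.
outputs = expand_sketch(bamdir + "{{sample}}.dedup.downsampled.bam{ext}", ext=workreq)
for path in outputs:
    print(path)

# Two expanded parameters give the cartesian product: every extension
# for every sample (2 samples x 4 extensions = 8 paths).
all_pairs = expand_sketch(
    bamdir + "{sample}.dedup.downsampled.bam{ext}",
    sample=["S1", "S2"],
    ext=workreq,
)
print(len(all_pairs))
```

The first call yields four patterns that still contain {sample}, which is what a per-sample rule needs in its output. The second call shows what happens when sample is expanded over a real list of names: the result is a flat list of concrete file names with no wildcard left.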