How to solve "wildcards in input files cannot be determined from output files" in Snakemake


I am a new Snakemake user and am trying to develop a pipeline on some test data so that I can later apply it to our real data. I have multiple folders (one per patient), and each folder contains multiple files for the tumour and normal samples. Here is the structure of my directories:

`A/
 A-T1.fastq
 A-N1.fastq

B/
 B-T1.fastq
 B-N1.fastq

C/
 C-T1.fastq
 C-N1.fastq`

and so on .... (in total more than 100 directories).

This is my Snakefile:

`#!/usr/bin/env snakemake

configfile:
    "config.json"

(DIRS,SAMPLES) = glob_wildcards(config['data']+"{dir}/{sample}.fastq")

rule all:
    input:
        expand("results/mapped/{dir}/{sample}.sorted.bam", dir=DIRS, sample=SAMPLES)

rule symlink:
    input:
         expand(config['data']+"{dir}/{{sample}}.fastq")
    output:
         "00-input/{dir}/{sample}.fastq"
    shell: 
        "ln -s {input} {output}"     
               

rule map_reads:
    input:
        "data/genome.fa",
        "00-input/{dir}/{sample}.fastq"
    output:
        "results/mapped/{dir}/{sample}.bam"
    conda:
        "envs/samtools.yaml"
    shell:
        "bwa mem {input} | samtools view -b - > {output}"


rule sort_alignments:
    input:
        "results/mapped/{dir}/{sample}.bam"
    output:
        "results/mapped/{dir}/{sample}.sorted.bam"
    conda:
        "envs/samtools.yaml"
    shell:
        "samtools sort -o {output} {input}"`

This is my config file:

`{
    "data": "/analysis/Anna/snakemake-demo/data/samples_fastq/"
}`

By running this script I get the following error message:

`WildcardError in line 13:
No values given for wildcard 'dir'.
`

I tried a different approach by modifying the input of my symlink rule:

 ` input:
         expand(config['data']+"{{dir}}/{{sample}}.fastq")
`

And this time I get a different error message:

`Missing input files for rule symlink:`

I have looked through several similar questions on Stack Overflow and tried their suggestions, but I have not been able to fix the error so far. I would appreciate it if someone could help me understand where my mistake is and how I can fix it. Thank you.

AnnaS On BEST ANSWER

After trying several approaches and reading a number of Stack Overflow posts, I finally found the solution thanks to the very useful answer to the question "Process multiple directories and all files within using snakemake" and to https://snakemake.readthedocs.io/en/stable/project_info/faq.html#how-do-i-run-my-rule-on-all-files-of-a-certain-directory.

By default, the expand function uses itertools.product to create every combination of the supplied wildcard values. expand takes an optional second positional argument that customizes how the values are combined. I needed to pass zip there, so that each dir is paired only with its own sample. Here is my working code, slightly simplified compared to my original question:

`#!/usr/bin/env snakemake

configfile:
    "config.json"

DIRS,SAMPLES = glob_wildcards(config['data']+"{dir}/{sample}.fastq")

rule all:
    input:
        expand("results/mapped/{dir}/{sample}.sorted.bam", zip, dir=DIRS, sample=SAMPLES)

rule symlink:
    input:
        config['data']+"{dir}/{sample}.fastq"
    output:
        "00-input/{dir}/{sample}.fastq"
    shell:
        "ln -s {input} {output}"


rule map_reads:
    input:
        fasta="data/genome.fa",
        fastq=rules.symlink.output
    output:
        "results/mapped/{dir}/{sample}.bam"
    conda:
        "envs/samtools.yaml"
    shell:
        "bwa mem {input} | samtools view -b - > {output}"


rule sort_alignments:
    input:
        rules.map_reads.output
    output:
        "results/mapped/{dir}/{sample}.sorted.bam"
    conda:
        "envs/samtools.yaml"
    shell:
        "samtools sort -o {output} {input}"`

PaulArthurM On

By running a minimal example on my computer, based on the information you provided, I managed to reproduce your errors.

Concerning the first error, your solution expand(config['data']+"{{dir}}/{{sample}}.fastq") also worked for me.

However, for the second error, the complete error message was:

`MissingInputException: Missing input files for rule symlink:
    output: 00-input/A/A-T1.sorted.fastq
    wildcards: dir=A, sample=A-T1.sorted
    affected files:
        /home/paularthur/Documents/stack_overflow/77107606/data/A/A-T1.sorted.fastq`

Note that the value of the sample wildcard is A-T1.sorted, while you expect it to be A-T1.

My understanding is that, in your version, the filenames are ambiguous: Snakemake cannot automatically infer the value of the sample wildcard between rule sort_alignments and rule map_reads.
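The ambiguity can be illustrated with a small plain-Python sketch. It does not use Snakemake itself: `pattern_to_regex` and `match_wildcards` are hypothetical helpers that only approximate Snakemake's matching, assuming every wildcard matches `.+` unless a constraint is given.

```python
import re


def pattern_to_regex(pattern, constraints=None):
    """Build a regex from an output pattern such as 'a/{x}.bam'.

    Simplified stand-in for Snakemake's matching (assumption): each
    wildcard matches '.+' unless a constraint regex is supplied.
    """
    constraints = constraints or {}
    pieces = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, piece in enumerate(pieces):
        if index % 2 == 0:
            # Even positions hold literal text.
            regex += re.escape(piece)
        else:
            # Odd positions hold wildcard names.
            regex += "(?P<{}>{})".format(piece, constraints.get(piece, ".+"))
    return re.compile(regex)


def match_wildcards(pattern, path, constraints=None):
    """Return the wildcard values if path matches pattern, else None."""
    match = pattern_to_regex(pattern, constraints).fullmatch(path)
    return match.groupdict() if match else None


target = "results/mapped/A/A-T1.sorted.bam"

# Output pattern of sort_alignments: sample is 'A-T1', as intended.
print(match_wildcards("results/mapped/{dir}/{sample}.sorted.bam", target))

# Output pattern of map_reads also matches, with sample 'A-T1.sorted'.
print(match_wildcards("results/mapped/{dir}/{sample}.bam", target))

# Forbidding dots in sample removes the second match (prints None).
print(match_wildcards("results/mapped/{dir}/{sample}.bam", target,
                      {"sample": "[^.]+"}))
```

The last call hints at another way out: Snakemake's `wildcard_constraints` directive can restrict what a wildcard is allowed to match, although whether a pattern like `[^.]+` suits your sample names is an assumption you would need to check against your data.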

In this kind of situation, I use rule dependencies: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#rule-dependencies

Quoting the documentation: "Importantly, be aware that referring to rule a here requires that rule a was defined above rule b in the file, since the object has to be known already. This feature also allows us to resolve dependencies that are ambiguous when using filenames."

The documentation also says: "Note that when the rule you refer to defines multiple output files but you want to require only a subset of those as input for another rule, you should name the output files and refer to them specifically."

After modifying your Snakefile accordingly, I managed to run the pipeline as a dry run.

Edit: updated with the zip solution from @AnnaS.

`#!/usr/bin/env snakemake

configfile:
    "config.json"

(DIRS,SAMPLES) = glob_wildcards(config['data']+"{dir}/{sample}.fastq")

rule all:
    input:
        expand("results/mapped/{dir}/{sample}.sorted.bam", zip, dir=DIRS, sample=SAMPLES)

rule symlink:
    input:
        expand(config['data']+"{{dir}}/{{sample}}.fastq")
    output:
        "00-input/{dir}/{sample}.fastq"
    shell:
        "ln -s {input} {output}"


rule map_reads:
    input:
        fasta="data/genome.fa",
        fastq=rules.symlink.output
    output:
        "results/mapped/{dir}/{sample}.bam"
    conda:
        "envs/samtools.yaml"
    shell:
        "bwa mem {input} | samtools view -b - > {output}"


rule sort_alignments:
    input:
        rules.map_reads.output
    output:
        "results/mapped/{dir}/{sample}.sorted.bam"
    conda:
        "envs/samtools.yaml"
    shell:
        "samtools sort -o {output} {input}"`

dariober On

I haven't tried running an example, but it seems to me that in the symlink rule you want:

`rule symlink:
    input:
        config['data'] + "{dir}/{sample}.fastq",
    ...`

This will run ln -s ... for each combination of {dir} and {sample} that is needed to produce the output required by the downstream rules.

Also, keep in mind that:

`expand("results/mapped/{dir}/{sample}.sorted.bam", dir=DIRS, sample=SAMPLES)`

will produce all combinations of {dir} and {sample}, including pairs that do not exist on disk. This may or may not be what you want.
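The difference between all combinations and pairwise matching can be seen in a short plain-Python sketch. It does not call Snakemake: `itertools.product` and the built-in `zip` stand in for the two combination modes of expand, the `DIRS` and `SAMPLES` lists are hand-written assumptions about what glob_wildcards would return for the first two patient folders, and `render` is a hypothetical helper.

```python
from itertools import product

# Hand-written stand-ins for the paired lists glob_wildcards would return
# (assumption: one entry per FASTQ file, in matching order).
DIRS = ["A", "A", "B", "B"]
SAMPLES = ["A-T1", "A-N1", "B-T1", "B-N1"]

PATTERN = "results/mapped/{dir}/{sample}.sorted.bam"


def render(pairs):
    """Fill PATTERN for each (dir, sample) pair; hypothetical helper."""
    return [PATTERN.format(dir=d, sample=s) for d, s in pairs]


# All combinations: every dir is combined with every sample.
all_combinations = render(product(DIRS, SAMPLES))

# Pairwise: the i-th dir is combined only with the i-th sample.
paired = render(zip(DIRS, SAMPLES))

# 16 paths, including ones with no source file,
# such as results/mapped/A/B-T1.sorted.bam
print(len(all_combinations))

# Exactly the 4 paths that correspond to existing FASTQ files
for path in paired:
    print(path)
```

With more than 100 patient folders, the mismatched paths vastly outnumber the valid ones, which is why the pairwise form is the one that fits this directory layout.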