Snakemake: use different wildcards in a rule


I am trying to create a Snakemake rule that takes my fastq files as input and returns a .sam file for each fastq file as output.

I have a file like this:

FILE              TYPE    SM    LB    ID    PU        PL
xfgh.fastq.gz     Single  IND1  IND1  IND1  Platform  Illumina
IND2.fastq.gz     Single  IND2  IND2  IND2  Platform  Illumina
zfgv.fastq.gz     Single  IND3  IND3  IND3  Platform  Illumina
IND4_P1.fastq.gz  Single  IND4  IND4  IND4  Platform  Illumina

So I did something like this. I open my info file with pandas:

pd.read_csv("info_file.txt")

and I store the FILE, SM and ID columns in lists.
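
A minimal sketch of that setup (assuming info_file.txt is tab-separated, as in the layout above):

import pandas as pd

# Read the sample sheet; tab-separated per the layout shown above
info_df = pd.read_csv("info_file.txt", sep="\t")

# Columns used later to build the target paths
files = info_df["FILE"].tolist()
sms = info_df["SM"].tolist()
ids = info_df["ID"].tolist()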

Then I create my rules:

rule all:
    input:
        sam_file = expand("ALIGNEMENT/{sm}/{id}.sam", sm = info_df["SM"], id = info_df["ID"])

rule alignement:
    input:
          fastq_files = "PATH/TO/{fastq}"
    output:
          sam_file = "ALIGNEMENT/{sm}/{id}.sam"

I know the input and output need to share the same wildcards, but is there a way to take my input from the FILE column of my info_file.txt and produce output paths like "ALIGNEMENT/{sm}/{id}.sam", where {sm} and {id} come from the SM and ID columns of the same file?

I also want to launch one job per file.

If anyone can help me, thank you.


There are 2 answers below.

dariober (BEST ANSWER)

I am trying to create a Snakemake rule that takes my fastq files as input and returns a .sam file for each fastq file as output.

From the above it seems to me that you want to add zip to the expand function in rule all. With zip you pair the wildcards as they appear in your input lists; without it you get all combinations of {sm} and {id}.
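
For instance, the difference can be checked outside a Snakefile (a minimal sketch; expand can be imported from snakemake.io):

from snakemake.io import expand

sms = ["IND1", "IND2"]
ids = ["IND1", "IND2"]

# Without zip: all combinations (4 paths), including mismatched pairs like ALIGNEMENT/IND1/IND2.sam
print(expand("ALIGNEMENT/{sm}/{id}.sam", sm=sms, id=ids))

# With zip: element-wise pairing only (2 paths)
print(expand("ALIGNEMENT/{sm}/{id}.sam", zip, sm=sms, id=ids))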

Then, to get the input fastq file in rule alignement, you need to query the info dataframe for the FILE entry corresponding to a given id. You can do this with a lambda function, or write a dedicated function to use as input (a sketch of that variant follows the Snakefile below).

Here's my take on it:

import pandas as pd

info_df = pd.read_csv("info_file.txt", sep='\t')

rule all:
    input:
        # zip pairs sm and id element-wise instead of taking all combinations
        expand("ALIGNEMENT/{sm}/{id}.sam", zip, sm = info_df["SM"], id = info_df["ID"])

rule alignement:
    input:
        # Look up the FILE entry whose ID matches the {id} wildcard of the requested output
        fastq_files=lambda wc: info_df[info_df['ID'] == wc.id]['FILE'],
    output:
        sam_file = "ALIGNEMENT/{sm}/{id}.sam"
    shell:
        r"""
        # Placeholder command: replace the echo with the actual aligner call
        echo {input.fastq_files} > {output.sam_file}
        """
Dmitry Kuzminov

In a nutshell: you don't need Snakemake. It is not a good choice of tool here, because it solves the opposite of the problem you have.

The problem as you described it is: you have a list of input filenames, you know the procedure, and you want to process them in a loop. Just use pure Python: read the list of filenames, iterate over the names, and call the shell with the command that creates an output for each input. Python or any other scripting language of your choice will do; see the sketch below.
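
A minimal pure-Python sketch of that loop (with a placeholder echo standing in for the real alignment command):

import subprocess
import pandas as pd

info_df = pd.read_csv("info_file.txt", sep="\t")

for row in info_df.itertuples():
    sam = f"ALIGNEMENT/{row.SM}/{row.ID}.sam"
    # Placeholder: substitute the actual alignment command for the echo
    subprocess.run(f"mkdir -p ALIGNEMENT/{row.SM} && echo {row.FILE} > {sam}",
                   shell=True, check=True)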

Snakemake is a powerful tool that pursues the opposite goal. The starting point is WHAT you want to create as a target. The next step is to define the rules for HOW the target is created. The final step is to provide the INPUT files that become the leaves of the dependency tree, and to run the pipeline.

A slightly more Snakemake-idiomatic way to solve your problem may look as follows. First you define WHAT you want to get (note the zip parameter):

rule all:
    input:
        sam_file = expand("ALIGNEMENT/{sm}/{id}.sam", zip, sm=info_df["SM"], id=info_df["ID"])

Then you may define a rule that creates each sam file. Note that the {fastq} wildcard from your attempt is replaced by a lookup in a dictionary that maps (sm, id) pairs to the fastq filenames:

rule create_sam:
    input:
        lambda wildcards: f"PATH/TO/{mapping[(wildcards.sm, wildcards.id)]}"
    output:
        sam_file = "ALIGNEMENT/{sm}/{id}.sam"

And finally you need to create a mapping dictionary as a global object out of your csv file. Not the best solution to the wrong problem.
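
A minimal sketch of that dictionary, assuming the same tab-separated info_file.txt as above:

import pandas as pd

info_df = pd.read_csv("info_file.txt", sep="\t")

# Map each (SM, ID) pair to its FILE entry
mapping = {(row.SM, row.ID): row.FILE for row in info_df.itertuples()}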

Update: even though @dariober provides a valid answer that solves the problem, I maintain that there is no need to use Snakemake in this scenario. All the benefits that Snakemake offers are lost, while unnecessary complexity is added. As an alternative I'm providing a simplistic single-rule pipeline without the nasty lambda.

rule keep_it_simple_stupid:
    run:
        with open("info_file.txt") as f:
            next(f)  # skip the header line
            for row in f:
                filename, _, sm, _, id, _, _ = row.rstrip('\n').split('\t')
                shell(f"do whatever you want with {filename}, {sm}, and {id}")

The body of this rule is pure, bare-bones Python. Wrapping it into a Snakemake rule may make sense if it is part of a larger pipeline that is more idiomatic to Snakemake; if it is not, Snakemake should be avoided as the wrong tool for the task.