How to set the output directory in a shell command in a Snakemake workflow


The --output_dir option in my shell command lets files be written to that directory, but I keep getting this error:

SyntaxError:
Not all output, log and benchmark files of rule bismark_cov contain the same wildcards. This is crucial though, in order to avoid that two or more jobs write to the same file.
  File "extra_bismark_methyl_analysis.smk", line 35, in <module>
bismark_methylation_extractor {input.bam_path} --parallel 4 \
--paired-end --comprehensive \
--bedGraph --zero_based --output_dir {params.out_dir}

Here is the full Snakefile I used:

import os
import glob
from datetime import datetime

#import configs
configfile: "/lila/data/greenbaum/users/ahunos/apps/lab_manifesto/configs/config_snakemake_lilac.yaml"

# Define the preprocessed files directory  
preprocessedDir = '/lila/data/greenbaum/projects/methylSeq_Spectrum/data/preprocessed/WholeGenome_Methyl/OUTDIR/bismark/deduplicated/*.bam'
dir2='/lila/data/greenbaum/projects/methylSeq_Spectrum/data/preprocessed/Capture_Methyl/OUTDIR/bismark/deduplicated/*.bam'

# Create the pattern to match BAM files
def get_bams(nfcore_OUTDIR):
    bam_paths = glob.glob(nfcore_OUTDIR, recursive=True)
    return bam_paths

#combine bam files
bam_paths = get_bams(nfcore_OUTDIR=preprocessedDir) + get_bams(nfcore_OUTDIR=dir2)
print(bam_paths)

#get sample names
SAMPLES = [os.path.splitext(os.path.basename(f))[0] for f in bam_paths]
print(f"heres SAMPLES \n{SAMPLES}")


contexts=['CpG','CHH','CHG']
suffixes=['bismark.cov.gz','M-bias.txt', '.bedGraph.gz']

rule all:
    input:
        expand('results/{sample}/{sample}.{suffix}', sample=SAMPLES, suffix=suffixes, allow_missing=True),
        expand('results/{sample}/{sample}_splitting_report.txt', sample=SAMPLES,allow_missing=True),
        expand('results/{sample}/{C_context}_context_{sample}.txt', sample=SAMPLES, C_context=contexts,allow_missing=True)

rule bismark_cov:
    input:
        bam_path=lambda wildcards: wildcards.bam_paths
    output:
        'results/{sample}/{sample}.{suffix}',
        'results/{sample}/{sample}_splitting_report.txt',
        'results/{sample}/{C_context}_context_{sample}.txt'
    params:
        out_dir='results/{sample}'
    shell:
        """ 
bismark_methylation_extractor {input.bam_path} --parallel 4 \
--paired-end --comprehensive \
--bedGraph --zero_based --output_dir {params.out_dir}
        """


kEks On

The problem is that your outputs do not all contain the same wildcards: the first has an additional {suffix} wildcard and the third an additional {C_context}, while the second has only {sample}. Based on the first output line, the rule would run three times for each sample (once per suffix), but the output line in the middle would only run once per sample. The same goes for {C_context}.
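To see which outputs disagree, a small plain-Python sketch can list the wildcards in each pattern. The helper name wildcards_in is made up for illustration and is not part of Snakemake; it only handles simple {name} placeholders, not constraints such as {name,regex}.

```python
import re


def wildcards_in(pattern):
    """Return the set of simple {name} placeholders in an output pattern."""
    return set(re.findall(r"\{(\w+)\}", pattern))


# The three output patterns from rule bismark_cov in the question.
outputs = [
    "results/{sample}/{sample}.{suffix}",
    "results/{sample}/{sample}_splitting_report.txt",
    "results/{sample}/{C_context}_context_{sample}.txt",
]

wildcard_sets = [wildcards_in(p) for p in outputs]
for pattern, names in zip(outputs, wildcard_sets):
    print(sorted(names), pattern)

# Snakemake wants these sets to be identical for every output of a rule.
all_same = all(s == wildcard_sets[0] for s in wildcard_sets)
print("all outputs share the same wildcards:", all_same)
```

Only the middle pattern depends on {sample} alone, which is what triggers the SyntaxError.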

I don't know bismark_methylation_extractor, but I guess it creates all these suffixes and contexts in a single run. If this is the case, then you could either write them out explicitly:

    output:
        'results/{sample}/{sample}.M-bias.txt',
        'results/{sample}/{sample}.bismark.cov.gz',
        'results/{sample}/{sample}.bedGraph.gz',
        'results/{sample}/{sample}_splitting_report.txt',

Or, I guess, using expand in the output section of the rule (like you did in the input of rule all) should work as well, as long as {sample} stays a wildcard, for example with allow_missing=True.
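As a rough sketch of the second option, the patterns can also be built in plain Python so that {sample} is the only wildcard left. The file names are taken from the question and are not checked against what bismark_methylation_extractor actually writes; the variable name bismark_outputs is made up for illustration. Note that the question's suffix list contains '.bedGraph.gz' with a leading dot, which would give a double dot in the path, so the dot is dropped here.

```python
suffixes = ["bismark.cov.gz", "M-bias.txt", "bedGraph.gz"]
contexts = ["CpG", "CHH", "CHG"]

# Doubled braces inside the f-strings keep {sample} as a literal wildcard,
# while the suffix and context values are filled in right away.
bismark_outputs = (
    [f"results/{{sample}}/{{sample}}.{suffix}" for suffix in suffixes]
    + ["results/{sample}/{sample}_splitting_report.txt"]
    + [f"results/{{sample}}/{context}_context_{{sample}}.txt" for context in contexts]
)

for pattern in bismark_outputs:
    print(pattern)
```

In the Snakefile, such a list could presumably be given directly as the output of rule bismark_cov, with rule all expanding the same patterns over sample=SAMPLES.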