Reproducible RNA-Seq analysis with Snakemake workflow + Pixi
1 Snakemake
1.1 The Philosophy
In Snakemake, everything is defined by rules. Each rule describes how to create output files from input files using a specific shell command. Snakemake automatically builds a directed acyclic graph (DAG) of these dependencies to determine the order of operations.
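This philosophy can be sketched with a minimal, hypothetical rule (the filenames here are illustrative only, not part of the pipeline below):

```python
# Minimal Snakefile sketch (hypothetical filenames): one rule mapping
# an input file to an output file via a shell command. The {name}
# wildcard lets one rule serve every matching file.
rule gzip_file:
    input:
        "data/{name}.txt"
    output:
        "results/{name}.txt.gz"
    shell:
        "gzip -c {input} > {output}"
```

Asking Snakemake for `results/report.txt.gz` is enough: it infers from the pattern that `data/report.txt` is required and runs the rule.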
1.2 Set Up
Just like with Nextflow, we use Pixi to lock our software versions.
mkdir snakemake_project && cd snakemake_project
pixi init
# Add Snakemake and bioinformatics tools
pixi add snakemake openjdk fastp salmon multiqc

1.3 Project Structure
Ensure your data is organized before starting:
snakemake_project/
├── data/
│ ├── ggal_gut_1.fq
│ ├── ggal_gut_2.fq
│ └── reference.fa
├── Snakefile # The "Recipe" file
├── pixi.toml # The "Software DNA"
└── results/ # Created automatically
1.4 Write the Snakefile
Snakemake’s Snakefile is where you define your workflow. It is a Python-based DSL (Domain-Specific Language) that lets you write rules in a clear, modular way. Create a file named Snakefile; it is the Snakemake equivalent of main.nf.
# =============================================================================
# Snakefile: RNA-Seq Quantification Pipeline
# =============================================================================

# 1. Define the sample names (hard-coded here; the YAML configuration
#    section below shows a cleaner approach)
SAMPLES = ["ggal_gut"]

# 2. The 'rule all' defines the final targets we want to achieve.
#    Snakemake works backward from here.
rule all:
    input:
        "results/multiqc/multiqc_report.html",
        # salmon_quant declares its output as a directory, so we request
        # the directory here rather than a file inside it
        expand("results/salmon/{sample}_quant", sample=SAMPLES)

# 3. Rule for trimming
rule fastp:
    input:
        r1 = "data/{sample}_1.fq",
        r2 = "data/{sample}_2.fq"
    output:
        r1 = "results/trimmed/{sample}_R1.fq.gz",
        r2 = "results/trimmed/{sample}_R2.fq.gz",
        json = "results/trimmed/{sample}.json"
    shell:
        """
        fastp -i {input.r1} -I {input.r2} \
              -o {output.r1} -O {output.r2} \
              -j {output.json}
        """

# 4. Rule for Salmon indexing (runs only once)
rule salmon_index:
    input:
        fasta = "data/reference.fa"
    output:
        directory("results/salmon_index")
    shell:
        "salmon index -t {input.fasta} -i {output} -k 15"

# 5. Rule for Salmon quantification
rule salmon_quant:
    input:
        r1 = "results/trimmed/{sample}_R1.fq.gz",
        r2 = "results/trimmed/{sample}_R2.fq.gz",
        index = "results/salmon_index"
    output:
        directory("results/salmon/{sample}_quant")
    shell:
        """
        salmon quant -i {input.index} -l A \
                     -1 {input.r1} -2 {input.r2} \
                     -o {output}
        """

# 6. Rule for MultiQC
rule multiqc:
    input:
        # Collects all JSONs and quant folders before running
        expand("results/trimmed/{sample}.json", sample=SAMPLES),
        expand("results/salmon/{sample}_quant", sample=SAMPLES)
    output:
        "results/multiqc/multiqc_report.html"
    shell:
        "multiqc results/ -o results/multiqc/"

1.5 Main run
To ensure Snakemake only uses your Pixi-installed tools and not your system tools, use the pixi run prefix.
# Preview what Snakemake will do (Dry run)
pixi run snakemake -n
# Run the pipeline with 2 cores
pixi run snakemake --cores 2

1.6 YAML configuration
You can run Snakemake with a YAML configuration file so that you don’t have to hard-code sample names in the Snakefile.
Using a configuration file (YAML) is the professional way to manage a Snakemake pipeline. It separates your code (the Snakefile) from your data (the file paths and parameters). This allows a colleague to run your pipeline on their own data just by editing one simple text file.
Create config.yaml
Create this file in your project root. This is where you define your “metadata.”
# config.yaml
samples:
  - ggal_gut
  - sample_control
  - sample_treatment

paths:
  reads: "data/"
  results: "results/"
  transcriptome: "data/reference.fa"

params:
  salmon_kmer: 15
  multiqc_title: "RNA-Seq Graduate Project"

Update Snakefile
We now tell Snakemake to load that YAML file and replace the hard-coded strings with config["key"] lookups.
# Snakefile - configured version

# 1. Load the config file
configfile: "config.yaml"

# 2. Extract variables for cleaner code
SAMPLES = config["samples"]
READS_DIR = config["paths"]["reads"]
RES_DIR = config["paths"]["results"]

rule all:
    input:
        f"{RES_DIR}multiqc/multiqc_report.html",
        # salmon_quant declares a directory output, so request the directory
        expand(f"{RES_DIR}salmon/{{sample}}_quant", sample=SAMPLES)

rule fastp:
    input:
        r1 = f"{READS_DIR}{{sample}}_1.fq",
        r2 = f"{READS_DIR}{{sample}}_2.fq"
    output:
        r1 = f"{RES_DIR}trimmed/{{sample}}_R1.fq.gz",
        r2 = f"{RES_DIR}trimmed/{{sample}}_R2.fq.gz",
        json = f"{RES_DIR}trimmed/{{sample}}.json"
    shell:
        "fastp -i {input.r1} -I {input.r2} -o {output.r1} -O {output.r2} -j {output.json}"

rule salmon_index:
    input:
        fasta = config["paths"]["transcriptome"]
    output:
        directory(f"{RES_DIR}salmon_index")
    params:
        k = config["params"]["salmon_kmer"]
    shell:
        "salmon index -t {input.fasta} -i {output} -k {params.k}"

rule salmon_quant:
    input:
        r1 = f"{RES_DIR}trimmed/{{sample}}_R1.fq.gz",
        r2 = f"{RES_DIR}trimmed/{{sample}}_R2.fq.gz",
        index = f"{RES_DIR}salmon_index"
    output:
        directory(f"{RES_DIR}salmon/{{sample}}_quant")
    shell:
        "salmon quant -i {input.index} -l A -1 {input.r1} -2 {input.r2} -o {output}"

rule multiqc:
    input:
        expand(f"{RES_DIR}trimmed/{{sample}}.json", sample=SAMPLES),
        expand(f"{RES_DIR}salmon/{{sample}}_quant", sample=SAMPLES)
    output:
        f"{RES_DIR}multiqc/multiqc_report.html"
    params:
        title = config["params"]["multiqc_title"]
    shell:
        "multiqc {RES_DIR} -o {RES_DIR}multiqc/ --title '{params.title}'"

Run Pipeline
The command remains the same, but now Snakemake reads the config.yaml automatically.
# Preview what Snakemake will do (Dry run)
pixi run snakemake -n
# Run the pipeline with 2 cores
pixi run snakemake --cores 2

Tips
The power of f-strings
In the Snakefile above, I used Python f-strings (e.g., f"{RES_DIR}trimmed/..."). This makes the paths much easier to read and modify. Note that when using f-strings in Snakemake, you must write double curly braces ({{sample}}) for Snakemake wildcards, to prevent Python from treating them as f-string variables.
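The brace-doubling behavior is plain Python and easy to check outside Snakemake:

```python
# Python substitutes {RES_DIR} immediately; the doubled braces collapse
# to a literal {sample}, which Snakemake later treats as a wildcard.
RES_DIR = "results/"
path = f"{RES_DIR}trimmed/{{sample}}.json"
print(path)  # results/trimmed/{sample}.json
```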
Wildcards vs. expand
- expand(): a “creator.” It generates a list of many files (e.g., one for every sample). Use it in rule all or rule multiqc, where you need to gather everything.
- {sample}: a “placeholder.” It tells a rule: “I don’t care which sample this is; just perform this action on whichever one is currently being requested.”
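Conceptually, expand() behaves like a list comprehension over a filename template. The function below is a simplified pure-Python stand-in (the real expand(), from snakemake.io, also handles multiple wildcards and other options):

```python
from itertools import product

def expand_like(template, **wildcards):
    # Simplified stand-in for Snakemake's expand(): fill each wildcard
    # with every combination of the supplied values.
    keys = list(wildcards)
    return [
        template.format(**dict(zip(keys, combo)))
        for combo in product(*(wildcards[k] for k in keys))
    ]

print(expand_like("results/trimmed/{sample}.json", sample=["ggal_gut", "ctrl"]))
# ['results/trimmed/ggal_gut.json', 'results/trimmed/ctrl.json']
```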
Visualizing your progress
Snakemake can generate a “map” of your work. This is excellent for your thesis defense slides:

pixi run snakemake --dag | dot -Tpdf > dag.pdf

(Note: this requires the graphviz package, which you can add via pixi add graphviz.)
Final check: why this is the “gold standard”
- Pixi handles the environment (the tools).
- Snakemake handles the execution (the steps).
- config.yaml handles the project data (the specifics).
You now have two complete, professional-grade workflows (Nextflow and Snakemake); which one you use for daily research is largely a matter of taste.
1.7 DAG (map)
To get a visual “map” (the DAG) for your thesis or Quarto document, Snakemake has built-in tools that generate these as standalone HTML and PDF/SVG files.
To integrate the DAG (Map) and the Execution Report directly into your Snakefile, we need to treat them as rules that run after everything else is finished.
However, there is a small technical trick: Snakemake cannot easily “report on itself” while it is still running the main pipeline. The best way to handle this for a graduate-level project is to create a “Meta-Rule” or a specific target.
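One way to express such a meta-rule is a sketch like the following (the rule name make_dag and the nested invocation are assumptions, not part of the original pipeline; the nested call targets the MultiQC report so it does not recurse into itself, and --nolock keeps it from clashing with the outer run's lock):

```python
# Hypothetical meta-rule: regenerate the workflow DAG as an SVG.
# Requires graphviz (pixi add graphviz).
rule make_dag:
    output:
        f"{RES_DIR}plots/workflow_dag.svg"
    shell:
        "snakemake --dag --nolock {RES_DIR}multiqc/multiqc_report.html "
        "| dot -Tsvg > {output}"
```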
Here is the updated Snakefile with two new rules and a new folder structure.
🛠️ Updated Snakefile (key rules)
rule all:
    input:
        f"{RES_DIR}multiqc/multiqc_report.html",
        expand(f"{RES_DIR}salmon/{{sample}}_quant", sample=SAMPLES),
        # Add the plots/reports to 'all' if you want them generated automatically
        f"{RES_DIR}plots/workflow_dag.svg"

Keep your fastp, salmon_index, and salmon_quant rules as they were.

rule multiqc:
    input:
        expand(f"{RES_DIR}trimmed/{{sample}}.json", sample=SAMPLES),
        expand(f"{RES_DIR}salmon/{{sample}}_quant", sample=SAMPLES)
    output:
        f"{RES_DIR}multiqc/multiqc_report.html"
    params:
        search_dir = RES_DIR,
        out_dir = f"{RES_DIR}multiqc",
        title = config["params"]["multiqc_title"]
    shell:
        # The -f (force) flag lets the report be regenerated on re-runs
        "multiqc {params.search_dir} -o {params.out_dir} "
        "-f --title '{params.title}' -n multiqc_report.html"

Run the command in your terminal to ensure that the execution report and DAG are generated correctly:
pixi run snakemake --cores 2

1.8 Results
After running this, your results/ folder 📂 will look like a professional bioinformatician’s project:
- results/trimmed/: Cleaned FastQ files and JSON logs.
- results/salmon/: Gene/Transcript quantification files.
- results/multiqc/: The summary of all quality metrics.
- results/plots/: Your workflow_dag.svg (Perfect for your Quarto doc).
1.9 Nextflow vs Snakemake
| Feature | Nextflow | Snakemake |
|---|---|---|
| Logic | Data-driven (Channels) | File-driven (Inputs/Outputs) |
| Language | Groovy-based DSL | Python-based |
| Resume | Uses -resume and a work/ folder | Checks timestamps of files on disk |
| Software | process.conda in config | pixi run or conda: per rule |
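As a sketch of the per-rule alternative mentioned in the table (the envs/fastp.yaml path and the rule body are illustrative assumptions), Snakemake can build a dedicated conda environment per rule when invoked with --use-conda:

```python
# Hypothetical rule with its own pinned environment; Snakemake creates
# the env from envs/fastp.yaml on demand when run with --use-conda.
rule fastp_report:
    input:
        r1 = "data/{sample}_1.fq"
    output:
        html = "results/fastp/{sample}.html"
    conda:
        "envs/fastp.yaml"
    shell:
        "fastp -i {input.r1} -h {output.html}"
```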
1.10 Why Snakemake?
- Python Power: Since Snakemake is Python, you can write standard Python code inside your Snakefile to parse complex filenames or handle metadata CSVs.
- Wildcards: The {sample} tag is a wildcard. Snakemake automatically figures out that if you want results/trimmed/sampleA_R1.fq.gz, it must look for data/sampleA_1.fq.
- Reproducibility: Combined with Pixi, a Snakemake workflow is 100% portable. If you move this folder to a cluster, it will behave exactly the same way.
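For example, the hard-coded SAMPLES list could instead be derived from the read files with a few lines of standard Python (a sketch: the filename list here stands in for glob.glob("data/*_1.fq"); Snakemake's built-in glob_wildcards offers the same idea natively):

```python
import re

# Stand-in for glob.glob("data/*_1.fq") so the example is self-contained
filenames = ["data/ggal_gut_1.fq", "data/sample_control_1.fq"]

# Extract the sample name from each forward-read filename
SAMPLES = sorted(
    m.group(1)
    for f in filenames
    if (m := re.match(r"data/(.+)_1\.fq$", f))
)
print(SAMPLES)  # ['ggal_gut', 'sample_control']
```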