Reproducible RNA-Seq analysis with Nextflow + Pixi
1 Nextflow
1.1 The Philosophy
Nextflow is a powerful workflow manager that allows you to define complex bioinformatics pipelines in a clear and modular way. Pixi is a modern package manager that ensures your software dependencies are consistent and reproducible across different environments.
In research, reproducibility is king. If a colleague can’t run your code 6 months from now, the science is incomplete. This stack ensures that every dependency (Java, STAR, Fastp) is locked to a specific version and hash.
1.2 Install Pixi
Pixi is a modern package and project manager. Unlike classic Conda, it is written in Rust and is significantly faster and more reliable.
# Install Pixi
curl -fsSL https://pixi.sh/install.sh | bash
source ~/.bashrc # or ~/.zshrc, depending on your shell; loads pixi into your terminal session
pixi --version # should print the version number
IMPORTANT: Restart your terminal (or re-source your shell config) after installation.
1.3 Initialize Project
Create a new directory for your analysis and move into it.
mkdir rna_seq_project # create a new folder for your project
cd rna_seq_project # move into the project folder
Now initialize the project with Pixi. This creates a pixi.toml file in your directory. Think of this file as the "DNA" of your project: it lists every piece of software and every dependency you will use.
pixi init --channel conda-forge --channel bioconda # initialize a new project with the specified channels
1.4 Add Bioinfo tools
We need to add our bioinformatics tools. Nextflow requires Java; for RNA-seq analysis we typically use fastqc for quality control, fastp for read trimming and filtering, and salmon for transcript quantification. We'll also add samtools for BAM file handling and multiqc for summarizing QC reports.
pixi add nextflow "openjdk>=17" fastqc fastp salmon samtools multiqc # add dependencies to your project (add star etc. as needed)
Note:
- The command above automatically creates a pixi.lock file. This lock file is what makes your pipeline 100% reproducible on this and other machines.
- Never edit pixi.lock manually; it stores the exact versions and hashes of your software.
- We include openjdk because Nextflow requires a Java environment to run.
Here is the pixi.toml file after adding the dependencies. It is human-readable and shows the software you have added, but it does not contain the exact versions or hashes; the pixi.lock file stores that information:
[workspace]
name = "nextflow-pixi-simple"
channels = ["conda-forge", "bioconda"]
platforms = ["linux-64"] # ✅ safest for WSL/Linux. Remove extra platforms.
[dependencies]
nextflow = ">=25.10.4,<26" # Nextflow itself (the workflow manager)
openjdk = ">=17" # Java environment for Nextflow
fastp = ">=1.3.1,<2" # Fastp for quality control
samtools = ">=1.22.1,<2" # Samtools for BAM file handling
multiqc = ">=1.33,<2" # MultiQC for summarizing QC reports
fastqc = ">=0.12.1,<0.13" # FastQC for quality control
salmon = ">=1.10.3,<2" # Salmon for transcript quantification
1.5 Prepare data
Bioinformatics pipelines expect specific folder structures. Let's say our RNA-seq data (FASTQ files and a reference) lives in a folder called data/. Create that structure, then add the example files (or your own):
mkdir data # create a data folder for your input files
cd data # move into the data folder
For this tutorial, we download the small example files from the Nextflow rnaseq-nf repository; replace them with your actual data when you have it.
Reference (genome or transcriptome):
wget https://raw.githubusercontent.com/nextflow-io/rnaseq-nf/ca20a6dfd2d799b903e557dc2e736d55419c7a1a/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa
FASTQ files, R1:
wget https://raw.githubusercontent.com/nextflow-io/rnaseq-nf/ca20a6dfd2d799b903e557dc2e736d55419c7a1a/data/ggal/ggal_gut_1.fq
R2:
wget https://raw.githubusercontent.com/nextflow-io/rnaseq-nf/ca20a6dfd2d799b903e557dc2e736d55419c7a1a/data/ggal/ggal_gut_2.fq
Note: use the raw.githubusercontent.com form of the URL; the github.com .../blob/... page URL would download an HTML page instead of the file. When you are done, return to the project root with cd ..
For your own data, place your reference .fa file and your .fq or .fastq.gz files inside data/.
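Before wiring files into the pipeline, a quick sanity check catches truncated downloads (a common failure mode with the wrong URL form). The sketch below runs on a dummy record it creates itself; point the same commands at your real data/*.fq files:

```shell
# A valid FASTQ file has 4 lines per record, and each record's
# first line starts with '@'. Create a one-record dummy file:
cat > demo.fq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
EOF

lines=$(wc -l < demo.fq)
if [ $((lines % 4)) -eq 0 ]; then
  echo "OK: $((lines / 4)) complete record(s)"
else
  echo "WARNING: line count not a multiple of 4 (truncated file?)"
fi

# Every 1st line of each 4-line record must start with '@'
awk 'NR % 4 == 1 && $0 !~ /^@/ { bad++ } END { print (bad ? "bad headers: " bad : "headers OK") }' demo.fq
```

If a download gave you an HTML page instead of a FASTQ file, both checks fail immediately, long before Nextflow produces a confusing error.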
1.6 Write main.nf
Create your main.nf script, for example with nano main.nf. This file defines the "processes" (the tasks) and the "workflow" (how data flows between them). Nextflow's syntax is straightforward: each process has inputs, outputs, and a script section with the commands to run, and the workflow section connects everything together.
Here is a simple example that performs quality control with FastQC, trims reads with fastp, and quantifies transcripts against a reference transcriptome with Salmon. You can expand it with more processes (e.g., STAR for genome alignment, DESeq2 for differential expression) as needed; the key is to keep it modular and clear.
/*
* RNA-Seq Pipeline: FastQC -> Fastp -> Salmon -> MultiQC
*/
// ===========================================
// PROCESSES
// ===========================================
// 1. Indexing (Run once per transcriptome)
process SALMON_INDEX {
tag "${transcriptome.baseName}"
publishDir "${params.outdir}/salmon_index", mode: 'copy'
input: path transcriptome
output: path "salmon_index", emit: index
script: "salmon index -t ${transcriptome} -i salmon_index -k 15" // small k suits the tiny demo reference; Salmon's default is k=31
}
// 2. Quality Control
process FASTQC {
tag "${pair_id}"
publishDir "${params.outdir}/fastqc", mode: 'copy'
input: tuple val(pair_id), path(r1), path(r2)
output: path "*_fastqc.{zip,html}"
script: "fastqc ${r1} ${r2}"
}
// 3. Trimming
process FASTP {
tag "${pair_id}"
// No publishDir here usually, to save space. Results go to Salmon.
input: tuple val(pair_id), path(r1), path(r2)
output:
tuple val(pair_id), path("tr_${pair_id}_R1.fq.gz"), path("tr_${pair_id}_R2.fq.gz"), emit: reads
path "*.json", emit: json
script:
"""
fastp -i ${r1} -I ${r2} \
-o tr_${pair_id}_R1.fq.gz -O tr_${pair_id}_R2.fq.gz \
-j ${pair_id}.json
"""
}
// 4. Quantification
process SALMON_QUANT {
tag "${pair_id}"
publishDir "${params.outdir}/counts", mode: 'copy'
input:
tuple val(pair_id), path(r1), path(r2)
path index
output:
path "${pair_id}_quant"
script:
"""
salmon quant -i ${index} -l A \
-1 ${r1} -2 ${r2} \
-o ${pair_id}_quant
"""
}
// ===========================================
// WORKFLOW
// ===========================================
workflow {
// Read input pairs (flat: true gives val, path, path)
read_ch = channel.fromFilePairs(params.reads, flat: true)
// Run Indexing once
SALMON_INDEX(params.transcriptome)
// Run QC and Trimming in parallel
fastqc_out = FASTQC(read_ch)
fastp_out = FASTP(read_ch)
// Run Salmon using the index and the trimmed reads
salmon_out = SALMON_QUANT(fastp_out.reads, SALMON_INDEX.out.index)
}
1.7 Config file nextflow.config
Nextflow needs to know where to find the software Pixi just installed. Create a nextflow.config file; it handles the "where" and "how."
Define your parameters and resources here to keep the workflow flexible and reusable. First, we tell Nextflow to use the environment created by Pixi. Then we set our input files and parameters. Finally, we define default resources for all processes.
// nextflow.config
params {
// Inputs - Use absolute paths or ${projectDir}
reads = "${projectDir}/data/ggal_gut_{1,2}.fq"
transcriptome = "${projectDir}/data/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
outdir = "results"
}
// Pixi/Conda integration
conda {
enabled = true
useMamba = true // optional: use Mamba's faster solver if Nextflow ever needs to build an env
}
process {
// Point to your Pixi environment
conda = "${projectDir}/.pixi/envs/default"
// Default resources
cpus = 2
memory = '4 GB'
// Error strategy: finish other jobs but stop the pipeline if one fails
errorStrategy = 'finish'
}
// Capture execution details for your lab notebook
report {
enabled = true
file = "${params.outdir}/reports/execution_report.html"
}
Explanation: what's happening here?
- The params block defines the input files and the output directory. Using ${projectDir} keeps the paths relative to your project, making it portable.
- The conda block tells Nextflow to use Conda environments; useMamba requests Mamba's faster resolver. Pixi creates a Conda-compatible environment under .pixi/envs/default, which is where our tools are installed.
- process.conda points Nextflow at that specific Pixi environment, so every task runs with exactly the tools you locked.
- The process block also sets default computational resources for all processes (here, 2 CPUs and 4 GB of memory); you can override these in individual processes. Adjust the cpus and memory values to your system's capabilities and your tasks' demands, allocating more for larger datasets or more computationally intensive steps.
- errorStrategy = 'finish' means that if one process fails, Nextflow finishes any currently running processes but starts no new ones. You can inspect the results of completed steps without further errors cascading.
- The report block enables an HTML execution report covering which processes ran, their status, and any errors encountered. This makes the run self-documenting, which is ideal for a lab notebook and for reproducing or troubleshooting the analysis later.
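The `{1,2}` glob in params.reads is what lets fromFilePairs group R1/R2 files under a shared sample ID. A shell sketch of the same grouping logic, using stand-in files named like this tutorial's reads:

```shell
# Mimic fromFilePairs: derive a sample ID by stripping the part
# that varies between mates ("_1.fq" vs "_2.fq").
touch ggal_gut_1.fq ggal_gut_2.fq   # stand-in read files for the demo

for r1 in *_1.fq; do
  sample=${r1%_1.fq}     # strip the R1 suffix -> sample ID ("ggal_gut")
  r2=${sample}_2.fq      # reconstruct the mate's file name
  if [ -f "$r2" ]; then
    echo "pair: $sample -> $r1 + $r2"
  fi
done
```

Nextflow emits each such pair as a tuple of (sample ID, R1, R2), which is exactly the shape the FASTQC and FASTP inputs declare.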
1.8 Main Run
Run the pipeline with the following command.
pixi run nextflow run main.nf
The first time you run the pipeline, Nextflow sets up the environment, which may take a few minutes. If you need to stop the pipeline for any reason (a power outage, a typo in the code), simply re-run the same command with the -resume flag:
pixi run nextflow run main.nf -resume
Why -resume? Nextflow skips the steps that already finished successfully and picks up where it left off. This is a lifesaver for long-running RNA-seq jobs, saving both time and compute.
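What makes -resume possible is Nextflow's task cache: each task in work/ is keyed by a hash of its script and its inputs, and an unchanged task is simply reused. A toy sketch of that caching idea (illustrative only; this is not Nextflow's actual hashing scheme):

```shell
# Toy task cache: key = hash(input content + script text).
echo "ACGT" > input.txt
task='wc -c < input.txt'

key=$({ cat input.txt; echo "$task"; } | sha256sum | cut -d' ' -f1)
cache="cache_${key}.out"

if [ -f "$cache" ]; then
  echo "resume: cached result reused"   # neither input nor script changed
else
  echo "running task"
  eval "$task" > "$cache"               # run the "task" and cache its output
fi
cat "$cache"
```

Change input.txt or the script text and the key changes, so the task reruns; that is exactly why fixing a typo in one process only re-executes the affected steps.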
Summary for your lab notebook:
- Pixi ensures your environment is identical on your laptop, the lab server, and the HPC.
- Nextflow manages the heavy lifting of file names, directories, and parallel processing.
- Reproducibility: if you share pixi.toml, pixi.lock, main.nf, and nextflow.config, any other student can replicate your results exactly.
1.9 Technical Summary
| Component | Tool | Responsibility |
|---|---|---|
| Orchestration | Nextflow | Job scheduling and data flow |
| Env Manager | Pixi | Lightning-fast dependency resolution |
| Reproducibility | pixi.lock | Strict version locking |
| Reporting | Quarto | Documentation and results |
1.10 Add a new process (e.g., MultiQC)
You can easily expand this workflow by adding new processes. For example, add a MultiQC step to aggregate all your QC results into one report.
By default, MultiQC scans the folder and orders modules based on its own internal priority. To get that specific “ordered” look—where you see General Stats followed by FastQC, then fastp, and finally Salmon—you need a small configuration file. Here’s how you can set that up:
// 5. Reporting
process MULTIQC {
tag "Creating Ordered Report"
publishDir "${params.outdir}/reports", mode: 'copy'
input:
path all_logs // All the tool outputs
path mqc_config // The .yaml file we just made
output:
path "multiqc_report.html"
script:
// We add the -c flag to point to our config
"""
multiqc . -c ${mqc_config}
"""
}
workflow {
// COLLECT all outputs into one list for MultiQC
// This ensures MultiQC waits for everything to finish
ch_multiqc_input = fastqc_out.collect()
.mix(fastp_out.json.collect())
.mix(salmon_out.collect())
.collect()
// Pass the logs AND the config file
MULTIQC(ch_multiqc_input, "${projectDir}/multiqc_config.yaml")
}
This runs MultiQC after all other processes have completed. The collect() calls gather the emitted output paths into a single list that MultiQC receives at once, which also forces MultiQC to wait for everything upstream to finish. This is the beauty of Nextflow's dataflow programming model: it lets you connect processes and manage complex workflows with minimal code changes.
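In shell terms, collect() is the difference between invoking a tool once per file and once over the whole gathered set; MultiQC needs the latter. A tiny illustration with hypothetical log names:

```shell
# Stage some stand-in log files, as if FastQC, fastp and Salmon had run
mkdir -p demo_logs
touch demo_logs/a_fastqc.zip demo_logs/a.json demo_logs/a_quant.log

# Without collect(): the tool would run once per file (three invocations).
# With collect(): all paths arrive as one list, i.e. a single invocation:
set -- demo_logs/*
echo "one invocation over $# gathered inputs"
```

The .mix() calls play a similar role across channels: they merge the FastQC, fastp, and Salmon outputs into one stream before the final collect() bundles them.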
We added a new process called MULTIQC that takes all the logs from the previous steps and generates a comprehensive report. The multiqc_config.yaml file specifies the order of modules in the MultiQC report, so the most important QC metrics are highlighted first.
Here is how to set up multiqc_config.yaml and update your workflow to use it.
- Create the config file (multiqc_config.yaml) in your project root. It tells MultiQC exactly how to structure the report:
# multiqc_config.yaml
report_header_info:
project_name: "RNA-Seq Analysis with Nextflow + Pixi"
contact_email: "abu.siddique@slu.se"
# This defines the vertical order of the modules in the report
top_modules:
- fastqc
- fastp
- salmon
module_order:
- fastqc:
name: "Raw Read Quality (FastQC)"
anchor: "fastqc_raw"
- fastp:
name: "Adapter Trimming (fastp)"
- salmon:
name: "Quantification (Salmon)"
# Customizes the General Statistics table columns
table_columns_visible:
fastqc:
percent_duplicates: False
percent_gc: True
fastp:
pct_trimmed: True
salmon:
percent_mapped: True
Now pass this multiqc_config.yaml file to the MULTIQC process in your workflow, so that MultiQC uses your custom configuration when generating the report. Update main.nf to include the config file as an input:
- Updated main.nf with the MultiQC process:
/*
* RNA-Seq Pipeline: FastQC -> Fastp -> Salmon -> MultiQC
*/
// ===========================================
// PROCESSES
// ===========================================
// 1. Indexing (Run once per transcriptome)
process SALMON_INDEX {
tag "${transcriptome.baseName}"
publishDir "${params.outdir}/salmon_index", mode: 'copy'
input: path transcriptome
output: path "salmon_index", emit: index
script: "salmon index -t ${transcriptome} -i salmon_index -k 15" // small k suits the tiny demo reference; Salmon's default is k=31
}
// 2. Quality Control
process FASTQC {
tag "${pair_id}"
publishDir "${params.outdir}/fastqc", mode: 'copy'
input: tuple val(pair_id), path(r1), path(r2)
output: path "*_fastqc.{zip,html}"
script: "fastqc ${r1} ${r2}"
}
// 3. Trimming
process FASTP {
tag "${pair_id}"
// No publishDir here usually, to save space. Results go to Salmon.
input: tuple val(pair_id), path(r1), path(r2)
output:
tuple val(pair_id), path("tr_${pair_id}_R1.fq.gz"), path("tr_${pair_id}_R2.fq.gz"), emit: reads
path "*.json", emit: json
script:
"""
fastp -i ${r1} -I ${r2} \
-o tr_${pair_id}_R1.fq.gz -O tr_${pair_id}_R2.fq.gz \
-j ${pair_id}.json
"""
}
// 4. Quantification
process SALMON_QUANT {
tag "${pair_id}"
publishDir "${params.outdir}/counts", mode: 'copy'
input:
tuple val(pair_id), path(r1), path(r2)
path index
output:
path "${pair_id}_quant"
script:
"""
salmon quant -i ${index} -l A \
-1 ${r1} -2 ${r2} \
-o ${pair_id}_quant
"""
}
// 5. Reporting
process MULTIQC {
tag "Creating Ordered Report"
publishDir "${params.outdir}/reports", mode: 'copy'
input:
path all_logs // All the tool outputs
path mqc_config // The .yaml file we just made
output:
path "multiqc_report.html"
script:
// We add the -c flag to point to our config
"""
multiqc . -c ${mqc_config}
"""
}
// ===========================================
// WORKFLOW
// ===========================================
workflow {
// Read input pairs (flat: true gives val, path, path)
read_ch = channel.fromFilePairs(params.reads, flat: true)
// Run Indexing once
SALMON_INDEX(params.transcriptome)
// Run QC and Trimming in parallel
fastqc_out = FASTQC(read_ch)
fastp_out = FASTP(read_ch)
// Run Salmon using the index and the trimmed reads
salmon_out = SALMON_QUANT(fastp_out.reads, SALMON_INDEX.out.index)
// COLLECT all outputs into one list for MultiQC
// This ensures MultiQC waits for everything to finish
ch_multiqc_input = fastqc_out.collect()
.mix(fastp_out.json.collect())
.mix(salmon_out.collect())
.collect()
// Pass the logs AND the config file
MULTIQC(ch_multiqc_input, "${projectDir}/multiqc_config.yaml")
}
New run: the same command, with the resume flag.
pixi run nextflow run main.nf -resume
1.11 Monitoring Results
Once the run finishes, check the results/ folder. You should see subfolders for fastqc, counts, salmon_index, and reports. main.nf now runs MultiQC after all steps complete; the combined multiqc_report.html lands in results/reports/, summarizing all your QC metrics in one place.
💡 Note: results/reports also contains Nextflow's own detailed execution report. Check it if you want to see the exact commands that were run, the resources used, and any errors that occurred during execution.
Result checklist
How to interpret the Salmon output: inside results/counts/[sample]_quant/, the most important file is quant.sf. Its key columns are:
- Name: the transcript ID.
- Length: the length of the sequence.
- TPM (Transcripts Per Million): a normalized value used to compare expression levels within a sample.
- NumReads: the estimated number of reads assigned to this transcript (the "raw counts"). Use these for DESeq2.
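To see which transcripts dominate a sample, you can rank quant.sf by TPM straight from the shell. The sketch below builds a tiny hypothetical quant.sf (the transcript IDs and values are invented); substitute results/counts/[sample]_quant/quant.sf for real data:

```shell
# quant.sf is tab-separated: Name, Length, EffectiveLength, TPM, NumReads
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' >  quant_demo.sf
printf 'DEMO_TX_001\t1200\t1050.0\t350.5\t421.0\n'      >> quant_demo.sf
printf 'DEMO_TX_002\t800\t650.0\t120.2\t98.0\n'         >> quant_demo.sf

# Rank by TPM (column 4), highest first, skipping the header line
tail -n +2 quant_demo.sf | sort -k4,4 -g -r | cut -f1,4
```

The same one-liner on a real quant.sf gives a quick "top expressed transcripts" list before you ever load the counts into R.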
Summary
| Tool | Purpose | Graduate Tip |
|---|---|---|
| Pixi | Environment | Run pixi list to see exactly which versions are in your thesis. |
| Nextflow | Orchestration | Use the work/ directory to debug if a specific step fails. |
| Salmon | Indexing | Run once per reference. Store the index for future runs. |
| FastQC | Quality Control | Look for “Per base sequence quality” and “Adapter Content” in the report. |
| Fastp | Trimming | Check the adapter-trimming and filtering stats in the MultiQC report. |
| Salmon | Quantification | TPM is for visualization; Raw Counts are for statistics. |
| MultiQC | Reporting | Customize the config to highlight the most important metrics for your thesis. |
Troubleshooting
| Error Message | Likely Cause | Solution |
|---|---|---|
| 0 processes matched | Your file path in nextflow.config is wrong | Check the extension (e.g., .fastq.gz vs .fq.gz) |
| Permission denied | Nextflow can't write to the work/ folder | Check your directory permissions; don't run from a read-only drive |
| Java version error | System Java is older than 17 | Ensure you ran pixi add openjdk and use pixi run |