Reproducible RNA-Seq analysis with Nextflow + Pixi
1 Nextflow
1.1 The Philosophy
Nextflow is a powerful workflow manager that allows you to define complex bioinformatics pipelines in a clear and modular way. Pixi is a modern package manager that ensures your software dependencies are consistent and reproducible across different environments.
In research, reproducibility is king. If a colleague can’t run your code 6 months from now, the science is incomplete. This stack ensures that every dependency (Java, STAR, Fastp) is locked to a specific version and hash.
1.2 Install Pixi
Pixi is a modern package and project manager. Unlike classic Conda, it is written in Rust and is significantly faster and more reliable.
# Install Pixi
curl -fsSL https://pixi.sh/install.sh | bash
source ~/.bashrc # or ~/.zshrc, depending on your shell; loads pixi into your terminal session
pixi --version # should print the version number
IMPORTANT: Restart your terminal (or re-source your shell config) after installation.
1.3 Initialize Project
Create a new directory for your analysis and move into it.
mkdir rna_seq_project # create a new folder for your project
cd rna_seq_project # move into the project folder
Now initialize the project with Pixi. This creates a pixi.toml file in your directory. Think of this file as the "DNA" of your project: it lists every piece of software and every dependency you will use.
pixi init --channel conda-forge --channel bioconda # initialize a new project with the specified channels
1.4 Add Bioinfo tools
We need to add our bioinformatics tools. Nextflow requires Java; for RNA-seq analysis we typically use fastqc for quality control, fastp for read trimming and filtering, and salmon for transcript quantification. We'll also add samtools for BAM file handling and multiqc for summarizing QC reports.
pixi add nextflow "openjdk>=17" fastqc fastp salmon samtools multiqc # add dependencies to your project (add star etc. as needed)
Note:
- The command above automatically creates a pixi.lock file. This lock file is what makes your pipeline 100% reproducible on this and other machines.
- Never edit pixi.lock manually; it stores the exact versions and hashes of your software.
- We include openjdk because Nextflow requires a Java environment to run.
Here is the pixi.toml file after adding the dependencies. It is human-readable and shows the software you have added, but it does not contain the exact versions or hashes; the pixi.lock file stores that information:
[workspace]
name = "nextflow-pixi-simple"
channels = ["conda-forge", "bioconda"]
platforms = ["linux-64"] # ✅ safest for WSL/Linux. Remove extra platforms.
[dependencies]
nextflow = ">=25.10.4,<26" # Nextflow itself (the workflow manager)
openjdk = ">=17" # Java environment for Nextflow
fastp = ">=1.3.1,<2" # Fastp for quality control
samtools = ">=1.22.1,<2" # Samtools for BAM file handling
multiqc = ">=1.33,<2" # MultiQC for summarizing QC reports
fastqc = ">=0.12.1,<0.13" # FastQC for quality control
salmon = ">=1.10.3,<2" # Salmon for transcript quantification
1.5 Prepare data
Bioinformatics pipelines expect specific folder structures. Let's say our RNA-seq data (FASTQ files and a reference) lives in a folder called data/. Create that structure, then add the example files (or your own):
mkdir data # create a data folder for your input files
cd data # move into the data folder
For this tutorial, we download the small example files from the Nextflow rnaseq-nf repository; replace them with your actual data when you have it.
Reference (genome or transcriptome):
wget https://raw.githubusercontent.com/nextflow-io/rnaseq-nf/ca20a6dfd2d799b903e557dc2e736d55419c7a1a/data/ggal/ggal_1_48850000_49020000.Ggal71.500bpflank.fa
FASTQ files, R1:
wget https://raw.githubusercontent.com/nextflow-io/rnaseq-nf/ca20a6dfd2d799b903e557dc2e736d55419c7a1a/data/ggal/ggal_gut_1.fq
R2:
wget https://raw.githubusercontent.com/nextflow-io/rnaseq-nf/ca20a6dfd2d799b903e557dc2e736d55419c7a1a/data/ggal/ggal_gut_2.fq
Note: use the raw.githubusercontent.com form of the URL; the github.com .../blob/... page URL would download an HTML page instead of the file. When you are done, return to the project root with cd ..
For your own data, place your reference .fa file and your .fq or .fastq.gz files inside data/.
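Before wiring files into the pipeline, a quick sanity check catches truncated downloads (a common failure mode with the wrong URL form). The sketch below runs on a dummy record it creates itself; point the same commands at your real data/*.fq files:

```shell
# A valid FASTQ file has 4 lines per record, and each record's
# first line starts with '@'. Create a one-record dummy file:
cat > demo.fq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
EOF

lines=$(wc -l < demo.fq)
if [ $((lines % 4)) -eq 0 ]; then
  echo "OK: $((lines / 4)) complete record(s)"
else
  echo "WARNING: line count not a multiple of 4 (truncated file?)"
fi

# Every 1st line of each 4-line record must start with '@'
awk 'NR % 4 == 1 && $0 !~ /^@/ { bad++ } END { print (bad ? "bad headers: " bad : "headers OK") }' demo.fq
```

If a download gave you an HTML page instead of a FASTQ file, both checks fail immediately, long before Nextflow produces a confusing error.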
1.6 Write main.nf
Create your main.nf script, for example with nano main.nf. This file defines the "processes" (the tasks) and the "workflow" (how data flows between them). Nextflow's syntax is straightforward: each process has inputs, outputs, and a script section with the commands to run, and the workflow section connects everything together.
Here is a simple example that performs quality control with FastQC, trims reads with fastp, and quantifies transcripts against a reference transcriptome with Salmon. You can expand it with more processes (e.g., STAR for genome alignment, DESeq2 for differential expression) as needed; the key is to keep it modular and clear.
/*
* RNA-Seq Pipeline: FastQC -> Fastp -> Salmon -> MultiQC
*/
// ===========================================
// PROCESSES
// ===========================================
// 1. Indexing (Run once per transcriptome)
process SALMON_INDEX {
tag "${transcriptome.baseName}"
publishDir "${params.outdir}/salmon_index", mode: 'copy'
input: path transcriptome
output: path "salmon_index", emit: index
script: "salmon index -t ${transcriptome} -i salmon_index -k 15" // small k suits the tiny demo reference; Salmon's default is k=31
}
// 2. Quality Control
process FASTQC {
tag "${pair_id}"
publishDir "${params.outdir}/fastqc", mode: 'copy'
input: tuple val(pair_id), path(r1), path(r2)
output: path "*_fastqc.{zip,html}"
script: "fastqc ${r1} ${r2}"
}
// 3. Trimming
process FASTP {
tag "${pair_id}"
// No publishDir here usually, to save space. Results go to Salmon.
input: tuple val(pair_id), path(r1), path(r2)
output:
tuple val(pair_id), path("tr_${pair_id}_R1.fq.gz"), path("tr_${pair_id}_R2.fq.gz"), emit: reads
path "*.json", emit: json
script:
"""
fastp -i ${r1} -I ${r2} \
-o tr_${pair_id}_R1.fq.gz -O tr_${pair_id}_R2.fq.gz \
-j ${pair_id}.json
"""
}
// 4. Quantification
process SALMON_QUANT {
tag "${pair_id}"
publishDir "${params.outdir}/counts", mode: 'copy'
input:
tuple val(pair_id), path(r1), path(r2)
path index
output:
path "${pair_id}_quant"
script:
"""
salmon quant -i ${index} -l A \
-1 ${r1} -2 ${r2} \
-o ${pair_id}_quant
"""
}
// ===========================================
// WORKFLOW
// ===========================================
workflow {
// Read input pairs (flat: true gives val, path, path)
read_ch = channel.fromFilePairs(params.reads, flat: true)
// Run Indexing once
SALMON_INDEX(params.transcriptome)
// Run QC and Trimming in parallel
fastqc_out = FASTQC(read_ch)
fastp_out = FASTP(read_ch)
// Run Salmon using the index and the trimmed reads
salmon_out = SALMON_QUANT(fastp_out.reads, SALMON_INDEX.out.index)
}
1.7 Config file nextflow.config
Nextflow needs to know where to find the software Pixi just installed. Create a nextflow.config file; it handles the "where" and "how."
Define your parameters and resources here to keep the workflow flexible and reusable. First, we tell Nextflow to use the environment created by Pixi. Then we set our input files and parameters. Finally, we define default resources for all processes.
// nextflow.config
params {
// Inputs - Use absolute paths or ${projectDir}
reads = "${projectDir}/data/ggal_gut_{1,2}.fq"
transcriptome = "${projectDir}/data/ggal_1_48850000_49020000.Ggal71.500bpflank.fa"
outdir = "results"
}
// Pixi/Conda integration
conda {
enabled = true
useMamba = true // optional: use Mamba's faster solver if Nextflow ever needs to build an env
}
process {
// Point to your Pixi environment
conda = "${projectDir}/.pixi/envs/default"
// Default resources
cpus = 2
memory = '4 GB'
// Error strategy: finish other jobs but stop the pipeline if one fails
errorStrategy = 'finish'
}
// Capture execution details for your lab notebook
report {
enabled = true
file = "${params.outdir}/reports/execution_report.html"
}
Explanation: what's happening here?
- The params block defines the input files and the output directory. Using ${projectDir} keeps the paths relative to your project, making it portable.
- The conda block tells Nextflow to use Conda environments; useMamba requests Mamba's faster resolver. Pixi creates a Conda-compatible environment under .pixi/envs/default, which is where our tools are installed.
- process.conda points Nextflow at that specific Pixi environment, so every task runs with exactly the tools you locked.
- The process block also sets default computational resources for all processes (here, 2 CPUs and 4 GB of memory); you can override these in individual processes. Adjust the cpus and memory values to your system's capabilities and your tasks' demands, allocating more for larger datasets or more computationally intensive steps.
- errorStrategy = 'finish' means that if one process fails, Nextflow finishes any currently running processes but starts no new ones. You can inspect the results of completed steps without further errors cascading.
- The report block enables an HTML execution report covering which processes ran, their status, and any errors encountered. This makes the run self-documenting, which is ideal for a lab notebook and for reproducing or troubleshooting the analysis later.
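The `{1,2}` glob in params.reads is what lets fromFilePairs group R1/R2 files under a shared sample ID. A shell sketch of the same grouping logic, using stand-in files named like this tutorial's reads:

```shell
# Mimic fromFilePairs: derive a sample ID by stripping the part
# that varies between mates ("_1.fq" vs "_2.fq").
touch ggal_gut_1.fq ggal_gut_2.fq   # stand-in read files for the demo

for r1 in *_1.fq; do
  sample=${r1%_1.fq}     # strip the R1 suffix -> sample ID ("ggal_gut")
  r2=${sample}_2.fq      # reconstruct the mate's file name
  if [ -f "$r2" ]; then
    echo "pair: $sample -> $r1 + $r2"
  fi
done
```

Nextflow emits each such pair as a tuple of (sample ID, R1, R2), which is exactly the shape the FASTQC and FASTP inputs declare.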
1.8 Main Run
Run the pipeline with the following command.
pixi run nextflow run main.nf
The first time you run the pipeline, Nextflow sets up the environment, which may take a few minutes. If you need to stop the pipeline for any reason (a power outage, a typo in the code), simply re-run the same command with the -resume flag:
pixi run nextflow run main.nf -resume
Why -resume? Nextflow skips the steps that already finished successfully and picks up where it left off. This is a lifesaver for long-running RNA-seq jobs, saving both time and compute.
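What makes -resume possible is Nextflow's task cache: each task in work/ is keyed by a hash of its script and its inputs, and an unchanged task is simply reused. A toy sketch of that caching idea (illustrative only; this is not Nextflow's actual hashing scheme):

```shell
# Toy task cache: key = hash(input content + script text).
echo "ACGT" > input.txt
task='wc -c < input.txt'

key=$({ cat input.txt; echo "$task"; } | sha256sum | cut -d' ' -f1)
cache="cache_${key}.out"

if [ -f "$cache" ]; then
  echo "resume: cached result reused"   # neither input nor script changed
else
  echo "running task"
  eval "$task" > "$cache"               # run the "task" and cache its output
fi
cat "$cache"
```

Change input.txt or the script text and the key changes, so the task reruns; that is exactly why fixing a typo in one process only re-executes the affected steps.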
Summary for your lab notebook:
- Pixi ensures your environment is identical on your laptop, the lab server, and the HPC.
- Nextflow manages the heavy lifting of file names, directories, and parallel processing.
- Reproducibility: if you share pixi.toml, pixi.lock, main.nf, and nextflow.config, any other student can replicate your results exactly.
1.9 Technical Summary
| Component | Tool | Responsibility |
|---|---|---|
| Orchestration | Nextflow | Job scheduling and data flow |
| Env Manager | Pixi | Lightning-fast dependency resolution |
| Reproducibility | pixi.lock | Strict version locking |
| Reporting | Quarto | Documentation and results |
1.10 Add a new process (e.g., MultiQC)
You can easily expand this workflow by adding new processes. For example, add a MultiQC step to aggregate all your QC results into one report.
By default, MultiQC scans the folder and orders modules based on its own internal priority. To get that specific “ordered” look—where you see General Stats followed by FastQC, then fastp, and finally Salmon—you need a small configuration file. Here’s how you can set that up:
// 5. Reporting
process MULTIQC {
tag "Creating Ordered Report"
publishDir "${params.outdir}/reports", mode: 'copy'
input:
path all_logs // All the tool outputs
path mqc_config // The .yaml file we just made
output:
path "multiqc_report.html"
script:
// We add the -c flag to point to our config
"""
multiqc . -c ${mqc_config}
"""
}
workflow {
// COLLECT all outputs into one list for MultiQC
// This ensures MultiQC waits for everything to finish
ch_multiqc_input = fastqc_out.collect()
.mix(fastp_out.json.collect())
.mix(salmon_out.collect())
.collect()
// Pass the logs AND the config file
MULTIQC(ch_multiqc_input, "${projectDir}/multiqc_config.yaml")
}
This runs MultiQC after all other processes have completed. The collect() calls gather the emitted output paths into a single list that MultiQC receives at once, which also forces MultiQC to wait for everything upstream to finish. This is the beauty of Nextflow's dataflow programming model: it lets you connect processes and manage complex workflows with minimal code changes.
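In shell terms, collect() is the difference between invoking a tool once per file and once over the whole gathered set; MultiQC needs the latter. A tiny illustration with hypothetical log names:

```shell
# Stage some stand-in log files, as if FastQC, fastp and Salmon had run
mkdir -p demo_logs
touch demo_logs/a_fastqc.zip demo_logs/a.json demo_logs/a_quant.log

# Without collect(): the tool would run once per file (three invocations).
# With collect(): all paths arrive as one list, i.e. a single invocation:
set -- demo_logs/*
echo "one invocation over $# gathered inputs"
```

The .mix() calls play a similar role across channels: they merge the FastQC, fastp, and Salmon outputs into one stream before the final collect() bundles them.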
We added a new process called MULTIQC that takes all the logs from the previous steps and generates a comprehensive report. The multiqc_config.yaml file specifies the order of modules in the MultiQC report, so the most important QC metrics are highlighted first.
Here is how to set up multiqc_config.yaml and update your workflow to use it.
- Create the config file (multiqc_config.yaml) in your project root. It tells MultiQC exactly how to structure the report:
# multiqc_config.yaml
report_header_info:
project_name: "RNA-Seq Analysis with Nextflow + Pixi"
contact_email: "abu.siddique@slu.se"
# This defines the vertical order of the modules in the report
top_modules:
- fastqc
- fastp
- salmon
module_order:
- fastqc:
name: "Raw Read Quality (FastQC)"
anchor: "fastqc_raw"
- fastp:
name: "Adapter Trimming (fastp)"
- salmon:
name: "Quantification (Salmon)"
# Customizes the General Statistics table columns
table_columns_visible:
fastqc:
percent_duplicates: False
percent_gc: True
fastp:
pct_trimmed: True
salmon:
percent_mapped: True
Now pass this multiqc_config.yaml file to the MULTIQC process in your workflow, so that MultiQC uses your custom configuration when generating the report. Update main.nf to include the config file as an input:
- Updated main.nf with the MultiQC process:
/*
* RNA-Seq Pipeline: FastQC -> Fastp -> Salmon -> MultiQC
*/
// ===========================================
// PROCESSES
// ===========================================
// 1. Indexing (Run once per transcriptome)
process SALMON_INDEX {
tag "${transcriptome.baseName}"
publishDir "${params.outdir}/salmon_index", mode: 'copy'
input: path transcriptome
output: path "salmon_index", emit: index
script: "salmon index -t ${transcriptome} -i salmon_index -k 15" // small k suits the tiny demo reference; Salmon's default is k=31
}
// 2. Quality Control
process FASTQC {
tag "${pair_id}"
publishDir "${params.outdir}/fastqc", mode: 'copy'
input: tuple val(pair_id), path(r1), path(r2)
output: path "*_fastqc.{zip,html}"
script: "fastqc ${r1} ${r2}"
}
// 3. Trimming
process FASTP {
tag "${pair_id}"
// No publishDir here usually, to save space. Results go to Salmon.
input: tuple val(pair_id), path(r1), path(r2)
output:
tuple val(pair_id), path("tr_${pair_id}_R1.fq.gz"), path("tr_${pair_id}_R2.fq.gz"), emit: reads
path "*.json", emit: json
script:
"""
fastp -i ${r1} -I ${r2} \
-o tr_${pair_id}_R1.fq.gz -O tr_${pair_id}_R2.fq.gz \
-j ${pair_id}.json
"""
}
// 4. Quantification
process SALMON_QUANT {
tag "${pair_id}"
publishDir "${params.outdir}/counts", mode: 'copy'
input:
tuple val(pair_id), path(r1), path(r2)
path index
output:
path "${pair_id}_quant"
script:
"""
salmon quant -i ${index} -l A \
-1 ${r1} -2 ${r2} \
-o ${pair_id}_quant
"""
}
// 5. Reporting
process MULTIQC {
tag "Creating Ordered Report"
publishDir "${params.outdir}/reports", mode: 'copy'
input:
path all_logs // All the tool outputs
path mqc_config // The .yaml file we just made
output:
path "multiqc_report.html"
script:
// We add the -c flag to point to our config
"""
multiqc . -c ${mqc_config}
"""
}
// ===========================================
// WORKFLOW
// ===========================================
workflow {
// Read input pairs (flat: true gives val, path, path)
read_ch = channel.fromFilePairs(params.reads, flat: true)
// Run Indexing once
SALMON_INDEX(params.transcriptome)
// Run QC and Trimming in parallel
fastqc_out = FASTQC(read_ch)
fastp_out = FASTP(read_ch)
// Run Salmon using the index and the trimmed reads
salmon_out = SALMON_QUANT(fastp_out.reads, SALMON_INDEX.out.index)
// COLLECT all outputs into one list for MultiQC
// This ensures MultiQC waits for everything to finish
ch_multiqc_input = fastqc_out.collect()
.mix(fastp_out.json.collect())
.mix(salmon_out.collect())
.collect()
// Pass the logs AND the config file
MULTIQC(ch_multiqc_input, "${projectDir}/multiqc_config.yaml")
}
New run: the same command, with the resume flag.
pixi run nextflow run main.nf -resume
1.11 Monitoring Results
Once the run finishes, check the results/ folder. You should see subfolders for fastqc, counts, salmon_index, and reports. main.nf now runs MultiQC after all steps complete; the combined multiqc_report.html lands in results/reports/, summarizing all your QC metrics in one place.
💡 Note: results/reports also contains Nextflow's own detailed execution report. Check it if you want to see the exact commands that were run, the resources used, and any errors that occurred during execution.
Result checklist
How to interpret the Salmon output: inside results/counts/[sample]_quant/, the most important file is quant.sf. Its key columns are:
- Name: the transcript ID.
- Length: the length of the sequence.
- TPM (Transcripts Per Million): a normalized value used to compare expression levels within a sample.
- NumReads: the estimated number of reads assigned to this transcript (the "raw counts"). Use these for DESeq2.
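To see which transcripts dominate a sample, you can rank quant.sf by TPM straight from the shell. The sketch below builds a tiny hypothetical quant.sf (the transcript IDs and values are invented); substitute results/counts/[sample]_quant/quant.sf for real data:

```shell
# quant.sf is tab-separated: Name, Length, EffectiveLength, TPM, NumReads
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' >  quant_demo.sf
printf 'DEMO_TX_001\t1200\t1050.0\t350.5\t421.0\n'      >> quant_demo.sf
printf 'DEMO_TX_002\t800\t650.0\t120.2\t98.0\n'         >> quant_demo.sf

# Rank by TPM (column 4), highest first, skipping the header line
tail -n +2 quant_demo.sf | sort -k4,4 -g -r | cut -f1,4
```

The same one-liner on a real quant.sf gives a quick "top expressed transcripts" list before you ever load the counts into R.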
Summary
| Tool | Purpose | Graduate Tip |
|---|---|---|
| Pixi | Environment | Run pixi list to see exactly which versions are in your thesis. |
| Nextflow | Orchestration | Use the work/ directory to debug if a specific step fails. |
| Salmon | Indexing | Run once per reference. Store the index for future runs. |
| FastQC | Quality Control | Look for “Per base sequence quality” and “Adapter Content” in the report. |
| Fastp | Trimming | Check the adapter-trimming and filtering stats in the MultiQC report. |
| Salmon | Quantification | TPM is for visualization; Raw Counts are for statistics. |
| MultiQC | Reporting | Customize the config to highlight the most important metrics for your thesis. |
Troubleshooting
| Error Message | Likely Cause | Solution |
|---|---|---|
| 0 processes matched | Your file path in nextflow.config is wrong | Check the extension (e.g., .fastq.gz vs .fq.gz) |
| Permission denied | Nextflow can't write to the work/ folder | Check your directory permissions; don't run from a read-only drive |
| Java version error | System Java is older than 17 | Ensure you ran pixi add openjdk and use pixi run |