This section is a work in progress and documents a full workflow for analyzing direct RNA-seq data from Oxford Nanopore, using the example dataset PRJNA1264749. The goal is to reproduce published results and explore additional discoveries using modern long-read tools.
Table of contents
Nanopore Direct RNA-seq Analysis (WIP)
Introduction
This tutorial walks through a full analysis of direct RNA-seq data generated with Oxford Nanopore technology. We use the public dataset PRJNA1264749 as an example, aiming to reproduce published findings and explore new insights, including differential gene expression and alternative splicing.
The workflow covers:
- Quality control (QC)
- Alignment
- Quantification
- Differential gene expression (DGE)
- Alternative splicing analysis
All scripts and commands are provided, and the workflow is modular and reproducible.
Data Preparation
Example Dataset
- Accession: PRJNA1264749
- Data: Direct RNA-seq (Nanopore), WT and mutant (mettl5-1) lines, 3 replicates each
- Directory structure:
PRJNA1264749/
contains all FASTQ files
Environment Setup
Create an environment.yml
for conda:
name: nanopore-rnaseq
channels:
- conda-forge
- bioconda
dependencies:
- python=3.9
- minimap2
- subread
- isoquant
- suppa
- r-ggrepel
- samtools
- pandas
- pip
- parallel
Install and activate:
conda env create -f environment.yml
conda activate nanopore-rnaseq
pip install NanoPlot pycoQC "scipy<1.11" "statsmodels<1.15" pandas numpy suppa
Downloading FASTQ Files
Before starting the analysis, download all FASTQ files for your experiment using the provided Python script. This script will download and verify all FASTQ files for a given ENA experiment accession (e.g., PRJNA1264749) and save them in a folder named after the experiment.
Requirements:
- Python 3
- pandas
- axel (for fast FTP downloads)
Usage:
python download.fastq.py
Edit the experiment
variable in the script to your ENA accession if needed.
What the script does:
- Downloads all FASTQ files for the experiment using axel
- Renames files to {accession}.fastq.gz
- Verifies MD5 checksums
- Skips files that already exist and are verified
Script docstring:
"""
download.fastq.py
Downloads all FASTQ files for a given ENA experiment accession (e.g., PRJNA1264749), verifies their MD5 checksums, and saves them in a folder named after the experiment.
Requirements:
- Python 3
- pandas
- axel (for fast FTP downloads)
Usage:
python download.fastq.py
Edit the 'experiment' variable in the script to your ENA accession if needed.
This script will:
- Download all FASTQ files for the experiment using axel
- Rename files to {accession}.fastq.gz
- Verify MD5 checksums
- Skip files that already exist and are verified
"""
Quality Control (QC)
Run QC on all FASTQ files using NanoPlot:
bash nanopore.qc.sh PRJNA1264749
Where nanopore.qc.sh
is:
#!/usr/bin/env bash
# Usage: ./nanopore.qc.sh <experiment_folder>
set -euo pipefail
EXPERIMENT="$1"
QC_DIR="$EXPERIMENT/qc"
mkdir -p "$QC_DIR"
for fq in "$EXPERIMENT"/*.fastq "$EXPERIMENT"/*.fastq.gz; do
if [ -e "$fq" ]; then
fname=$(basename "$fq")
outdir="$QC_DIR/${fname%.fastq*}_NanoPlot"
NanoPlot --fastq "$fq" -o "$outdir"
fi
done
echo "QC complete. Results in $QC_DIR/"
Alignment
Align all reads to the reference genome using minimap2:
bash nanopore.aln.sh PRJNA1264749 AtCol-0.fa
Where nanopore.aln.sh
is:
#!/usr/bin/env bash
# Usage: ./nanopore.aln.sh <experiment_folder> <reference.fa>
set -euo pipefail
EXPERIMENT="$1"
REF="$2"
BAM_DIR="$EXPERIMENT/bam"
mkdir -p "$BAM_DIR"
for fq in "$EXPERIMENT"/*.fastq "$EXPERIMENT"/*.fastq.gz; do
if [ -e "$fq" ]; then
fname=$(basename "$fq")
sample="${fname%%.fastq*}"
bam_out="$BAM_DIR/${sample}.bam"
echo "Aligning $fq -> $bam_out"
minimap2 -ax splice -uf -k14 "$REF" "$fq" | samtools sort -o "$bam_out"
samtools index "$bam_out"
fi
done
echo "Alignment and indexing complete. BAM files in $BAM_DIR/"
Quantification
Count reads per gene using featureCounts:
bash nanopore.cnt.sh PRJNA1264749 AtCol-0.gtf
Where nanopore.cnt.sh
is:
#!/usr/bin/env bash
# Usage: ./nanopore.cnt.sh <experiment_folder> <annotation.gtf>
set -euo pipefail
EXPERIMENT="$1"
GTF="$2"
BAM_DIR="$EXPERIMENT/bam"
QUANT_DIR="$EXPERIMENT/quant"
mkdir -p "$QUANT_DIR"
BAM_FILES=("$BAM_DIR"/*.bam)
if [ "${#BAM_FILES[@]}" -eq 0 ]; then
echo "No BAM files found in $BAM_DIR."
exit 1
fi
featureCounts -L -a "$GTF" -o "$QUANT_DIR/counts.txt" "${BAM_FILES[@]}"
echo "featureCounts complete. Results in $QUANT_DIR/counts.txt"
Differential Gene Expression (DGE)
Run DGE analysis in R using edgeR:
# nanopore.deg.R
library(edgeR)
library(ggplot2)
experiment <- "PRJNA1264749"
deg_dir <- file.path(experiment, "deg")
dir.create(deg_dir, showWarnings=FALSE, recursive=TRUE)
counts <- read.delim(file.path(experiment, "quant/counts.txt"), comment.char="#", row.names=1, check.names=FALSE)
counts <- counts[,6:ncol(counts)]
colnames(counts) <- basename(colnames(counts))
colnames(counts) <- sub("\\.[^.]*$", "", colnames(counts))
rownames(counts) <- sub("\\..*$", "", rownames(counts))
group_info <- read.csv("sample_table.csv", row.names=1)
counts <- counts[, rownames(group_info)]
group <- factor(group_info$condition)
dge <- DGEList(counts=counts, group=group)
keep <- filterByExpr(dge)
dge <- dge[keep, , keep.lib.sizes=FALSE]
dge <- calcNormFactors(dge)
design <- model.matrix(~group)
dge <- estimateDisp(dge, design)
fit <- glmQLFit(dge, design)
qlf <- glmQLFTest(fit, coef=2)
write.csv(topTags(qlf, n=Inf), file=file.path(deg_dir, "dge_edger_results.csv"))
pdf(file.path(deg_dir, "dge_edger_MAplot.pdf"))
plotMD(qlf)
dev.off()
logCPM <- cpm(dge, log=TRUE, prior.count=1)
pca <- prcomp(t(logCPM), scale.=TRUE)
pca_df <- data.frame(PC1 = pca$x[,1], PC2 = pca$x[,2], condition = group)
pdf(file.path(deg_dir, "dge_edger_PCAplot.pdf"))
ggplot(pca_df, aes(x=PC1, y=PC2, color=condition)) +
geom_point(size=4) +
labs(title="PCA of samples", x="PC1", y="PC2") +
theme_minimal()
dev.off()
write.csv(logCPM, file=file.path(deg_dir, "cpm_matrix.csv"))
Isoform Quantification (IsoQuant)
Run IsoQuant on all BAMs to get isoform-level TPMs:
python nanopore.isq.py PRJNA1264749 AtCol-0.fixed.gtf AtCol-0.fa
Where nanopore.isq.py
is:
# Only the IsoQuant step is run here; see the full script in the repo
# Output: PRJNA1264749/quant/isoquant_all/OUT/OUT.discovered_transcript_grouped_tpm.tsv
Alternative Splicing Analysis (SUPPA2)
Run SUPPA2 event calculation and splicing analysis:
python nanopore.dsu.py PRJNA1264749 sample_table.csv
Where nanopore.dsu.py
is:
# See the full script in the repo for all steps:
# - Event calculation (generateEvents)
# - Combine .ioe files
# - Clean headers
# - Patch TPM file for missing transcripts
# - Remove 'gene_id' from header
# - Run psiPerEvent
# - Split files by condition
# - Run diffSplice
Notes and Tips
- If your GTF file does not contain gene rows, use the provided
fix.gtf.py
script to add them before running IsoQuant. - This workflow is a work in progress and will be updated as new best practices and tools emerge.
References
This tutorial is under active development. Please check back for updates and improvements!