This section includes code for processing and analyzing transcriptomics data.
RNA-seq data analysis
In this tutorial, I will walk through how I analyze short-read RNA-seq data and how I use it to extract valuable information about biological processes. There are a number of advantages to using RNA-seq compared to other omics approaches: it is fairly simple and inexpensive to generate large amounts of data, and a vast amount of data is already publicly available.
Introduction
In the ChIP-seq tutorial, we started analyzing data from two papers that each provided both ChIP-seq and RNA-seq data under similar conditions. The RNA-seq experiments are found under the PRJNA339762 and PRJNA429781 accessions. We are going to process the data, see if we can reproduce some of the results from the papers, and draw some interesting conclusions about the role of WRKY transcription factors in the plant response to immunity-inducing peptide hormones. In a later step, perhaps we can combine all the data together and run a small multi-omics comparative analysis.
Data preparation
Brief introduction
There are many different good ways to analyze short-read RNA-seq data, and they generally produce results of relatively similar quality. Most analysis workflows consist of the following steps:
- Quality control
- Trimming
- Alignment
- Sorting
- Quantification
If you are working with your own data, I recommend running a dedicated quality-control tool such as FastQC on the raw data. If you are working with public data, I find the report generated by fastp sufficient to notice problematic samples. For adapter trimming of FASTQ files I use fastp, which also does preprocessing, quality control, and deduplication. For alignment I like to use HISAT2. For sorting and filtering reads I use Sambamba. For quantification I use featureCounts from the Subread package. We are also going to use Samtools to compress the BAM files to CRAM files to save some disk space. Not all tools support the CRAM format, so you might need to convert back to BAM.
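If a downstream tool only accepts BAM, converting back is straightforward. Here is a minimal sketch using Samtools through Python's subprocess module (the file names are illustrative; -T must point to the same reference FASTA the CRAM was written against):

import subprocess

# Decode a CRAM back to BAM using the reference it was created with
subprocess.run(
    ["samtools", "view", "-b", "-T", "genome.fa",
     "-o", "sample.bam", "sample.cram"],
    check=True,
)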
Creating a conda environment
Let’s start with creating a conda environment. Create an environment.yml file that contains the following text:
name: rnaseq
channels:
- conda-forge
- bioconda
dependencies:
- python=3.9.18
- snakemake
- hisat2
- fastp
- sambamba
- subread
- samtools
- pandas
- axel
- gffread
- pip
- parallel
Next, run the command conda env create -f environment.yml. After it installs, run conda activate rnaseq. My script assumes that there is a ramdisk directory, so let's create one by running mkdir ramdisk. Optionally, we can mount the ramdisk directory as an actual ramdisk, so that all temporary files are kept in RAM instead of being written to the hard drive. In addition to being faster, this reduces the number of write cycles on the hard drive and helps prolong its life. The caveat is that you need a relatively large amount of available RAM, especially if you are dealing with a large genome or large FASTQ files. To mount a ramdisk, run the command sudo mount -t tmpfs -o size=250000m tmpfs ramdisk/.
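Nothing fails loudly if the mount is skipped or does not take effect, so an optional sanity check can be useful. A small sketch (not part of the workflow itself):

import os

# If ramdisk/ is a plain directory rather than a tmpfs mount, the workflow
# still runs, but the "temporary" files quietly end up on the hard drive.
if not os.path.ismount("ramdisk"):
    print("Warning: ramdisk/ is not a mount point; temporary files will be written to disk.")

We can now proceed to the next step and run the actual analysis.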
Creating the genome index files
We are going to run an analysis on Arabidopsis RNA-seq data, so we will need to build a HISAT2 index from the genome FASTA and convert the GFF3 annotation to GTF format:
# Build hisat2 genome index for aligning
hisat2-build -p 32 genome.fa genome
# Create a gtf file
gffread genome.gff3 -T -o genome.gtf
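Here, genome.fa and genome.gff3 are assumed to be the Arabidopsis genome FASTA and GFF3 annotation. The index basename has to match the GTF file name, because the workflow script below derives the index name from the .gtf file it finds in the working directory. As a quick sanity check, hisat2-build with basename genome should leave eight index files behind (a sketch; for very large genomes the extension is .ht2l instead):

import glob

# hisat2-build writes genome.1.ht2 ... genome.8.ht2 for an index named "genome"
index_files = sorted(glob.glob("genome.*.ht2"))
print(index_files)
assert len(index_files) == 8, "HISAT2 index looks incomplete"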
Run the analysis workflow
I am currently using a Python script to automate the analysis workflow. You can modify the file below for your own needs. What it does is iterate over all the directories in the current directory and check whether it can retrieve the RNA-seq metadata for the experiment matching each directory name. If it finds metadata, it generates a temporary process.tsv, first for the paired-end reads and then another one for the single-end reads. We need this because many, but not all, RNA-seq experiments have mixed single- and paired-end FASTQ files. There are also samples with both paired- and single-end FASTQ files, but I chose to ignore the single-end FASTQ files and just analyze the paired-end ones. If it is critical for you to get all reads, Samtools or Sambamba can be used to merge the aligned reads after the alignment is done.
The whole analysis is done by using GNU parallel to call a bash script that runs all the processing steps. The command parallel -j1 --colsep '\t' -a process.tsv bash workflow.paired.sh {1} {2} {3} {4} {5} {6} {7} is very convenient here because it iterates over the rows of the process.tsv file we generate, running the script with all the relevant arguments for each sample.
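For illustration, a paired-end row of process.tsv looks roughly like this (tab-separated; the run accession, URLs, and MD5 placeholders here are made up):

PRJNA339762	SRRXXXXXXX	<FTP URL of _1.fastq.gz>	<md5 of file 1>	<FTP URL of _2.fastq.gz>	<md5 of file 2>	genome

The seven columns map directly to the {1} through {7} placeholders, and thus to $1 through $7 inside workflow.paired.sh.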
Here’s the code for the workflow.rnaseq.py file:
import glob, os
import pandas as pd

# Collect experiment directories and derive the genome index name from the GTF file
experiment_list = []
for experiment in os.listdir("."):
    if "ramdisk" in experiment:
        continue
    # check whether the current object is a folder or not
    if os.path.isdir(os.path.join(os.path.abspath("."), experiment)):
        experiment_list.append(experiment)
    if ".gtf" in experiment:
        index_name = experiment.split(".gtf")[0]

for experiment in experiment_list:
    os.makedirs(os.path.join(experiment, "crams"), exist_ok=True)
    os.makedirs(os.path.join(experiment, "counts"), exist_ok=True)
    os.makedirs(os.path.join(experiment, "reports"), exist_ok=True)
    url = "https://www.ebi.ac.uk/ena/portal/api/filereport?accession={0}&result=read_run&fields=fastq_ftp,fastq_md5".format(experiment)
    # Open the URL with pandas and load it as a dataframe
    try:
        accession_df = pd.read_csv(url, sep="\t")
    except Exception:
        print("Did not work:", experiment)
        continue
    # Number of FASTQ files per run (handy for debugging, not used further)
    accession_df["num_fq"] = accession_df["fastq_ftp"].apply(lambda x: len(x.split(";")))
    # Skip samples that already have a CRAM file from a previous run
    done_list = []
    for fl in glob.glob(os.path.join(experiment, "crams", "*")):
        done_list.append(os.path.splitext(os.path.basename(fl))[0])
    paired_df = []
    single_df = []
    for key, val, md5 in accession_df[["run_accession", "fastq_ftp", "fastq_md5"]].values.tolist():
        splits = val.split(";")
        md5 = md5.split(";")
        # Sometimes there are paired + single fastq files, keep only paired
        if len(splits) == 3:
            splits = splits[1:]
            md5 = md5[1:]
        if key not in done_list:
            if len(splits) == 2:
                paired_df.append([experiment, key, splits[0], md5[0], splits[1], md5[1], index_name])
            if len(splits) == 1:
                single_df.append([experiment, key, splits[0], md5[0], index_name])
    print(f"Paired: {len(paired_df)} and Single: {len(single_df)}")
    if len(paired_df) > 0:
        paired_df = pd.DataFrame(paired_df)
        paired_df.to_csv("process.tsv", sep="\t", index=None, header=None)
        print("Starting to run {}. Done: {} Remaining: {}".format(experiment, len(done_list), len(paired_df) + len(single_df)))
        os.system("parallel -j1 --colsep '\\t' -a process.tsv bash workflow.paired.sh {1} {2} {3} {4} {5} {6} {7}")
    if len(single_df) > 0:
        single_df = pd.DataFrame(single_df)
        single_df.to_csv("process.tsv", sep="\t", index=None, header=None)
        print("Starting to run {}. Done: {} Remaining: {}".format(experiment, len(done_list), len(paired_df) + len(single_df)))
        os.system("parallel -j1 --colsep '\\t' -a process.tsv bash workflow.single.sh {1} {2} {3} {4} {5}")
Here’s the code for the workflow.paired.sh file:
# $1 = experiment accession
# $2 = sample name
# $3 = fastq file 1
# $4 = md5 file 1
# $5 = fastq file 2
# $6 = md5 file 2
# $7 = genome index
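# Download each FASTQ file, retrying until its MD5 checksum matches the value from ENA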
while true; do
axel --quiet -n 8 -o ramdisk $3
actual_md5=$(md5sum "ramdisk/$2_1.fastq.gz" | awk '{ print $1 }')
if [ "$actual_md5" == "$4" ]; then
break
else
rm ramdisk/$2_1.fastq.gz
fi
done
while true; do
axel --quiet -n 8 -o ramdisk $5
actual_md5=$(md5sum "ramdisk/$2_2.fastq.gz" | awk '{ print $1 }')
if [ "$actual_md5" == "$6" ]; then
break
else
rm ramdisk/$2_2.fastq.gz
fi
done
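# Trim adapters and quality-filter with fastp, then remove the compressed originals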
fastp --thread 15 --in1 ramdisk/$2_1.fastq.gz --out1 ramdisk/$2_1.fastq --in2 ramdisk/$2_2.fastq.gz --out2 ramdisk/$2_2.fastq --html $1/reports/fastp_$2.html
rm ramdisk/$2_1.fastq.gz ramdisk/$2_2.fastq.gz
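# Align with HISAT2, convert the SAM stream to BAM, drop unmapped reads, and coordinate-sort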
hisat2 -p 35 --min-intronlen 20 --max-intronlen 6000 -x $7 -1 ramdisk/$2_1.fastq -2 ramdisk/$2_2.fastq --summary-file $1/reports/hisat2_$2.txt | \
sambamba view -S -f bam -o /dev/stdout /dev/stdin | \
sambamba sort -F "not unmapped" --tmpdir="ramdisk/tmpmba" -t 35 -o ramdisk/$2.bam /dev/stdin
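# Count read pairs overlapping exon and CDS features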
featureCounts -p --countReadPairs -t exon,CDS -T 35 -a $7.gtf -o $1/counts/$2.counts ramdisk/$2.bam
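# Compress the sorted BAM to CRAM against the reference to save disk space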
samtools view -@ 35 -T $7.fa -C -o $1/crams/$2.cram ramdisk/$2.bam
rm ramdisk/$2*
Here’s the code for the workflow.single.sh file:
# $1 = experiment accession
# $2 = sample name
# $3 = fastq file 1
# $4 = md5 file 1
# $5 = genome index
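# Same as the paired-end workflow: retry the download until the MD5 checksum matches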
while true; do
axel --quiet -n 8 -o ramdisk $3
actual_md5=$(md5sum "ramdisk/$2.fastq.gz" | awk '{ print $1 }')
if [ "$actual_md5" == "$4" ]; then
break
else
rm ramdisk/$2.fastq.gz
fi
done
fastp --thread 15 --in1 ramdisk/$2.fastq.gz --out1 ramdisk/$2.fastq --html $1/reports/fastp_$2.html
rm ramdisk/$2.fastq.gz
hisat2 -p 35 --max-intronlen 6000 -x $5 -U ramdisk/$2.fastq --summary-file $1/reports/hisat2_$2.txt | \
sambamba view -S -f bam -o /dev/stdout /dev/stdin | \
sambamba sort -F "not unmapped" --tmpdir="ramdisk/tmpmba" -t 35 -o ramdisk/$2.bam /dev/stdin
featureCounts -t exon,CDS -T 35 -a $5.gtf -o $1/counts/$2.counts ramdisk/$2.bam
samtools view -@ 35 -T $5.fa -C -o $1/crams/$2.cram ramdisk/$2.bam
rm ramdisk/$2*
Running the automated workflow
We can start the analysis by running the following commands:
conda activate rnaseq
mkdir ramdisk
mkdir PRJNA339762
mkdir PRJNA429781
sudo mount -t tmpfs -o size=250000m tmpfs ramdisk # optional, but recommended
python workflow.rnaseq.py
- Note that the ramdisk directory is mounted as a tmpfs folder, which means that all intermediate files are kept in RAM rather than written to disk. The main reason I am using this is to reduce the number of write cycles, but it should also speed up the analysis because files are not written to and read from the disk. This might not work if you do not have the proper permissions, in which case you can simply skip the mount step.
- It is very important to verify the MD5 checksums of the FASTQ files downloaded from ENA. Oftentimes, especially for very large FASTQ files, the downloaded file does not match the provided MD5. Instead of throwing an error, the workflow will still happily generate all the output files, but they will be based on incomplete data, and there is no easy way to tell afterwards whether the output was correctly analyzed or not.
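Once the workflow finishes, each sample has its own .counts file, and these can be combined into a single gene-by-sample matrix for downstream analysis such as differential expression. A minimal sketch with pandas, assuming the directory layout created by the workflow:

import glob, os
import pandas as pd

counts = {}
for fl in glob.glob("PRJNA339762/counts/*.counts"):
    sample = os.path.basename(fl).replace(".counts", "")
    # featureCounts output: a comment line, then Geneid/Chr/Start/End/Strand/Length
    # columns followed by one counts column per BAM file
    df = pd.read_csv(fl, sep="\t", comment="#", index_col=0)
    counts[sample] = df.iloc[:, -1]

count_matrix = pd.DataFrame(counts)
count_matrix.to_csv("PRJNA339762.counts.tsv", sep="\t")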