This section includes code for processing and analyzing transcriptomics data.
Table of contents
Short-read RNA-seq data
Going with the workflow: Quantification
There are many different good ways to analyze short-read RNA-seq data, and they generally produce results of relatively similar quality. Most analysis workflows consist of the following steps:
- Quality control
- Trimming
- Alignment
- Sorting
- Quantification
If you are working with your own data, I recommend quality control on the raw data, such as FastQC. If you are working with public data, I find the report generated by fastp to be sufficient to notice problematic samples. For adapter trimming of FASTQ files I use fastp, which also does preprocessing, quality control, and deduplication. For alignment I like to use HISAT2. For sorting and filtering reads I use Sambamba. For quantification I use the FeatureCounts function from the Subread package. We are also going to use Samtools to compress the BAM file to a CRAN files to save some disk space. Not all tools support the CRAN format, so you might need to convert it back to BAM.
I am planning to add more details in the near future but for the time being I will include the code for running the analysis using my Docker image for RNA-seq analysis. Before we can run the complete Snakemake workflow, we will need to prepare some files.
First, we need to convert the genome FASTA file to a HISAT2 index file that will be for the alignment process. Add the FASTA file to the mount/
folder and run the following command: docker run -v ${PWD}/mount/:/mount --rm -it externelly/plantapp:rnaseq /bin/bash -c "hisat2-build -p 32 genome_name.fa genome_name"
Second, if you have a GFF3 file but not a GTF file, you will need to convert it to run featureCounts. Add the GTF file to the mount/
folder and run the following command: docker run -v ${PWD}/mount/:/mount --rm -it externelly/plantapp:rnaseq /bin/bash -c "gffread genome_name.gff3 -T -o genome_name.gtf"
- The Snakemake workflow assumes that the
genome_name
of the index and GTF files are the same
Finally, we can start the Snakemake workflow to start the analysis. You can download the Snakemake workflow from here. The following Snakemake workflow assumes you want to align an existing RNA-seq experiment. I prefer working with the data hosted on ENA, but it is possible to access the data from SRA, although this Snakemake workflow only supports the ENA experiment accession ID. Once you have it, move the index, gtf and Snakemake files to the /mount folder and run the following command: docker run -v ${PWD}/mount/:/mount --mount type=tmpfs,destination=/mount/ramdisk,tmpfs-size=96G --rm externelly/plantapp:rnaseq /bin/bash -c "snakemake -F -p -j1 --snakefile ena.accession.smk --config sra=PRJNA idx=genome_name
- Note that
/mount/ramdisk
is mounted as a tmpfs folder, which means that all intermediate files are written to the RAM memory and not to disk. The main reason I am using this is to reduce the amount of write-cycles, but it should also speed up the analysis because files are not written and read from the disk. This might not work if you do not have the proper permissions, so you might need remove the--mount
process. - Note that the Snakemake workflow is set to run using 31 threads. If you are using the Docker container, it is best to keep one thread free, otherwise you mount encounter a
Bus error
. Otherwise, you can change the number and the maximum number of threads. - Note that if you do mount the
/mount/ramdisk
using Docker, at the end of the analyis you might update the folder permission to be able to move or edit the result files. On linux you can use:sudo chmod -R 777 mount/
.