This section includes code for processing and analyzing transcriptomics data.
RNA-seq data analysis
In this tutorial, I will walk through how I analyze short-read RNA-seq data and how I use it to extract valuable information about biological processes. There are a number of advantages to using RNA-seq compared to other omics approaches: it is fairly simple and inexpensive to generate large amounts of data, and a vast amount of data is already publicly available.
Introduction
In the ChIP-seq tutorial, we started analyzing data from two papers that each provided both ChIP-seq and RNA-seq data under similar conditions. The RNA-seq experiments are found under the PRJNA339762 and PRJNA429781 accessions. We are going to process the data, see if we can reproduce some of the results from the papers, and draw some interesting conclusions about the role of WRKY transcription factors in the plant response to immunity-inducing peptide hormones. In a later step, perhaps we can combine all the data together and run a small multi-omics comparative analysis.
Data preparation
Brief introduction
There are many different good ways to analyze short-read RNA-seq data, and they generally produce results of relatively similar quality. Most analysis workflows consist of the following steps:
- Quality control
- Trimming
- Alignment
- Sorting
- Quantification
If you are working with your own data, I recommend running a dedicated quality-control tool such as FastQC on the raw data. If you are working with public data, I find the report generated by fastp sufficient to notice problematic samples. For adapter trimming of FASTQ files I use fastp, which also does preprocessing, quality control, and deduplication. For alignment I like to use HISAT2. For sorting and filtering reads I use Sambamba. For quantification I use featureCounts from the Subread package. We are also going to use Samtools to compress the BAM files to CRAM files to save some disk space. Not all tools support the CRAM format, so you might need to convert back to BAM.
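If a downstream tool only accepts BAM, converting back is straightforward. Here is a minimal sketch using Samtools through Python's subprocess module (the file names are illustrative; -T must point to the same reference FASTA the CRAM was written against):

import subprocess

# Decode a CRAM back to BAM using the reference it was created with
subprocess.run(
    ["samtools", "view", "-b", "-T", "genome.fa",
     "-o", "sample.bam", "sample.cram"],
    check=True,
)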
Creating a conda environment
Let’s start with creating a conda environment. Create an environment.yml file that contains the following text:
name: rnaseq
channels:
- conda-forge
- bioconda
dependencies:
- python=3.9.18
- snakemake
- hisat2
- fastp
- sambamba
- subread
- samtools
- pandas
- axel
- gffread
- pip
- parallel
Next, run the command conda env create -f environment.yml. After it installs, run conda activate rnaseq. My script assumes that there is a ramdisk directory, so let's create one by running mkdir ramdisk. Optionally, we can mount the ramdisk directory as an actual ramdisk, so that all temporary files are kept in RAM instead of being written to the hard drive. In addition to being faster, this reduces the number of write cycles on the hard drive and helps prolong its life. The caveat is that you need a relatively large amount of available RAM, especially if you are dealing with a large genome or large FASTQ files. To mount a ramdisk, run the command sudo mount -t tmpfs -o size=250000m tmpfs ramdisk/.
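Nothing fails loudly if the mount is skipped or does not take effect, so an optional sanity check can be useful. A small sketch (not part of the workflow itself):

import os

# If ramdisk/ is a plain directory rather than a tmpfs mount, the workflow
# still runs, but the "temporary" files quietly end up on the hard drive.
if not os.path.ismount("ramdisk"):
    print("Warning: ramdisk/ is not a mount point; temporary files will be written to disk.")

We can now proceed to the next step and run the actual analysis.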
Creating the genome index files
We are going to run an analysis on Arabidopsis RNA-seq data, so we will need to build a HISAT2 index from the genome FASTA and convert the GFF3 annotation to GTF format:
# Build hisat2 genome index for aligning
hisat2-build -p 32 genome.fa genome
# Create a gtf file
gffread genome.gff3 -T -o genome.gtf
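Here, genome.fa and genome.gff3 are assumed to be the Arabidopsis genome FASTA and GFF3 annotation. The index basename has to match the GTF file name, because the workflow script below derives the index name from the .gtf file it finds in the working directory. As a quick sanity check, hisat2-build with basename genome should leave eight index files behind (a sketch; for very large genomes the extension is .ht2l instead):

import glob

# hisat2-build writes genome.1.ht2 ... genome.8.ht2 for an index named "genome"
index_files = sorted(glob.glob("genome.*.ht2"))
print(index_files)
assert len(index_files) == 8, "HISAT2 index looks incomplete"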
Run the analysis workflow
I am currently using a Python script to automate the analysis workflow. You can modify the file below for your own needs. What it does is iterate over all the directories in the current directory and check whether it can retrieve the RNA-seq metadata for the experiment matching each directory name. If it finds metadata, it generates a temporary process.tsv, first for the paired-end reads and then another one for the single-end reads. We need this because many, but not all, RNA-seq experiments have mixed single- and paired-end FASTQ files. There are also samples with both paired- and single-end FASTQ files, but I chose to ignore the single-end FASTQ files and just analyze the paired-end ones. If it is critical for you to get all reads, Samtools or Sambamba can be used to merge the aligned reads after the alignment is done.
The whole analysis is done by using GNU parallel to call a bash script that runs all the processing steps. The command parallel -j1 --colsep '\t' -a process.tsv bash workflow.paired.sh {1} {2} {3} {4} {5} {6} {7} is very convenient here because it iterates over the rows of the process.tsv file we generate, running the script with all the relevant arguments for each sample.
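For illustration, a paired-end row of process.tsv looks roughly like this (tab-separated; the run accession, URLs, and MD5 placeholders here are made up):

PRJNA339762	SRRXXXXXXX	<FTP URL of _1.fastq.gz>	<md5 of file 1>	<FTP URL of _2.fastq.gz>	<md5 of file 2>	genome

The seven columns map directly to the {1} through {7} placeholders, and thus to $1 through $7 inside workflow.paired.sh.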
Here’s the code for the workflow.rnaseq.py file:
import glob, os
import pandas as pd

# Collect experiment directories and derive the genome index name from the GTF file
experiment_list = []
for experiment in os.listdir("."):
    if "ramdisk" in experiment:
        continue
    # check whether the current object is a folder or not
    if os.path.isdir(os.path.join(os.path.abspath("."), experiment)):
        experiment_list.append(experiment)
    if ".gtf" in experiment:
        index_name = experiment.split(".gtf")[0]

for experiment in experiment_list:
    os.makedirs(os.path.join(experiment, "crams"), exist_ok=True)
    os.makedirs(os.path.join(experiment, "counts"), exist_ok=True)
    os.makedirs(os.path.join(experiment, "reports"), exist_ok=True)
    url = "https://www.ebi.ac.uk/ena/portal/api/filereport?accession={0}&result=read_run&fields=fastq_ftp,fastq_md5".format(experiment)
    # Open the URL with pandas and load it as a dataframe
    try:
        accession_df = pd.read_csv(url, sep="\t")
    except Exception:
        print("Did not work:", experiment)
        continue
    # Number of FASTQ files per run (handy for debugging, not used further)
    accession_df["num_fq"] = accession_df["fastq_ftp"].apply(lambda x: len(x.split(";")))
    # Skip samples that already have a CRAM file from a previous run
    done_list = []
    for fl in glob.glob(os.path.join(experiment, "crams", "*")):
        done_list.append(os.path.splitext(os.path.basename(fl))[0])
    paired_df = []
    single_df = []
    for key, val, md5 in accession_df[["run_accession", "fastq_ftp", "fastq_md5"]].values.tolist():
        splits = val.split(";")
        md5 = md5.split(";")
        # Sometimes there are paired + single fastq files, keep only paired
        if len(splits) == 3:
            splits = splits[1:]
            md5 = md5[1:]
        if key not in done_list:
            if len(splits) == 2:
                paired_df.append([experiment, key, splits[0], md5[0], splits[1], md5[1], index_name])
            if len(splits) == 1:
                single_df.append([experiment, key, splits[0], md5[0], index_name])
    print(f"Paired: {len(paired_df)} and Single: {len(single_df)}")
    if len(paired_df) > 0:
        paired_df = pd.DataFrame(paired_df)
        paired_df.to_csv("process.tsv", sep="\t", index=None, header=None)
        print("Starting to run {}. Done: {} Remaining: {}".format(experiment, len(done_list), len(paired_df) + len(single_df)))
        os.system("parallel -j1 --colsep '\\t' -a process.tsv bash workflow.paired.sh {1} {2} {3} {4} {5} {6} {7}")
    if len(single_df) > 0:
        single_df = pd.DataFrame(single_df)
        single_df.to_csv("process.tsv", sep="\t", index=None, header=None)
        print("Starting to run {}. Done: {} Remaining: {}".format(experiment, len(done_list), len(paired_df) + len(single_df)))
        os.system("parallel -j1 --colsep '\\t' -a process.tsv bash workflow.single.sh {1} {2} {3} {4} {5}")
Here’s the code for the workflow.paired.sh file:
# $1 = experiment accession
# $2 = sample name
# $3 = fastq file 1
# $4 = md5 file 1
# $5 = fastq file 2
# $6 = md5 file 2
# $7 = genome index
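# Download each FASTQ file, retrying until its MD5 checksum matches the value from ENA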
while true; do
axel --quiet -n 8 -o ramdisk $3
actual_md5=$(md5sum "ramdisk/$2_1.fastq.gz" | awk '{ print $1 }')
if [ "$actual_md5" == "$4" ]; then
break
else
rm ramdisk/$2_1.fastq.gz
fi
done
while true; do
axel --quiet -n 8 -o ramdisk $5
actual_md5=$(md5sum "ramdisk/$2_2.fastq.gz" | awk '{ print $1 }')
if [ "$actual_md5" == "$6" ]; then
break
else
rm ramdisk/$2_2.fastq.gz
fi
done
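# Trim adapters and quality-filter with fastp, then remove the compressed originals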
fastp --thread 15 --in1 ramdisk/$2_1.fastq.gz --out1 ramdisk/$2_1.fastq --in2 ramdisk/$2_2.fastq.gz --out2 ramdisk/$2_2.fastq --html $1/reports/fastp_$2.html
rm ramdisk/$2_1.fastq.gz ramdisk/$2_2.fastq.gz
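# Align with HISAT2, convert the SAM stream to BAM, drop unmapped reads, and coordinate-sort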
hisat2 -p 35 --min-intronlen 20 --max-intronlen 6000 -x $7 -1 ramdisk/$2_1.fastq -2 ramdisk/$2_2.fastq --summary-file $1/reports/hisat2_$2.txt | \
sambamba view -S -f bam -o /dev/stdout /dev/stdin | \
sambamba sort -F "not unmapped" --tmpdir="ramdisk/tmpmba" -t 35 -o ramdisk/$2.bam /dev/stdin
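# Count read pairs overlapping exon and CDS features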
featureCounts -p --countReadPairs -t exon,CDS -T 35 -a $7.gtf -o $1/counts/$2.counts ramdisk/$2.bam
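# Compress the sorted BAM to CRAM against the reference to save disk space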
samtools view -@ 35 -T $7.fa -C -o $1/crams/$2.cram ramdisk/$2.bam
rm ramdisk/$2*
Here’s the code for the workflow.single.sh file:
# $1 = experiment accession
# $2 = sample name
# $3 = fastq file 1
# $4 = md5 file 1
# $5 = genome index
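# Same as the paired-end workflow: retry the download until the MD5 checksum matches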
while true; do
axel --quiet -n 8 -o ramdisk $3
actual_md5=$(md5sum "ramdisk/$2.fastq.gz" | awk '{ print $1 }')
if [ "$actual_md5" == "$4" ]; then
break
else
rm ramdisk/$2.fastq.gz
fi
done
fastp --thread 15 --in1 ramdisk/$2.fastq.gz --out1 ramdisk/$2.fastq --html $1/reports/fastp_$2.html
rm ramdisk/$2.fastq.gz
hisat2 -p 35 --max-intronlen 6000 -x $5 -U ramdisk/$2.fastq --summary-file $1/reports/hisat2_$2.txt | \
sambamba view -S -f bam -o /dev/stdout /dev/stdin | \
sambamba sort -F "not unmapped" --tmpdir="ramdisk/tmpmba" -t 35 -o ramdisk/$2.bam /dev/stdin
featureCounts -t exon,CDS -T 35 -a $5.gtf -o $1/counts/$2.counts ramdisk/$2.bam
samtools view -@ 35 -T $5.fa -C -o $1/crams/$2.cram ramdisk/$2.bam
rm ramdisk/$2*
Running the automated workflow
We can start the analysis by running the following commands:
conda activate rnaseq
mkdir ramdisk
mkdir PRJNA339762
mkdir PRJNA429781
sudo mount -t tmpfs -o size=250000m tmpfs ramdisk # optional, but recommended
python workflow.rnaseq.py
- Note that the ramdisk directory is mounted as a tmpfs folder, which means that all intermediate files are kept in RAM rather than written to disk. The main reason I am using this is to reduce the number of write cycles, but it should also speed up the analysis because files are not written to and read from the disk. This might not work if you do not have the proper permissions, in which case you can simply skip the mount step.
- It is very important to verify the MD5 checksums of the FASTQ files downloaded from ENA. Oftentimes, especially for very large FASTQ files, the downloaded file does not match the provided MD5. Instead of throwing an error, the workflow will still happily generate all the output files, but they will be based on incomplete data, and there is no easy way to tell afterwards whether the output was correctly analyzed or not.
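Once the workflow finishes, each sample has its own .counts file, and these can be combined into a single gene-by-sample matrix for downstream analysis such as differential expression. A minimal sketch with pandas, assuming the directory layout created by the workflow:

import glob, os
import pandas as pd

counts = {}
for fl in glob.glob("PRJNA339762/counts/*.counts"):
    sample = os.path.basename(fl).replace(".counts", "")
    # featureCounts output: a comment line, then Geneid/Chr/Start/End/Strand/Length
    # columns followed by one counts column per BAM file
    df = pd.read_csv(fl, sep="\t", comment="#", index_col=0)
    counts[sample] = df.iloc[:, -1]

count_matrix = pd.DataFrame(counts)
count_matrix.to_csv("PRJNA339762.counts.tsv", sep="\t")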