This section includes notes and code for functional annotation of protein sequences using different databases.
Orthogroups
Proteinortho
I have been working on an analysis that uses OrthoFinder to predict orthogroups. Unfortunately, my impression is that, with default parameters, OrthoFinder tends to group non-orthologous proteins into the same orthogroup, which throws my analysis off. Looking into alternatives, I came across Proteinortho, which mentions the following in the discussion section: "Results generated by Proteinortho are consistently among the highest-performing tools in terms of precision and archived scores are generally close to the results of OMA. In terms of sensitivity, Proteinortho produces among the lowest scores compared to the other tools, highlighting a distinct trade-off."
At least on paper, improved precision and speed are exactly what I am looking for, so I decided to give it a try.
First, let’s create a conda environment to run Proteinortho:
conda create --name proteinortho python=3.10
# Activate the new environment so the packages are installed into it
conda activate proteinortho
conda install bioconda::proteinortho conda-forge::liblapacke conda-forge::openmp
Next, let’s create a proteinortho.sh bash script that automatically collects all the FASTA files in the current directory and runs the prediction on them:
#!/bin/bash
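# Check if there are any FASTA files in the current directory
# (nullglob makes an unmatched pattern expand to nothing instead of the literal "*.fasta")
shopt -s nullglob
fasta_files=(*.fasta)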
if [ "${#fasta_files[@]}" -eq 0 ]; then
echo "No FASTA files found in the current directory."
exit 1
fi
# Construct the command as an array so file names with spaces stay intact
command=(proteinortho -project=all "${fasta_files[@]}")
# Print the command for verification
echo "Running the following command:"
echo "${command[*]}"
# Execute the command
"${command[@]}"
Finally, we can run the bash script to start the analysis.
# Files must have the .fasta suffix. This renames any .fa files to .fasta
for file in *.fa; do [ -e "$file" ] && mv -- "$file" "${file%.fa}.fasta"; done
# Run the Proteinortho analysis on all FASTA files in the folder
bash proteinortho.sh
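Once the run finishes, a quick sanity check on the result table can be useful. The sketch below counts orthogroups that contain exactly one gene from every input proteome. It assumes the usual Proteinortho TSV layout (a header line starting with `#`, then the columns "# Species", "Genes" and "Alg.-Conn.", followed by one column per input FASTA file) and that the table is named all.proteinortho.tsv because of -project=all. The function name count_single_copy is only illustrative.

```shell
# Count orthogroups with exactly one gene in every input proteome.
# Assumption: columns 1-3 are "# Species", "Genes", "Alg.-Conn.",
# and every remaining column corresponds to one input FASTA file.
count_single_copy() {
    awk -F'\t' '
        /^#/ { next }                          # skip the header line
        { n_species = NF - 3 }                 # number of proteome columns
        $1 == n_species && $2 == n_species { count++ }
        END { print count + 0 }                # print 0 when nothing matched
    ' "$1"
}
```

For example, `count_single_copy all.proteinortho.tsv` prints a single number. Comparing it against the equivalent count from OrthoFinder is one rough way to see how the two tools differ on the same data.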
InterProScan
InterProScan is one of the most popular protein sequence annotation tools that incorporates a number of different databases.
Editing the interproscan.sh file
I am not sure if this is necessary, but I think that I sometimes run into memory issues when running InterProScan on large FASTA files (>100k sequences). If you have the resources available on your computer, it might help to change the last few lines of the interproscan.sh file. Below you can see the edited part, where I changed ParallelGCThreads to 72, -Xms to 10G and -Xmx to 100G.
"$JAVA" \
-XX:ParallelGCThreads=72 \
-Xms10G -Xmx100G \
$PROPERTY \
-jar interproscan-5.jar $@ -u $USER_DIR
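The values above only make sense on a machine with that many cores and that much RAM. As a minimal sketch for checking what is available before editing the file, assuming a Linux system with nproc and /proc/meminfo (the function name report_resources is hypothetical):

```shell
# Print the logical CPU count and total RAM in GiB (Linux only).
report_resources() {
    cpus=$(nproc)
    # MemTotal in /proc/meminfo is given in kB
    mem_gib=$(awk '/^MemTotal:/ { printf "%d", $2 / 1024 / 1024 }' /proc/meminfo)
    echo "cpus=${cpus} mem_gib=${mem_gib}"
}

report_resources
```

Keeping -Xmx comfortably below the reported memory and ParallelGCThreads at or below the reported CPU count is probably a sensible starting point, but treat that as a rule of thumb rather than a tested recommendation.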
Running InterProScan
# Make sure you have JAVA installed
sudo apt update
sudo apt install default-jre
java -version
# Create a folder to save the output files
mkdir output
# Run InterProScan on your genome's protein FASTA file. Information about args can be found at: https://interproscan-docs.readthedocs.io/en/latest/HowToRun.html
bash interproscan.sh -i clean/genomeID.fa --cpu 70 --output-dir output --formats TSV,GFF3 --iprlookup --goterms --pathways
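To get a first overview of the results, the TSV output can be summarised per member database. The sketch below relies on column 4 of the InterProScan TSV being the analysis name (for example Pfam or SMART). The output path output/genomeID.fa.tsv is an assumption based on the command above, and hits_per_analysis is just an illustrative name.

```shell
# Count the number of hits reported by each analysis (member database).
# Assumption: tab-separated InterProScan TSV, analysis name in column 4.
hits_per_analysis() {
    awk -F'\t' '
        { counts[$4]++ }
        END { for (a in counts) print a "\t" counts[a] }
    ' "$1" | sort
}
```

Running `hits_per_analysis output/genomeID.fa.tsv` prints one line per analysis with its hit count, which makes it easy to spot a database that returned nothing.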
LRRprofiler
LRRprofiler is a great tool for annotating plant leucine-rich repeat (LRR)-containing proteins.
Singularity has been replaced by Apptainer, which can be installed following the instructions here. Once you have Apptainer installed, you will need to download the SIF file for LRRprofiler (v0.2) from SyLabs, using this link here. Then, to run LRRprofiler on a FASTA file containing protein sequences, you simply need to run:
apptainer run cgottin_default_lrr_profiler.sif --in_proteome proteins.fa --name prots