This section includes notes and code for functional annotation of protein sequences using different databases.

Table of contents
  1. InterProScan
    1. Editing the interproscan.sh file
    2. Running InterProScan
  2. LRRprofiler

InterProScan

Paper GitHub Webserver

InterProScan is one of the most popular protein sequence annotation tools. It scans sequences against the signatures of the InterPro member databases in a single run.

Editing the interproscan.sh file

I am not sure whether this step is necessary, but I have occasionally run into what look like memory issues when running InterProScan on large FASTA files (>100k sequences). If you have the resources available on your machine, it may help to change the last few lines of interproscan.sh. The edited part is shown below: I set ParallelGCThreads to 72, -Xms (the initial Java heap size) to 10G, and -Xmx (the maximum Java heap size) to 100G.

"$JAVA" \
 -XX:ParallelGCThreads=72 \
 -Xms10G -Xmx100G \
 $PROPERTY \
 -jar interproscan-5.jar $@ -u $USER_DIR
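The values above were chosen for one particular machine. As a rough sketch, the snippet below reads the core count and total RAM on a Linux system and prints settings scaled to them. The function name suggest_jvm_opts and the 80% heap fraction are assumptions made for illustration, not InterProScan recommendations.

```shell
#!/usr/bin/env bash
# Sketch: suggest JVM options from the resources of this machine (Linux only).
# suggest_jvm_opts and the 80% heap fraction are illustrative assumptions.

suggest_jvm_opts() {
    local cores=$1 mem_gb=$2
    # Leave roughly 20% of RAM for the operating system and other processes.
    local xmx=$(( mem_gb * 80 / 100 ))
    if (( xmx < 1 )); then
        xmx=1
    fi
    echo "-XX:ParallelGCThreads=${cores} -Xmx${xmx}G"
}

# /proc/meminfo reports MemTotal in kB; convert it to whole GB.
if [ -r /proc/meminfo ]; then
    cores=$(nproc)
    mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
    suggest_jvm_opts "$cores" "$mem_gb"
fi
```

The printed options can then be copied by hand into the last lines of interproscan.sh.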

Running InterProScan

# Make sure you have Java installed (recent InterProScan 5 releases need Java 11 or newer)
sudo apt update
sudo apt install default-jre
java -version

# Create a folder to save the output files
mkdir output

# Run InterProScan on the protein FASTA file of your genome. Information about the arguments can be found at: https://interproscan-docs.readthedocs.io/en/latest/HowToRun.html
bash interproscan.sh -i clean/genomeID.fa --cpu 70 --output-dir output --formats TSV,GFF3 --iprlookup --goterms --pathways
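If a very large FASTA file still causes memory problems, one possible workaround is to split it into smaller files and run InterProScan on each in turn. The sketch below does this with awk. The function name split_fasta, the chunk size of 20000 sequences, and the chunks folder are all hypothetical choices, and I have not benchmarked whether this helps.

```shell
#!/usr/bin/env bash
# Sketch: split a protein FASTA into chunks of at most N sequences.
# split_fasta, the chunk size and the chunks/ folder are illustrative names.

split_fasta() {
    local fasta=$1 per_chunk=$2 outdir=$3
    mkdir -p "$outdir"
    awk -v n="$per_chunk" -v dir="$outdir" '
        /^>/ {
            # Start a new chunk file every n headers.
            if (seqs % n == 0) {
                if (out) close(out)
                out = sprintf("%s/chunk_%03d.fa", dir, seqs / n + 1)
            }
            seqs++
        }
        # Write headers and sequence lines to the current chunk.
        out { print > out }
    ' "$fasta"
}

# Example usage (commented out so the snippet has no side effects):
# split_fasta clean/genomeID.fa 20000 chunks
# for f in chunks/chunk_*.fa; do
#     bash interproscan.sh -i "$f" --cpu 70 --output-dir output --formats TSV,GFF3 --iprlookup --goterms --pathways
# done
```

The loop reuses the same arguments as the single-file command, so each chunk is annotated with identical settings.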

LRRprofiler

Paper GitHub SIF

LRRprofiler is a great tool for annotating plant leucine-rich repeat (LRR)-containing proteins.

Singularity has been succeeded by Apptainer, which can be installed following the instructions here. Once you have Apptainer installed, download the SIF file for LRRprofiler (v0.2) from Sylabs using the link here. Then, to run LRRprofiler on a FASTA file containing protein sequences, run:

apptainer run cgottin_default_lrr_profiler.sif --in_proteome proteins.fa --name prots
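To annotate several proteomes, the same command can be wrapped in a loop. This is a minimal sketch that assumes the FASTA files sit in a hypothetical proteomes folder and end in .fa. The helper run_name is my own, and it only derives a distinct --name for each run from the file name.

```shell
#!/usr/bin/env bash
# Sketch: run LRRprofiler on every FASTA file in a folder.
# The proteomes/ folder and the run_name helper are illustrative assumptions.

sif=cgottin_default_lrr_profiler.sif

# Derive a run name from a FASTA path, e.g. proteomes/speciesA.fa -> speciesA
run_name() {
    local base
    base=$(basename "$1")
    echo "${base%.*}"
}

for fasta in proteomes/*.fa; do
    # Skip the literal pattern when the folder is empty or missing.
    [ -e "$fasta" ] || continue
    apptainer run "$sif" --in_proteome "$fasta" --name "$(run_name "$fasta")"
done
```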