Accelerating de novo SINE annotation in plant and animal genomes


liaoherui/AnnoSINE_v2

 
 


AnnoSINE_v2

SINE Annotation Tool for Plant/Animal Genomes

Introduction

AnnoSINE_v2 is a SINE annotation tool for plant and animal genomes. It is designed to efficiently generate high-quality, non-redundant SINE libraries for genome annotation. As a new version of AnnoSINE, it follows the same workflow as AnnoSINE (shown below).

Prerequisites

To use AnnoSINE_v2, you need to install the tools listed below.

Installation

Installation via GitHub.

git clone https://github.com/liaoherui/AnnoSINE_v2.git
cd AnnoSINE_v2

# conda
conda env create -f AnnoSINE.conda.yaml

conda activate AnnoSINE

## change the permission of IRF
chmod 755 irf308.linux.exe

Installation via Bioconda.

mamba install annosine2

or

conda install -c bioconda annosine2

Note that the command name changes if you install AnnoSINE_v2 via bioconda/pip:

Command (not bioconda/pip)      Command (bioconda/pip)
python AnnoSINE_v2.py -h        AnnoSINE_v2 -h

Usage

conda activate AnnoSINE
python AnnoSINE_v2 [options] <mode> <input_filename> <output_filename>

If the program stops at a certain step or produces no output, the default filtering cutoffs may be too strict. Try relaxing them with the command below:

python AnnoSINE_v2 [options] <mode> -e 0.01 -minc 1 -s 150 <input_filename> <output_filename>
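
When several genomes are processed from a script, the two command forms above can be assembled programmatically. The following is a minimal sketch, not part of AnnoSINE_v2: the helper names `build_command` and `run` are hypothetical, the script name is taken from the usage line, and the relaxed values are the ones shown in the command above.

```python
import subprocess

# Relaxed cutoffs copied from the command above. The flags are the documented
# ones; this helper module itself is a hypothetical convenience wrapper.
RELAXED_OPTIONS = ["-e", "0.01", "-minc", "1", "-s", "150"]


def build_command(mode, genome, output_dir, relaxed=False, script="AnnoSINE_v2"):
    """Return the argument list for one AnnoSINE_v2 run."""
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    command = ["python", script, str(mode)]
    if relaxed:
        command += RELAXED_OPTIONS
    return command + [genome, output_dir]


def run(command):
    """Launch the command and raise if it exits with a non-zero status."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # The paths below are placeholders, not files shipped with the tool.
    print(" ".join(build_command(3, "genome.fasta", "Output_Files", relaxed=True)))
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.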

Argument

positional arguments:
  mode                  [1 | 2 | 3]
                        Choose the running mode of the program.
                                1--Homology-based method;
                                2--Structure-based method;
                                3--Hybrid of homology-based and structure-based method.
  input_filename        input genome assembly path
  output_filename       output files path

optional arguments:
  -h, --help                   show this help message and exit
  -e, --hmmer_evalue           Expectation value threshold for saving hits of homology search (default: 1e-10)
  -v, --blast_evalue           Expectation value threshold for sequences alignment search (default: 1e-10)
  -l, --length_factor          Threshold of the local alignment length relative to the BLAST query length (default: 0.3)
  -c, --copy_number_factor     Threshold of the copy number that determines the SINE boundary (default: 0.15)
  -s, --shift                  Maximum threshold of the boundary shift (default: 80)
  -g, --gap                    Maximum threshold of the truncated gap (default: 10)
  -minc, --copy_number         Minimum threshold of the copy number for each element (default: 20)
  -numa, --num_alignments      --num_alignments value for blast alignments (default: 50000)
  -maxb, --base_copy_number    Maximum threshold of copy number for the first and last base (default: 1)
  -a, --animal                 If set to 1, then Hmmer will search SINE using the animal hmm files from Dfam. If set to 2, then Hmmer will search SINE using both the plant and animal hmm files. (default: 0)
  -b, --boundary               Output SINE seed boundaries based on TSD or MSA (default: msa)
  -f, --figure                 Output the SINE seed MSA figures and copy number profiles (y/n). Please note that this step may take a long time to process. (default: n)
  -temd, --temp_dir            The temp dir used by paf2blast6 script. If not set, will use /tmp folder automatically.
  -auto, --automatically_continue If set to 1, then the program will skip finished steps and continue unfinished steps for a previously processed output dir. (default: 0)
  -r, --non_redundant          Annotate SINE in the whole genome based on the non-redundant library (y/n) (default: y)
  -t, --threads                Threads for each tool in AnnoSINE (default: 36)
  -irf, --irf_path             Path to the irf program (default: '')
  -rpm, --RepeatMasker_enable  If set to 0, then will not run RepeatMasker (Step 8 for the code). (default: 1)
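
Before resuming a run with `-auto 1`, it can be useful to see which step files already exist in an output directory. The sketch below only checks for the step file names listed under Outputs and Intermediate Files in this README. It assumes those files sit directly in the output directory, and it does not reproduce how AnnoSINE_v2 itself decides that a step is finished. The function name `completed_steps` is hypothetical.

```python
from pathlib import Path

# File names copied from the Outputs and Intermediate Files lists of this README.
# Assumption: they are written directly into the output directory.
STEP_FILES = [
    ("Step 1", "Step1_extend_tsd_input.fa"),
    ("Step 2", "Step2_tsd.txt"),
    ("Step 3", "Step3_blast_output.out"),
    ("Step 4", "Step4_rna_output.fasta"),
    ("Step 5", "Step5_trf_output.fasta"),
    ("Step 6", "Step6_irf_output.fasta"),
    ("Step 7", "Step7_cluster_output.fasta"),
]


def completed_steps(output_dir):
    """Return the labels of steps whose result file exists and is non-empty."""
    root = Path(output_dir)
    done = []
    for label, name in STEP_FILES:
        path = root / name
        if path.is_file() and path.stat().st_size > 0:
            done.append(label)
    return done


if __name__ == "__main__":
    # "Output_Files" is a placeholder directory name.
    print(completed_steps("Output_Files"))
```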

Inputs

Genome sequence (FASTA format).

Outputs

  • Redundant SINE library: $ Step7_cluster_output.fasta
  • Non-redundant SINE library with serial number: $ Seed_SINE.fa
  • Whole-genome SINE annotation: $ Input_genome.fasta.out. This file contains high-similarity SINE annotations.
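
A quick sanity check on either library file is to count its sequences and look at their lengths. The following is a minimal sketch using only the standard library. It assumes the library is a plain FASTA file, and the function names `read_fasta` and `summarize_library` are hypothetical, not part of AnnoSINE_v2.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_library(path):
    """Return the sequence count and min/max/mean length of a FASTA library."""
    lengths = [len(seq) for _, seq in read_fasta(path)]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }
```

For example, `summarize_library("Output_Files/Seed_SINE.fa")` would summarize the non-redundant library, assuming the run wrote it to a directory named `Output_Files`.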

Intermediate Files

  • SINE candidate information predicted by homology search: $ ../Family_Seq/Family_Name/Family_Name.out (requires mode 1 or 3)
  • SINE candidate sequences predicted by structure search: $ ../Input_Files/Input_genome-matches.fasta (requires mode 2 or 3)
  • Extended candidate sequences for TSD search: $ Step1_extend_tsd_input.fa
  • TSD identification outputs: $ Step2_tsd.txt
  • Extended MSA input sequences flanked with TSDs: $ Step2_extend_blast_input.fa
  • MSA output: $ Step3_blast_output.out
  • Intermediate sequences with MSA quality examination: $ Step3_blast_process_output.fa
  • SINE candidate sequences after MSA quality examination: $ Step4_rna_input.fasta
  • Output of SINE candidates BLAST search against the RNA database: $ Step4_rna_output.out
  • Classified SINE candidates after RNA examination: $ Step4_rna_output.fasta
  • TRF output: $ Step4_rna_output.fasta.2.5.7.80.10.10.2000.dat
  • SINE candidates after removing elements consisting of tandem repeats: $ Step5_trf_output.fasta
  • SINE candidate sequences after extension: $ Step6_irf_input.fasta
  • IRF output: $ Step6_irf_input.fasta.2.3.5.80.10.20.500000.10000.dat
  • SINE candidates after removing elements flanked with inverted repeats: $ Step6_irf_output.fasta
  • CD-HIT output: $ Step7_cluster_output.fasta.clstr
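
The CD-HIT cluster file shows how the SINE candidates were grouped. The sketch below reads a `.clstr` file into a list of clusters, assuming the standard CD-HIT layout: a `>Cluster N` line, followed by member lines such as `0	150nt, >name... *`, where `*` marks the representative. The function name `parse_clstr` is hypothetical and not part of AnnoSINE_v2.

```python
import re

# Member lines look like "0<TAB>150nt, >seq_name... *" in standard CD-HIT output.
MEMBER = re.compile(r">(.+?)\.\.\.")


def parse_clstr(path):
    """Return a list of clusters, each with its members and representative."""
    clusters = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">Cluster"):
                clusters.append({"representative": None, "members": []})
            elif line and clusters:
                match = MEMBER.search(line)
                if match is None:
                    continue  # skip lines that do not follow the assumed layout
                name = match.group(1)
                clusters[-1]["members"].append(name)
                if line.endswith("*"):
                    clusters[-1]["representative"] = name
    return clusters
```

Sorting the result by `len(cluster["members"])` gives a quick view of which candidate families absorbed the most sequences.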

Testing

You can test AnnoSINE_v2 on one chromosome of Arabidopsis thaliana (the run takes about 6 minutes).

cd ./AnnoSINE_v2/Testing
python ../bin/AnnoSINE_v2 -t 20 3 A.thaliana_Chr4.fasta ./Output_Files

Results of the AnnoSINE_v2 run on the test data are saved in Output_Files.

Citations

Liao, H., Ou, S. & Sun, Y. Accelerating de novo SINE annotation in plant and animal genomes. bioRxiv (2024). https://doi.org/10.1101/2024.03.01.582874
