SINE annotation tool for plant genomes

oushujun/AnnoSINE_v2

 
 

install with bioconda

AnnoSINE_v2

SINE Annotation Tool for Plant/Animal Genomes

Introduction

AnnoSINE_v2 is a SINE annotation tool for plant/animal genomes. The program is designed to generate high-quality non-redundant SINE libraries for genome annotation. It uses the manually curated SINE library in the Oryza sativa genome to benchmark the annotation performance.

Prerequisites

To use AnnoSINE_v2, you need to install the tools listed below.

Installation

# pip
cd ./AnnoSINE/bin
pip3 install -r requirements.txt

# conda
conda env create -f AnnoSINE.conda.yaml

# change the permission of IRF
chmod 755 irf308.linux.exe

Usage

conda activate AnnoSINE
python3 AnnoSINE_v2.py [options] <mode> <input_filename> <output_filename>

If the program stops at a certain step or produces no output, the filtering cutoffs may be too strict. Try rerunning with relaxed cutoffs:

python3 AnnoSINE.py [options] <mode> -e 0.01 -minc 1 -s 150 <input_filename> <output_filename>
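For scripted runs, the two invocations above can be assembled programmatically. The following is a minimal sketch, not part of AnnoSINE itself: the helper name `build_annosine_command` and its parameters are hypothetical, the script path is passed in by the caller, and only the flags shown in this README (`-e`, `-minc`, `-s`, `-t`) are used. Options are placed before the mode here, which is an assumption based on the Testing command.

```python
from typing import List, Optional

# Relaxed cutoffs copied from the retry command above.
RELAXED_OPTIONS = ["-e", "0.01", "-minc", "1", "-s", "150"]


def build_annosine_command(
    script: str,
    mode: int,
    genome: str,
    out_dir: str,
    relaxed: bool = False,
    threads: Optional[int] = None,
) -> List[str]:
    """Return an argument list suitable for subprocess.run (hypothetical helper)."""
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2, or 3")
    cmd = ["python3", script]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if relaxed:
        cmd += RELAXED_OPTIONS
    # Positional arguments: mode, input genome, output directory.
    cmd += [str(mode), genome, out_dir]
    return cmd
```

A caller could first run the default command and, if no output is produced, pass the list returned with `relaxed=True` to `subprocess.run`.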

Argument

positional arguments:
  mode                  [1 | 2 | 3]
                        Choose the running mode of the program.
                                1--Homology-based method;
                                2--Structure-based method;
                                3--Hybrid of homology-based and structure-based method.
  input_filename        input genome assembly path
  output_filename       output files path

optional arguments:
  -h, --help                   show this help message and exit
  -e, --hmmer_evalue           Expectation value threshold for saving hits of homology search (default: 1e-10)
  -v, --blast_evalue           Expectation value threshold for sequences alignment search (default: 1e-10)
  -l, --length_factor          Threshold of the local alignment length relative to the BLAST query length (default: 0.3)
  -c, --copy_number_factor     Threshold of the copy number that determines the SINE boundary (default: 0.15)
  -s, --shift                  Maximum threshold of the boundary shift (default: 80)
  -g, --gap                    Maximum threshold of the truncated gap (default: 10)
  -minc, --copy_number         Minimum threshold of the copy number for each element (default: 20)
  -a, --animal                 If set to 1, then Hmmer will search SINE using the animal hmm files from Dfam. (default: 0)
  -b, --boundary               Output SINE seed boundaries based on TSD or MSA (default: msa)
  -f, --figure                 Output the SINE seed MSA figures and copy number profiles (y/n). Please note that this step may take a long time to process. (default: n)
  -auto, --automatically_continue If set to 1, then the program will skip finished steps and continue unfinished steps for a previously processed output dir. (default: 0)
  -r, --non_redundant          Annotate SINE in the whole genome based on the non-redundant library (y/n) (default: y)
  -t, --threads                Threads for each tool in AnnoSINE (default: 36)
  -irf, --irf_path             Path to the irf program (default: '')
  -rpm, --RepeatMasker_enable  If set to 0, then will not run RepeatMasker (Step 8 for the code). (default: 1)

Inputs

Genome sequence (FASTA format).

Outputs

  • Redundant SINE library: $ Step7_cluster_output.fasta
  • Non-redundant SINE library with serial number: $ Seed_SINE.fa
  • Whole-genome SINE annotation: $ Input_genome.fasta.out. This file contains high-similarity SINE annotations.
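To get a quick overview of a library file such as Seed_SINE.fa, the sequences and their lengths can be tallied with a few lines of Python. This is a minimal sketch that assumes a standard FASTA file; the function name `fasta_lengths` is hypothetical and not part of AnnoSINE.

```python
from typing import Dict


def fasta_lengths(path: str) -> Dict[str, int]:
    """Map each sequence ID (first word of the header) to its length."""
    lengths: Dict[str, int] = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                # Sequence lines may be wrapped, so accumulate across lines.
                lengths[name] += len(line)
    return lengths
```

For example, `len(fasta_lengths("Seed_SINE.fa"))` gives the number of entries in the non-redundant library, assuming the file sits in the current directory.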

Intermediate Files

  • SINE candidate information predicted by homology search: $ ../Family_Seq/Family_Name/Family_Name.out (requires mode 1 or 3)
  • SINE candidate sequences predicted by structure search: $ ../Input_Files/Input_genome-matches.fasta (requires mode 2 or 3)
  • Extended candidate sequences for TSD search: $ Step1_extend_tsd_input.fa
  • TSD identification outputs: $ Step2_tsd.txt
  • MSA extended input sequences flanked with TSD: $ Step2_extend_blast_input.fa
  • MSA output: $ Step3_blast_output.out
  • Intermediate sequences with MSA quality examination: $ Step3_blast_process_output.fa
  • SINE candidate sequences after MSA quality examination: $ Step4_rna_input.fasta
  • BLAST output of SINE candidates against the RNA database: $ Step4_rna_output.out
  • Classified SINE candidates after RNA examination: $ Step4_rna_output.fasta
  • TRF output: $ Step4_rna_output.fasta.2.5.7.80.10.10.2000.dat
  • SINE candidates after removing elements consisting of tandem repeats: $ Step5_trf_output.fasta
  • SINE candidate sequences after extension: $ Step6_irf_input.fasta
  • IRF output: $ Step6_irf_input.fasta.2.3.5.80.10.20.500000.10000.dat
  • SINE candidates after removing elements flanked with inverted repeats: $ Step6_irf_output.fasta
  • CD-HIT output: $ Step7_cluster_output.fasta.clstr
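The CD-HIT cluster file listed above shows how the redundant candidates were grouped. The sketch below reads a file in the usual CD-HIT `.clstr` layout (a `>Cluster N` header followed by member lines, with the representative marked by a trailing `*`) and returns the members of each cluster keyed by its representative. The function name `parse_clstr` is hypothetical, and the exact layout of the file AnnoSINE writes is assumed to follow that standard format.

```python
import re
from typing import Dict, List

# Member lines look like: "0\t150nt, >seq1... *" or "1\t148nt, >seq2... at +/98.00%"
MEMBER_NAME = re.compile(r">(.+?)\.\.\.")


def parse_clstr(path: str) -> Dict[str, List[str]]:
    """Return {representative_id: [member_ids]} for a CD-HIT .clstr file."""
    clusters: Dict[str, List[str]] = {}
    current: List[str] = []
    representative = None
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">Cluster"):
                # A new cluster begins, so store the previous one.
                if representative is not None:
                    clusters[representative] = current
                current, representative = [], None
            elif line:
                match = MEMBER_NAME.search(line)
                if match is None:
                    continue
                current.append(match.group(1))
                if line.endswith("*"):
                    representative = match.group(1)
    if representative is not None:
        clusters[representative] = current
    return clusters
```

Sorting the result by `len(members)` is one way to see which SINE families collapsed the most redundant candidates.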

Testing

You can test AnnoSINE with one chromosome of Arabidopsis thaliana (it takes about 6 minutes).

cd ./AnnoSINE/Testing
python3 ../bin/AnnoSINE.py -t 20 3 A.thaliana_Chr4.fasta ./Output_Files

The results of the test run are saved in Output_Files.

Citations
