RM_TRIPS

RepeatMasker Trinity based Parse Script

Author = Christopher L Butler

OUTLINE

This R script parses RepeatMasker .out files generated from de novo Trinity assemblies to improve transposable element (TE) annotation across whole-transcriptome sequences. Output is written in .csv format so that further analyses can be carried out with ease.

The script conducts four key steps:

Repetitive elements not classed as TEs (e.g. microsatellites, simple repeats and sRNAs) are removed.
TEs found on the same transcript are merged if they have the same element name and orientation, and their combined sequence length is less than or equal to the length of the corresponding reference sequence in the focal TE library.
Where multiple copies of the same element are found across different isoforms of a transcript, only one is retained. This ensures that each transposable element corresponds to a unique genomic locus.
Merged repeats shorter than 80 bp are removed.
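
The four steps above can be sketched in base R as follows. This is a minimal illustration, not the code in RM_TRIPS.R: the function name parse_hits(), the input column names (strand, qry_start, qry_end), the list of non-TE classes, and the choice to keep the longest copy per gene are all assumptions made for the example. "Combined sequence length" is interpreted here as the span from the first start to the last end of the grouped hits, which may differ from the script's own rule.

```r
# Illustrative sketch only: parse_hits(), its input column names and the
# non-TE class list are assumptions, not taken from RM_TRIPS.R.
parse_hits <- function(hits, min_len = 80) {
  non_te <- c("Simple_repeat", "Low_complexity", "Satellite",
              "rRNA", "tRNA", "snRNA", "scRNA", "srpRNA")

  # Step 1: drop repeats whose top-level class is not a TE.
  top_class <- sub("/.*$", "", hits$matching_class)
  hits <- hits[!top_class %in% non_te, , drop = FALSE]
  if (nrow(hits) == 0) return(hits)  # nothing left to merge

  # Step 2: merge hits sharing transcript, element name and orientation
  # when the merged span fits within the reference length (assumed rule).
  groups <- split(hits, list(hits$qry_id, hits$repeat_id, hits$strand),
                  drop = TRUE)
  merged <- do.call(rbind, lapply(groups, function(g) {
    start <- min(g$qry_start)
    end <- max(g$qry_end)
    if (isTRUE(end - start + 1 <= g$reference_length[1])) {
      g <- g[1, , drop = FALSE]
      g$qry_start <- start
      g$qry_end <- end
    }
    g
  }))
  merged$mergedfraglength <- merged$qry_end - merged$qry_start + 1

  # Step 3: keep one copy per gene and element (here: the longest one).
  merged$gene <- sub("_i[0-9]+$", "", merged$qry_id)
  merged <- merged[order(-merged$mergedfraglength), , drop = FALSE]
  merged <- merged[!duplicated(merged[c("gene", "repeat_id")]), , drop = FALSE]

  # Step 4: remove merged repeats shorter than min_len bp.
  merged <- merged[merged$mergedfraglength >= min_len, , drop = FALSE]
  rownames(merged) <- NULL
  merged
}

# Made-up example hits (not real data).
hits <- data.frame(
  qry_id = c("TRINITY_DN1_c0_g1_i1", "TRINITY_DN1_c0_g1_i1",
             "TRINITY_DN1_c0_g1_i2", "TRINITY_DN2_c0_g1_i1",
             "TRINITY_DN3_c0_g1_i1"),
  repeat_id = c("hAT-1", "hAT-1", "hAT-1", "(CA)n", "Gypsy-2"),
  matching_class = c("DNA/hAT", "DNA/hAT", "DNA/hAT",
                     "Simple_repeat", "LTR/Gypsy"),
  strand = c("+", "+", "+", "+", "C"),
  qry_start = c(10, 120, 15, 5, 1),
  qry_end = c(100, 200, 180, 60, 50),
  reference_length = c(500, 500, 500, NA, 3000),
  stringsAsFactors = FALSE
)
result <- parse_hits(hits)
```

With this toy input, the simple repeat is dropped, the two hAT-1 fragments on isoform i1 are merged into one hit spanning positions 10 to 200, the shorter hAT-1 copy on isoform i2 is discarded, and the 50 bp Gypsy-2 hit falls below the length cutoff.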

USAGE

The R script is compatible with any RepeatMasker output file (.out) derived from Trinity-based transcriptome sequences.

Necessary inputs include:

The RepeatMasker.out file
The RepeatMasker library used (e.g. RepBase or a custom repeat library) in .fasta format
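
For orientation, the two inputs could be loaded in base R roughly as shown here. This is a sketch under stated assumptions rather than the script's own loading code: read_rm_out() and fasta_lengths() are hypothetical helper names, the column names are invented for the example, the .out file is assumed to have the standard three header lines, and library headers are assumed to follow the ">name#class/family" pattern.

```r
# Hypothetical helpers; names and column labels are assumptions.

# Read a RepeatMasker .out table: three header lines, whitespace-separated
# fields, and an optional 16th field holding "*" for overlapped matches.
read_rm_out <- function(path) {
  cols <- c("sw_score", "perc_div", "perc_del", "perc_insert",
            "qry_id", "qry_start", "qry_end", "qry_left", "strand",
            "repeat_id", "matching_class",
            "rep_pos1", "rep_pos2", "rep_pos3", "id", "overlap")
  read.table(path, skip = 3, header = FALSE, fill = TRUE,
             col.names = cols, quote = "", comment.char = "",
             stringsAsFactors = FALSE)
}

# Sequence length per FASTA record, named by the header text up to the
# first '#' or whitespace. Assumes the file starts with a header line.
fasta_lengths <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)
  ids <- sub("^>([^#[:space:]]+).*$", "\\1", lines[is_header])
  per_record <- tapply(nchar(lines[!is_header]), record[!is_header], sum)
  lens <- setNames(integer(length(ids)), ids)
  lens[as.integer(names(per_record))] <- per_record
  lens
}
```

The lengths returned by fasta_lengths() can then be looked up by repeat name, for example lens[hits$repeat_id], to attach a reference length to each hit, provided the names in the library match those reported in the .out file.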

The output is a .csv file written to the same directory as the .out file.

Column Header	Description
repeat_id	Name of TE with the significant hit
qry_id	Name of Trinity transcript with TE hit
matching_repeat	Whether the match is to the complement (C) of the TE sequence
matching_class	The transposon class to which the TE belongs
reference_length	Sequence length of the TE as found in the reference library
merged_qrystart	Start of TE hit found on the transcript
merged_qryend	End of TE hit found on transcript
mergedfraglength	Sequence length of TE hit (bp)
perc_div	% of substitutions in the matching region compared to the consensus
perc_del	% of bases opposite a gap in the query sequence
perc_insert	% of bases opposite a gap in the repeat sequence
Gene	Gene name
Isoform	Isoform number
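
The Gene and Isoform columns reflect Trinity's transcript naming, in which an identifier such as TRINITY_DN1000_c115_g5_i1 names isoform 1 of gene TRINITY_DN1000_c115_g5. A small sketch of how such identifiers can be split in base R is given here; split_trinity_id() is a hypothetical helper, and whether RM_TRIPS.R derives the columns in this way is an assumption.

```r
# Hypothetical helper: split Trinity transcript IDs into gene and isoform.
# IDs that do not end in "_i<number>" keep their full name as the gene and
# get NA as the isoform.
split_trinity_id <- function(qry_id) {
  has_iso <- grepl("_i[0-9]+$", qry_id)
  iso <- rep(NA_integer_, length(qry_id))
  iso[has_iso] <- as.integer(sub("^.*_i([0-9]+)$", "\\1", qry_id[has_iso]))
  data.frame(
    qry_id = qry_id,
    Gene = sub("_i[0-9]+$", "", qry_id),
    Isoform = iso,
    stringsAsFactors = FALSE
  )
}

ids <- split_trinity_id(c("TRINITY_DN1000_c115_g5_i1",
                          "TRINITY_DN42_c0_g1_i12"))
```

Grouping by Gene rather than by qry_id is what allows hits on different isoforms of the same gene to be treated as copies of one locus.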

Note

Before running RM_TRIPS, you may wish to ensure that your RepeatMasker .out file contains only distinct repeats. RepeatMasker marks a match with an asterisk (*) in the final output column when a higher-scoring match exists whose domain partly includes the domain of that match; removing these flagged lines leaves only the distinct repeats.

This can be achieved with the following shell command:

awk '!/\*/' "${file}.out" > "noasterisk${file}.out"

Below is a table which details the impact each stage of the parse script has on estimated TE abundance. In this instance (Corydoras maculifer), running RM_TRIPS on a RepeatMasker output against a Danio rerio TE library reduces the estimated TE abundance by about two thirds, from 3.58% to 1.17%.
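
As a rough guide to how such percentages can be computed and compared, the sketch below defines TE abundance as the total TE base pairs divided by the total transcriptome length. That definition and both function names are assumptions for illustration; the source does not state how the reported figures were calculated.

```r
# Hypothetical helpers; the abundance definition is an assumption.

# Percentage of the transcriptome covered by TE hits.
te_abundance <- function(te_bp, transcriptome_bp) {
  100 * sum(te_bp) / transcriptome_bp
}

# Relative drop (in %) between two abundance estimates.
relative_reduction <- function(before, after) {
  100 * (before - after) / before
}

# Figures quoted in the text: 3.58% before parsing, 1.17% after.
reduction <- round(relative_reduction(3.58, 1.17), 1)  # 67.3
```

For example, te_abundance(result$mergedfraglength, total_bp) would give the post-parsing estimate, where total_bp is the summed length of all Trinity transcripts and result is the parsed table.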
