The pipeline

ChimeraTE is a pipeline to detect chimeric transcripts derived from genes and transposable elements (TEs). It has two running Modes:

Mode 1: detection of chimeric transcripts based on the positions of exons and TE copies in the genome sequence;
Mode 2: detection of chimeric transcripts regardless of genomic position, which allows the detection of chimeras from TEs that are not present in the reference genome, but with less sensitivity.

Install
Required data
ChimeraTE Mode 1
ChimeraTE Mode 2

Install

Conda

The installation is easiest with conda. If you don't have conda installed on your machine, please follow this tutorial.

Once you have installed conda, you need to enable the Bioconda channel with:

conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

Then, all dependencies to run ChimeraTE can be easily installed in a new conda environment by using the chimeraTE.yml file:

Download the repository from GitHub:
git clone https://github.com/OliveiraDS-hub/ChimeraTE.git

Change to the ChimeraTE folder:
cd ChimeraTE

Create chimeraTE environment with all dependencies:
conda env create -f chimeraTE.yml

Activate the new environment:
conda activate chimeraTE

Note: We advise you to return your condarc config to the default with:

conda config --remove channels bioconda
conda config --remove channels conda-forge
conda config --set channel_priority false

Singularity

As an alternative to conda, you can use Singularity v3.10.0+ to build a container with all dependencies for ChimeraTE.

If you don't have sudo permissions:

singularity build --fakeroot chimeraTE.simg singularity.def

If you have sudo:

sudo singularity build chimeraTE.simg singularity.def

Then, to run ChimeraTE:

singularity exec chimeraTE.simg python3 chimTE_mode1.py --help
singularity exec chimeraTE.simg python3 chimTE_mode2.py --help

Requirements

If you don't have conda or singularity, you can install all dependencies as an old school bioinformatician. It's important to highlight that all of them must be available in your PATH.

Python dependencies
Softwares

Required data

In order to run ChimeraTE, the following files are required according to the running Mode:

Data	Mode 1	Mode 2	Mode 2 --assembly
Stranded paired-end RNA-seq - Fastq files	X	X	X
Assembled genome - Fasta file with chromosomes/scaffolds/contigs sequences	X
Gene annotation - GTF file with gene annotations (UTRs,exons,CDS)	X
TE annotation - GTF file with TE insertions	X
Reference transcripts - Fasta file with reference transcripts		X	X
Reference TEs - Fasta with ref. TE insertions		X
Dfam taxonomy OR fasta with ref. TE consensuses			X

ChimeraTE genome-guided - Mode1

In Mode 1, chimeric transcripts are detected by considering the genomic location of TE insertions and exons. Chimeras from this Mode can be classified as TE-initiated, TE-exonized, and TE-terminated transcripts. Mode 1 does not detect chimeric transcripts derived from TE insertions that are absent from the provided reference genome.

cd ChimeraTE/
python3 chimTE_mode1.py --help

ChimeraTE Mode 1: The genome-guided approach to detect chimeric transcripts with RNA-seq data.

Required arguments:
  --genome      Genome in fasta
  --input       Paired-end files and their respective group/replicate
  --project     Directory name with output data
  --te          GTF file containing TE information
  --gene        GTF file containing gene information
  --strand      Define the strandness direction of the RNA-seq. Two options:
                "rf-stranded" OR "fwd-stranded"

Optional arguments:
  --chimera     Identify specific type of chimera: "TE-initiated" OR "TE-
                exonized" OR "TE-terminated"
  --window      Upstream and downstream window size (default = 3000)
  --replicate   Minimum recurrency of chimeric transcripts between RNA-seq
                replicates (default 2)
  --coverage    Minimum coverage (mean between replicates default 2 for
                chimeric transcripts detection)
  --fpkm        Minimum fpkm to consider a gene as expressed (default 1)
  --threads     Number of threads (default 6)
  --overlap     Minimum overlap between chimeric reads and TE insertions (default 0.50)
  --index       Absolute path to pre-existing STAR index

Prepare your data for Mode 1!

Input table

The tab-delimited table provided with --input must have a specific format:
- First column: Mate 1 from the paired-end data
- Second column: Mate 2 from the paired-end data
- Third column: Replicate/group name

mate1	mate2	rep
/home/user/ChimeraTE/mate1_control1.fastq.gz	/home/user/ChimeraTE/mate2_control1.fastq.gz	rep1
/home/user/ChimeraTE/mate1_control2.fastq.gz	/home/user/ChimeraTE/mate2_control2.fastq.gz	rep2
/home/user/ChimeraTE/mate1_control3.fastq.gz	/home/user/ChimeraTE/mate2_control3.fastq.gz	rep3

The column names above are shown for illustration only: the file itself must not contain a header, as in the example --input table at example_data/mode1/input_example.tsv
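The table format described above can be sketched in Python as follows. This is a minimal illustration, not part of ChimeraTE: the function names write_input_table and check_input_table are hypothetical, and the check only covers the column count and an accidental header line.

```python
import csv
import io

def write_input_table(rows, handle):
    """Write (mate1, mate2, replicate) rows as a header-less, tab-delimited table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for mate1, mate2, rep in rows:
        writer.writerow([mate1, mate2, rep])

def check_input_table(handle):
    """Return a list of problems found in a header-less three-column table."""
    problems = []
    for number, line in enumerate(handle, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            problems.append(f"line {number}: expected 3 columns, found {len(fields)}")
        elif number == 1 and fields == ["mate1", "mate2", "rep"]:
            problems.append("line 1: looks like a header, which must be absent")
    return problems

# Paths taken from the example table above.
rows = [
    ("/home/user/ChimeraTE/mate1_control1.fastq.gz",
     "/home/user/ChimeraTE/mate2_control1.fastq.gz", "rep1"),
    ("/home/user/ChimeraTE/mate1_control2.fastq.gz",
     "/home/user/ChimeraTE/mate2_control2.fastq.gz", "rep2"),
]
buffer = io.StringIO()
write_input_table(rows, buffer)
table_text = buffer.getvalue()
problems = check_input_table(io.StringIO(table_text))
```

To write a real file, pass an open file handle instead of the in-memory buffer.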

GTF for TEs

In many databases, the coordinates of TE insertions are provided as the .out file from RepeatMasker. If you already have a .out file from RepeatMasker, you can convert it to .gtf on Linux with:

tail -n +4 RMfile.out | egrep -v 'Satellite|Simple_repeat|rRNA|Low_complexity|RNA|ARTEFACT' | awk -v OFS='\t' '{Sense=$9;sub(/C/,"-",Sense);$9=Sense;print $5,"RepeatMasker","similarity",$6,$7,$2,$9,".",$10}' > RMfile.gtf

If you don't have the .out file for your genome assembly, check out the util section.
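For readers who prefer Python over the shell one-liner, the same conversion can be sketched as below. This is an illustrative equivalent, not a script shipped with ChimeraTE: the function name rm_out_to_gtf is hypothetical, the sample RepeatMasker lines are made up for the demonstration, and skipping lines with fewer than 10 fields is an added assumption that the awk command does not make.

```python
import re

# Same repeat classes that the egrep filter in the command above discards.
SKIP = re.compile(r"Satellite|Simple_repeat|rRNA|Low_complexity|RNA|ARTEFACT")

def rm_out_to_gtf(lines):
    """Yield tab-separated annotation lines from RepeatMasker .out lines."""
    for index, line in enumerate(lines):
        if index < 3:  # skip the three header lines, like tail -n +4
            continue
        if SKIP.search(line):
            continue
        fields = line.split()
        if len(fields) < 10:  # assumption: ignore blank or truncated lines
            continue
        strand = fields[8].replace("C", "-", 1)  # RepeatMasker marks the reverse strand with "C"
        yield "\t".join([fields[4], "RepeatMasker", "similarity",
                         fields[5], fields[6], fields[1], strand, ".", fields[9]])

# Made-up sample in RepeatMasker .out layout (two header lines and a blank line first).
sample_out = "\n".join([
    "   SW   perc perc perc  query     position in query    matching  repeat        position in repeat",
    "score   div. del. ins.  sequence  begin end   (left)   repeat    class/family  begin end (left) ID",
    "",
    "  463   1.3  0.6  1.7  chr2L  100  250  (1000)  C  roo    LTR/Pao        (0)  9092  8942  1",
    "   20   0.0  0.0  0.0  chr2L  300  330   (920)  +  (AT)n  Simple_repeat    1    31   (0)  2",
    "  900   2.0  0.0  0.0  chr2L  400  900   (350)  +  1360   DNA/P            1   501   (0)  3",
])
gtf_lines = list(rm_out_to_gtf(sample_out.splitlines()))
```

As in the shell command, the last column holds only the name of the matching repeat.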

Example Data Mode 1

After installation, you can run ChimeraTE with the example data from the sampled RNA-seq from D. melanogaster used in our paper.

#Do not forget to activate your conda environment:
conda activate chimeraTE

#One-line
python3 chimTE_mode1.py --genome example_data/mode1/dmel_genome_sample.fa --input example_data/mode1/input_mode1.tsv --project example_mode1 --te example_data/mode1/dmel_TEs_sample.gtf --gene example_data/mode1/dmel_genes_sample.gtf --strand rf-stranded

#Multi-line
python3 chimTE_mode1.py --genome example_data/mode1/dmel_genome_sample.fa \
--input example_data/mode1/input_mode1.tsv \
--project example_mode1 \
--te example_data/mode1/dmel_5TEs_sample.gtf \
--gene example_data/mode1/dmel_5genes_sample.gtf \
--strand rf-stranded

If you have more than 6 threads available on your machine, you can use --threads to speed up the process.

Output Mode 1

The output files can be found at ChimeraTE/projects/$your_project_name. For instance, for the example data, you can find the output at ChimeraTE/projects/example_mode1. Inside this directory, you may find three tables:

TE-initiated_final.ct
TE-exonized_final.ct
TE-terminated_final.ct

These tables contain the list of chimeric transcripts with the location of the genes and TE insertions generating chimeras, as well as their corresponding coverage of chimeric reads (support). In the 7th column of TE-exonized_final.ct, you can find the position of the TE within the gene region (Embedded, Intronic, or Overlapped). An example of each table is shown below:

=========================> TE-initiated_final.ct <=========================

gene_id	gene_strand	gene_pos	TE_id	TE_strand	TE_pos	chim_reads
FBgn0031188	-	X_RaGOO:21340686-21343686	S2	+	X_RaGOO:21341507-21342141	11.5

=========================> TE-exonized_final.ct <=========================

gene_id	gene_strand	gene_pos	TE_id	TE_strand	TE_pos	exonized_type	chim_reads
FBgn0285926	-	X_RaGOO:10476773-10513188	roo	-	X_RaGOO:10485868-10485985	Embedded	63.5
FBgn0052000	+	4_RaGOO:126456-137357	1360	+	4_RaGOO:133965-134061	Overlapped	4.5
FBgn0039923	-	4_RaGOO:761931-772400	FB	-	4_RaGOO:769101-769563	Intronic	91.0

=========================> TE-terminated_final.ct <=========================

gene_id	gene_strand	gene_pos	TE_id	TE_strand	TE_pos	chim_reads
FBgn0011747	-	4_RaGOO:106334-1093346	G5	-	4_RaGOO:109144-109334	5.0
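The tables above can be explored with a few lines of Python. The sketch below assumes that the .ct files are tab-delimited and start with the header row shown in the examples; the helper name read_ct and the threshold of 10 chimeric reads are hypothetical choices made for illustration.

```python
import csv
import io
from collections import Counter

def read_ct(handle):
    """Parse a tab-delimited *_final.ct table with a header row into dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))

# Rows copied from the TE-exonized_final.ct example above.
example = (
    "gene_id\tgene_strand\tgene_pos\tTE_id\tTE_strand\tTE_pos\texonized_type\tchim_reads\n"
    "FBgn0285926\t-\tX_RaGOO:10476773-10513188\troo\t-\tX_RaGOO:10485868-10485985\tEmbedded\t63.5\n"
    "FBgn0052000\t+\t4_RaGOO:126456-137357\t1360\t+\t4_RaGOO:133965-134061\tOverlapped\t4.5\n"
    "FBgn0039923\t-\t4_RaGOO:761931-772400\tFB\t-\t4_RaGOO:769101-769563\tIntronic\t91.0\n"
)
rows = read_ct(io.StringIO(example))

# Number of chimeras per TE position within the gene region (7th column).
by_type = Counter(row["exonized_type"] for row in rows)

# Genes whose chimeras are supported by at least 10 chimeric reads (arbitrary cutoff).
well_supported = [row["gene_id"] for row in rows if float(row["chim_reads"]) >= 10]
```

For a real run, open the table with open("projects/example_mode1/TE-exonized_final.ct") and pass the handle to read_ct.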

ChimeraTE genome-blinded - Mode 2

Mode 2 is designed to identify chimeric transcripts without the reference genome, predicting chimeras from both fixed and polymorphic TEs. In Mode 2, stranded RNA-seq reads are aligned twice: (1) against transcripts; (2) against TE insertions. From these alignments, all reads supporting chimeric transcripts (chimeric reads) are computed. These are read pairs whose two mates align as singletons to different references, one to a transcript and the other to a TE, or pairs that align concordantly in one of the alignments but have a singleton aligned mate in the other. There is also an option to perform de novo transcriptome assembly with the --assembly parameter. This additional analysis checks whether assembled gene transcripts contain TE-derived sequences.
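The two kinds of chimeric read pairs described above can be illustrated with a small Python sketch. It is a simplified reading of the text, not the code used by ChimeraTE: the function classify_pair is hypothetical, each pair is reduced to the set of mates (1 and/or 2) aligned to transcripts and the set aligned to TEs, and a pair with both mates aligned is treated as concordant without checking orientation or insert size.

```python
def classify_pair(transcript_mates, te_mates):
    """Classify one read pair from the mates aligned in each of the two alignments.

    transcript_mates / te_mates: iterables with the mate numbers (1, 2) that aligned.
    Returns "singleton/singleton", "concordant/singleton", or None (not chimeric).
    """
    in_transcripts = frozenset(transcript_mates)
    in_tes = frozenset(te_mates)
    # One mate aligned to a transcript and the *other* mate aligned to a TE.
    if len(in_transcripts) == 1 and len(in_tes) == 1 and in_transcripts != in_tes:
        return "singleton/singleton"
    # Both mates aligned in one alignment, a single mate aligned in the other.
    if {len(in_transcripts), len(in_tes)} == {1, 2}:
        return "concordant/singleton"
    return None

examples = {
    "pair_a": classify_pair({1}, {2}),        # mates split between transcript and TE
    "pair_b": classify_pair({1, 2}, {2}),     # concordant on transcript, mate 2 on TE
    "pair_c": classify_pair({1, 2}, set()),   # no TE alignment, not chimeric
}
```

Pairs for which the same single mate aligns in both alignments, or both mates align in both, return None in this sketch, because the text above does not describe them as chimeric reads.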

cd ChimeraTE/
python3 chimTE_mode2.py --help

ChimeraTE Mode 2: The genome-blinded approach to detect chimeric transcripts with RNA-seq data.

Required arguments:
  --input         Paired-end files and their respective group/replicate
  --project       Directory name with output data
  --te            Fasta file containing TE information
  --transcripts   Fasta file containing gene information
  --strand        Define the strandness direction of the RNA-seq. Two options:
                  "rf-stranded" OR "fwd-stranded"

Optional arguments:
  --coverage      Minimum coverage (mean between replicates default 2 for
                  chimeric transcripts detection)
  --fpkm          Minimum fpkm to consider a gene as expressed (default = 1)
  --replicate     Minimum recurrency of chimeric transcripts between RNA-seq
                  replicates (default = 2)
  --threads       Number of threads (default = 6)
  --assembly      Search for chimeric transcript with transcriptome assembly
                  with Trinity
  --ref_TEs       "species" database used by RepeatMasker (flies, human,
                  mouse, arabidopsis; or a built TE library in fasta format)
  --ram           Ram memory in Gbytes 
                  (default = 8)
  --overlap       Minimum overlap between chimeric reads and TE insertions
                  (default 0.50)
  --TE_length     Minimum TE length to keep it from RepeatMasker output
                  (default = 80bp)
  --identity      Minimum identity between de novo assembled transcripts and
                  reference transcripts (default = 80)

Prepare your data for Mode 2

Although the input files are simple fasta files, together with the paired-end RNA-seq reads, the sequence IDs for transcripts and TEs must follow a specific pattern. To make it easier to generate these formats, we provide util scripts to manage your data.

1. Reference transcripts (.fasta)

In order to run ChimeraTE correctly, this fasta file must have a specific header pattern. All IDs must be composed of the isoform ID first, followed by the gene name. For instance, in D. melanogaster, the gene FBgn0263977 has two transcripts:
Tim17b-RA_FBgn0263977
Tim17b-RB_FBgn0263977
Note that in these headers the isoform IDs "Tim17b-RA" and "Tim17b-RB" are separated from the gene name by "_". This is not a usual ID format, therefore we have developed auxiliary scripts ($FOLDER/ChimeraTE/util/) to convert native ID formats to the ChimeraTE format.
- transcripts_IDs_NCBI.sh (native IDs from NCBI to the ChimeraTE format)
- transcripts_IDs_ensembl.sh (native IDs from ENSEMBL to the ChimeraTE format)
- transcripts_IDs_FLYBASE.sh (native IDs from FLYBASE to the ChimeraTE format)
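To check that a transcripts fasta follows this header pattern, a short Python sketch such as the one below may help. It is not one of the util scripts: the function split_header is hypothetical, and it assumes that the gene name is everything after the last "_" in the header, which holds for the FlyBase-style IDs shown above but may not hold for every annotation.

```python
from collections import defaultdict

def split_header(header):
    """Split an 'isoformID_geneID' fasta header at its last underscore."""
    name = header.lstrip(">").strip()
    isoform, separator, gene = name.rpartition("_")
    if not separator or not isoform or not gene:
        raise ValueError(f"header does not match isoform_gene: {header!r}")
    return isoform, gene

# Headers taken from the example above.
headers = [">Tim17b-RA_FBgn0263977", ">Tim17b-RB_FBgn0263977"]

# Group isoforms by gene to confirm that the IDs are parsed as intended.
isoforms_by_gene = defaultdict(list)
for header in headers:
    isoform, gene = split_header(header)
    isoforms_by_gene[gene].append(isoform)
grouped = dict(isoforms_by_gene)
```

A header without an underscore, such as a native "Tim17b-RA", raises ValueError, which signals that the file still needs to be converted.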

Example Data Mode 2

After installation, you can run ChimeraTE Mode 2 with the example data from the sampled RNA-seq from D. melanogaster used in our paper.

#Do not forget to activate your conda environment:
conda activate chimeraTE

#One-line
python3 chimTE_mode2.py --input example_data/mode2/input_mode2.tsv --project example_mode2 --te example_data/mode2/dmel-sampled_TE-copies.fa --transcripts example_data/mode2/dmel-sampled_transcripts.fa --strand rf-stranded --assembly

#Multi-line
python3 chimTE_mode2.py --input example_data/mode2/input_mode2.tsv \
--project example_mode2 \
--te example_data/mode2/dmel-sampled_TE-copies.fa \
--transcripts example_data/mode2/dmel-sampled_transcripts.fa \
--strand rf-stranded \
--assembly

Mode 2 will run with the default settings for threads and RAM (6 threads and 8 GB, according to the help above), but you can speed up the analysis by increasing these values with --threads and --ram, respectively.

NOTE: If you are not working with Drosophila data, do not forget to change the --ref_TEs parameter, providing a Dfam taxonomy level to use with RepeatMasker, or a fasta with TE consensuses.

Output Mode 2

The output files can be found at ChimeraTE/projects/$your_project_name. For instance, for the example data, you can find the output at ChimeraTE/projects/example_mode2. Inside this directory, you may find three tables:

chimreads_evidence_FINAL.tsv
In the "chimreads_evidence" table, you will find chimeric transcripts supported only by paired-end reads that have mapped to both transcript and TE sequences (singletons and concordant/singleton; check the manuscript's methods).
transcriptome_evidence_FINAL.tsv
In the "transcriptome_evidence" table, you will find chimeras supported only by the transcriptome assembly method (if you have activated the --assembly option). This table provides the gene, TE family, and the respective assembled transcript ID in which a TE sequence was found.
double_evidence_FINAL.tsv
Finally, "double_evidence" is the list of chimeras predicted by both previous methods (strong evidence!), containing all information from both previous tables.

=========================> chimreads_evidence_FINAL.tsv <=========================

gene_id	TE_family	chim_reads	transcript_ID	transcript_FPKM
FBgn0058160	DNAREP1	60.0	CG40160-RH_FBgn0058160	62177.475

=========================> transcriptome_evidence_FINAL.tsv <=========================

gene_id	TE_family	transcript_ID	Trinity_transcripts	Identity_transcripts	trinity_length	ref_transcript_length	match_length	chim_reads
FBgn0286778	HMSBEAGLE_I	CG46385-RA	TRINITY_DN87_c0_g1_i1; TRINITY_DN88_c0_g1_i1	97.992	741.5	5129.0	732.0	31.0

=========================> double_evidence_FINAL.tsv <=========================

gene_id	TE_family	chim_reads	masked_family	chim_reads_masked	ref_transcript_FPKM	Trinity_transcripts	Identity_transcripts	trinity_length	ref_transcript_length	match_length	ref_transcript_IDs
FBgn0001169	ROO	32.0	ROO_I	4.0	3011.5781	TRINITY_DN13_c0_g1_i3; TRINITY_DN13_c0_g1_i2	100.0	604.5	4069.5	603.5	H-RD; H-RB H-RD_FBgn0001169; H-RB_FBgn0001169; H-RA_FBgn0001169

Please cite us!

Daniel S Oliveira, Marie Fablet, Anaïs Larue, Agnès Vallier, Claudia M A Carareto, Rita Rebollo, Cristina Vieira. ChimeraTE: A pipeline to detect chimeric transcripts derived from genes and transposable elements. Nucleic Acids Research, 2023. https://doi.org/10.1093/nar/gkad671

Development and help

To report bugs and give us suggestions, you can open an issue on the github repository.

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
