TE_XXX in gff3 from panEDTA #462

CongLiu37 · 2024-05-09T09:07:36Z

Hello,

I am using EDTA + panEDTA to annotate the genomes of 40 related species. I annotated each genome individually with EDTA v2.2.0 and generated a panEDTA library. Then, for each genome, I ran:

RepeatMasker -e ncbi -pa 40 -q -div 40 -lib ${panEDTA.TElib} -cutoff 225 -gff ${genome}.mod.panEDTA > /dev/null
perl -i -nle 's/\s+DNA\s+/\tDNA\/unknown\t/; print $_' ${genome}.mod.panEDTA.out
EDTA.pl --genome ${genome} -t 40 --step final --anno 1 --curatedlib ${panEDTA.TElib} --cds ${cds} --rmout ${genome}.mod.panEDTA.out

These commands are copied from panEDTA.sh so that I can run the genomes in parallel.

As I understand it, each sequence in the panEDTA TE library should represent one TE family. I am trying to extract the genomic sequences for each TE family, and I found some unexpected Name values in the attributes field of TEanno.gff3:
(1) Some panTE_XXX names appear in the gff3 but not in panEDTA.TElib. Instead, panEDTA.TElib contains panTE_XXX_INT and panTE_XXX_LTR.
(2) Some TE_XXX names appear in the gff3 but not in panEDTA.TElib.
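
For reference, here is a minimal Python sketch of how the two cases above can be listed. It makes two assumptions that are not confirmed in this post: library FASTA headers look like `>ID#Classification`, and the gff3 attributes column carries a `Name=` key. The function names are hypothetical helpers, not part of EDTA.

```python
import re

def library_ids(fasta_text):
    """Collect sequence IDs from a FASTA library.

    Assumption: headers look like '>ID#Class/Superfamily'; the part
    after '#' is dropped so IDs can be compared with gff3 Name values.
    """
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0].split("#")[0])
    return ids

def gff3_names(gff3_text):
    """Collect the Name attribute of every feature line in a gff3 file."""
    names = set()
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue
        match = re.search(r"(?:^|;)Name=([^;]+)", cols[8])
        if match:
            names.add(match.group(1))
    return names

def classify_missing(names, lib):
    """Label each gff3 name that has no exact match in the library."""
    report = {}
    for name in sorted(names - lib):
        if f"{name}_INT" in lib or f"{name}_LTR" in lib:
            # Case (1): the library holds separate _INT / _LTR entries.
            report[name] = "split into _INT/_LTR in library"
        else:
            # Case (2): no related entry found under this naming scheme.
            report[name] = "absent from library"
    return report
```

Reading both files with `open(path).read()` and passing the text to these helpers gives a per-name report that separates case (1) from case (2).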

Lastly, how would you count the copy number of each TE family? I computed the ratio between the length of each annotated region in the gff3 and the length of the corresponding sequence in panEDTA.TElib, and it varies widely. Here are the quantiles of that ratio:

> quantile(df$lengthABOVETE.fam.len,na.rm =TRUE,probs=seq(0,1,0.1))
          0%          10%          20%          30%          40%          50% 
 0.005845817  0.080485612  0.116917626  0.162465915  0.221638655  0.288018433 
         60%          70%          80%          90%         100% 
 0.376657825  0.494324624  0.678725237  0.937500000 73.812785388
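
To make the ratio concrete, the sketch below shows one way it can be computed per family, together with three simple copy-number summaries. This is only an illustration of possible counting heuristics, not EDTA's own method: the 0.8 threshold, the function names, and the input layout (family name, 1-based inclusive start and end, as in gff3) are all assumptions.

```python
from collections import defaultdict

def length_ratios(intervals, family_len):
    """Ratio of each annotated region's length to its family's library length.

    intervals: iterable of (family, start, end), 1-based inclusive as in gff3.
    family_len: dict mapping family name to library sequence length.
    Regions whose family is missing from family_len are skipped.
    """
    ratios = defaultdict(list)
    for fam, start, end in intervals:
        if fam in family_len and family_len[fam] > 0:
            ratios[fam].append((end - start + 1) / family_len[fam])
    return dict(ratios)

def copy_counts(ratios, min_ratio=0.8):
    """Summarise copies per family in three ways.

    fragments: every annotated region counts as one copy.
    near_full_length: only regions covering >= min_ratio of the family
        length (min_ratio=0.8 is an arbitrary, hypothetical cutoff).
    length_equivalents: total annotated length in units of family length.
    """
    out = {}
    for fam, rs in ratios.items():
        out[fam] = {
            "fragments": len(rs),
            "near_full_length": sum(r >= min_ratio for r in rs),
            "length_equivalents": sum(rs),
        }
    return out
```

For example, a family of length 100 with regions of 100 bp and 50 bp yields 2 fragments, 1 near-full-length copy, and 1.5 length equivalents, which shows how strongly the chosen definition affects the count.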

I doubt that these extremely short or long regions are really transposons, and I am not sure whether it is a good idea to include them in an analysis of the evolution of individual TE families (e.g. copy number dynamics). Do you have any suggestions?

Sincerely,

Cong
