TE_XXX in gff3 from panEDTA #462

Open
CongLiu37 opened this issue May 9, 2024 · 0 comments


Hello,

I am using EDTA+panEDTA to annotate the genomes of 40 related species. I annotated each genome individually with EDTA v2.2.0 and generated a panEDTA library. Then, for each genome, I ran:

RepeatMasker -e ncbi -pa 40 -q -div 40 -lib ${panEDTA.TElib} -cutoff 225 -gff ${genome}.mod.panEDTA > /dev/null
perl -i -nle 's/\s+DNA\s+/\tDNA\/unknown\t/; print $_' ${genome}.mod.panEDTA.out
EDTA.pl --genome ${genome} -t 40 --step final --anno 1 --curatedlib ${panEDTA.TElib} --cds ${cds} --rmout ${genome}.mod.panEDTA.out

These commands are copied from panEDTA.sh so that the genomes can be processed in parallel.

As I understand it, each sequence in the panEDTA TE library should represent one TE family. I am trying to extract the genomic sequences of each TE family, but I found some unexpected Name values in the attributes field of TEanno.gff3:
(1) Some panTE_XXX names appear in the gff3 but not in panEDTA.TElib; the library contains panTE_XXX_INT and panTE_XXX_LTR instead.
(2) Some TE_XXX names appear in the gff3 but not in panEDTA.TElib.
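
The check behind (1) and (2) can be sketched roughly as below. This is only an illustration, not part of EDTA: the function names are hypothetical, and it assumes that library FASTA headers look like `>name#classification` and that each gff3 record carries a `Name=` attribute.

```python
import re
from collections import Counter


def library_names(fasta_lines):
    """Collect sequence names from FASTA headers.

    Assumption: headers look like '>name#class/superfamily'; anything
    after '#' or whitespace is dropped.
    """
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0].split("#")[0])
    return names


def gff3_names(gff_lines):
    """Count how often each Name= attribute value occurs in a GFF3 file."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF3 feature line
        match = re.search(r"(?:^|;)Name=([^;]+)", cols[8])
        if match:
            counts[match.group(1)] += 1
    return counts


def missing_from_library(gff_counts, lib_names):
    """Map each GFF3 name without an exact library match to the library
    entries that extend it with a suffix (e.g. name_INT, name_LTR)."""
    report = {}
    for name in sorted(gff_counts):
        if name not in lib_names:
            report[name] = sorted(
                n for n in lib_names if n.startswith(name + "_")
            )
    return report


# Example use (file names are placeholders):
# with open("panEDTA.TElib.fa") as lib, open("genome.TEanno.gff3") as gff:
#     report = missing_from_library(gff3_names(gff), library_names(lib))
# for name, related in report.items():
#     print(name, related or "no related library entry")
```

A name that maps to an empty list corresponds to case (2), while a name that maps to `_INT`/`_LTR` entries corresponds to case (1).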

Lastly, how would you count the copy number of each TE family? For each region in the gff3, I computed the ratio of its length to the length of the corresponding sequence in panEDTA.TElib, and the ratio varies widely. Here are its quantiles:

> quantile(df$lengthABOVETE.fam.len,na.rm =TRUE,probs=seq(0,1,0.1))
          0%          10%          20%          30%          40%          50% 
 0.005845817  0.080485612  0.116917626  0.162465915  0.221638655  0.288018433 
         60%          70%          80%          90%         100% 
 0.376657825  0.494324624  0.678725237  0.937500000 73.812785388 

I doubt that these extremely short or long regions are real transposons, and I am not sure whether it is a good idea to include them in an analysis of the evolution of individual TE families (e.g. copy number dynamics). Do you have any suggestions?
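
To make the question concrete, here is a minimal sketch of the length ratio and of a copy count restricted to a ratio window. The function names and the `min_ratio`/`max_ratio` parameters are hypothetical and are not a recommended cutoff; the sketch also assumes `>name#classification` library headers and a `Name=` attribute in the gff3, and it skips regions whose Name has no exact match in the library.

```python
import re
from collections import defaultdict


def fasta_lengths(fasta_lines):
    """Return {sequence name: length}; text after '#' in a header is dropped."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0].split("#")[0] if fields else None
            if name is not None:
                lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def length_ratios(gff_lines, lengths):
    """Yield (name, region length / library sequence length) per feature."""
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        match = re.search(r"(?:^|;)Name=([^;]+)", cols[8])
        if not match:
            continue
        name = match.group(1)
        if not lengths.get(name):
            continue  # name absent from the library (or empty sequence)
        # GFF3 coordinates are 1-based and inclusive.
        span = int(cols[4]) - int(cols[3]) + 1
        yield name, span / lengths[name]


def copy_counts(ratios, min_ratio=0.0, max_ratio=float("inf")):
    """Count regions per family whose ratio lies in [min_ratio, max_ratio]."""
    counts = defaultdict(int)
    for name, ratio in ratios:
        if min_ratio <= ratio <= max_ratio:
            counts[name] += 1
    return dict(counts)
```

With the default bounds every region counts as one copy; narrowing the window shows how sensitive the per-family copy number is to the short and long outliers.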

Sincerely,

Cong
