Combining RM rows #443

Open
aerilli opened this issue Mar 6, 2024 · 5 comments
Labels
question Further information is requested

Comments

@aerilli

aerilli commented Mar 6, 2024

Hi Shujun,

Thanks again for developing this amazing package!
I am running the newest release, v2.2. I manually increased the maximum divergence for fragments to be combined from 3.5 to 4.5 at https://github.com/oushujun/EDTA/blob/v2.2.0/EDTA.pl#L694
The fragments below should be combined into two distinct elements, but this does not seem to happen even though they overlap. This is what the annotation looks like:
[image: screenshot of the annotation]
The first three and the last two fragments should be merged. The gap in between is 200 bp.
From my $genome.out.new:

10111    3.7  0.6  1.4  Chr5        19872566 19873827 (10543892) C VANDAL21               DNA/MULE-MuDR                 (4)    8240      6988  98588
16698    3.5  0.7  1.2  Chr5        19873941 19875999 (10541720) C VANDAL21               DNA/MULE-MuDR              (1286)    6958      4910  98588  
23201    1.5  2.4  0.0  Chr5        19878212 19880916 (10536803) + VANDAL21               DNA/Mutator                  2006    4775    (3469)  98592
16730    3.4  0.6  1.3  Chr5        19880889 19883063 (10534656) + VANDAL21               DNA/Mutator                  4910    7166    (1078)  98593 *
9665    3.6  2.8  1.5  Chr5        19883061 19884298 (10533421) + VANDAL21               DNA/Mutator                  6988    8240       (4)  98594 *

Do you have any idea why this is happening?
Thanks!!

@oushujun
Owner

Hi,

The directions of these entries differ, and the physical distances between them are too large. The last two entries are close enough, but their TE coordinates overlap substantially (4910-7166 vs 6988-8240), so they cannot be considered a single element.

Thanks!
Shujun
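To make the two coordinate systems in this reply concrete, here is a minimal Python sketch that parses the five rows pasted above and reports, for each adjacent pair, the physical gap and the overlap in TE (consensus) coordinates. The helper names `parse_rm_row` and `overlap` are hypothetical and are not part of EDTA. The only format fact relied on is that RepeatMasker `.out` rows list the repeat columns as begin, end, (left) on `+` hits and as (left), end, begin on `C` hits.

```python
# Illustrative helpers (not part of EDTA): parse RepeatMasker .out rows and
# compare their genome and consensus (TE) intervals.

def parse_rm_row(line):
    f = line.split()
    on_plus = f[8] == "+"
    # '+' rows list: begin, end, (left); 'C' rows list: (left), end, begin.
    te_start = int(f[11]) if on_plus else int(f[13])
    te_end = int(f[12])
    return {"id": f[14], "strand": "+" if on_plus else "-",
            "start": int(f[5]), "end": int(f[6]),
            "te_start": te_start, "te_end": te_end}

def overlap(a_start, a_end, b_start, b_end):
    """Shared positions between two closed intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)

ROWS = """\
10111    3.7  0.6  1.4  Chr5        19872566 19873827 (10543892) C VANDAL21               DNA/MULE-MuDR                 (4)    8240      6988  98588
16698    3.5  0.7  1.2  Chr5        19873941 19875999 (10541720) C VANDAL21               DNA/MULE-MuDR              (1286)    6958      4910  98588
23201    1.5  2.4  0.0  Chr5        19878212 19880916 (10536803) + VANDAL21               DNA/Mutator                  2006    4775    (3469)  98592
16730    3.4  0.6  1.3  Chr5        19880889 19883063 (10534656) + VANDAL21               DNA/Mutator                  4910    7166    (1078)  98593 *
9665    3.6  2.8  1.5  Chr5        19883061 19884298 (10533421) + VANDAL21               DNA/Mutator                  6988    8240       (4)  98594 *
"""
hits = [parse_rm_row(r) for r in ROWS.splitlines() if r.strip()]

for a, b in zip(hits, hits[1:]):
    genome_gap = b["start"] - a["end"] - 1  # negative means the hits overlap
    te_ov = overlap(a["te_start"], a["te_end"], b["te_start"], b["te_end"])
    print(a["id"], b["id"], a["strand"], b["strand"], genome_gap, te_ov)
```

On these rows, the two `C` hits are 113 bp apart on the genome, the second and third hits are on different strands, and the last two hits share 179 consensus positions (6988-7166).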

@oushujun oushujun added the question Further information is requested label Mar 14, 2024
@aerilli
Author

aerilli commented Mar 15, 2024

Hey Shujun,

Thanks for the clarification! So if a substantial overlap is detected, the fragments cannot be considered a single element.
However, it is still a bit unclear to me how this translates into the final annotation of this region, which looks like this:

Chr5    EDTA    Mutator_TIR_transposon  19872566        19873827        10111   -       .       ID=TE_homo_95784;Name=VANDAL21;classification=DNA/MULE-MuDR;sequence_ontology=SO:0002280;identity=0.963;method=homology;ID=TE_homo_98670;sequence_ontology=SO:0002280
Chr5    EDTA    Mutator_TIR_transposon  19873825        19874206        3057    -       .       ID=TE_homo_95785;Name=VANDAL21;classification=DNA/Mutator;sequence_ontology=SO:0002280;identity=0.968;method=homology;ID=TE_homo_98671;sequence_ontology=SO:0002280
Chr5    EDTA    Mutator_TIR_transposon  19873941        19877095        12213   -       .       ID=TE_homo_95786;Name=VANDAL21;classification=DNA/MULE-MuDR;sequence_ontology=SO:0002280;identity=0.966;method=homology;ID=TE_homo_98672;sequence_ontology=SO:0002280
Chr5    EDTA    Mutator_TIR_transposon  19877284        19883063        18267   +       .       ID=TE_homo_95787;Name=VANDAL21;classification=DNA/Mutator;sequence_ontology=SO:0002280;identity=0.976;method=homology;ID=TE_homo_98673;sequence_ontology=SO:0002280
Chr5    EDTA    Mutator_TIR_transposon  19883061        19884298        9665    +       .       ID=TE_homo_95788;Name=VANDAL21;classification=DNA/Mutator;sequence_ontology=SO:0002280;identity=0.964;method=homology;ID=TE_homo_98674;sequence_ontology=SO:0002280

In at least two cases here, the overlap is not substantial and the direction is the same.

Many thanks for your support, Shujun! :)

@oushujun
Owner

The GFF rows you pasted seem to contain extra information compared to the RM out rows. To combine rows, the physical coordinates, the direction, the TE coordinates, and the divergence all need to be considered. If the physical coordinates, direction, and divergence meet the criteria but the TE coordinates overlap substantially, the rows are still considered two elements. If the TE coordinates are separated by a large distance but are in a compatible order (the first piece has the smaller 5' coordinates), the rows are still considered a single element. In that case, the annotated TE has a large deletion.

Shujun
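The criteria listed in this reply can be sketched as a single predicate. This is a rough illustration under stated assumptions, not the logic of `combine_RMrows.pl`: the `Hit` class and `can_join` function are hypothetical, the defaults 35 and 3.5 are simply the `-maxgap` and `-maxdiv` values quoted elsewhere in this thread, the divergence test is assumed to compare the difference between the two hits, and the `max_te_overlap` default of 50 is made up because the thread never states what counts as a "substantial" overlap. The sketch also assumes both hits are on the same query sequence and that on the minus strand the upstream genomic piece carries the larger consensus coordinates.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    strand: str    # "+" or "-"
    family: str
    div: float     # percent divergence
    start: int     # genome coordinates, 1-based inclusive
    end: int
    te_start: int  # consensus (TE) coordinates, 1-based inclusive
    te_end: int

def can_join(a, b, max_gap=35, max_div_diff=3.5, max_te_overlap=50):
    """Return True if hit `a` and the next downstream hit `b` look like one element.

    All thresholds are illustrative; max_te_overlap in particular is made up.
    """
    if a.family != b.family or a.strand != b.strand:
        return False
    if b.start - a.end - 1 > max_gap:      # physical gap too large
        return False
    if abs(a.div - b.div) > max_div_diff:  # assumed reading of -maxdiv
        return False
    # Order the pieces 5'->3' on the consensus: genome order on '+', reversed on '-'.
    first, second = (a, b) if a.strand == "+" else (b, a)
    if second.te_start < first.te_start:   # pieces are out of order
        return False
    te_overlap = max(0, min(first.te_end, second.te_end) - second.te_start + 1)
    return te_overlap <= max_te_overlap

# Rows from the Chr5 example earlier in the thread.
r1 = Hit("-", "VANDAL21", 3.7, 19872566, 19873827, 6988, 8240)
r2 = Hit("-", "VANDAL21", 3.5, 19873941, 19875999, 4910, 6958)
r3 = Hit("+", "VANDAL21", 1.5, 19878212, 19880916, 2006, 4775)
r4 = Hit("+", "VANDAL21", 3.4, 19880889, 19883063, 4910, 7166)
r5 = Hit("+", "VANDAL21", 3.6, 19883061, 19884298, 6988, 8240)

print(can_join(r1, r2))  # physical gap of 113 bp exceeds max_gap
print(can_join(r3, r4))  # consensus pieces are in order and do not overlap
print(can_join(r4, r5))  # consensus pieces overlap by 179 positions
```

Under these assumptions only the `r3`/`r4` pair passes. Keeping the consensus-overlap tolerance as an explicit parameter makes it easy to see how sensitive a given region is to that threshold.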

@baozg
Contributor

baozg commented Mar 22, 2024

Hi, Shujun

Sorry for jumping into this conversation. What we don't understand is why some rows are still not joined even when they meet all the criteria in the script.

Here is the command and the small working example I used:
perl combine_RMrows.pl -rmout test -maxgap 35 -maxdiv 3.5
So rows from the same family, on the same strand, with a gap of less than 35 bp and a divergence difference between the two elements of less than 3.5 will be joined, right?

But look at these three rows:

# before joining
SW   perc perc perc  query       position in query              matching               repeat                           position in repeat
score   div. del. ins.  sequence    begin    end          (left)   repeat                 class/family            begin     end     (left)        ID
30291    4.5  0.2  0.4  Chr3        17485555 17489789  (8669366) + VANDAL12               DNA/Mutator                     1    4200    (9966)  64678 *
38777    2.6  0.5  0.2  Chr3        17489775 17494536  (8664619) + VANDAL12               DNA/Mutator                  3442    7944    (4030)  64679
26487    1.4  0.2  0.0  Chr3        17494533 17497540  (8661615) + VANDAL12               DNA/Mutator                  8849   11860     (114)  64680 *

# after joining
SW_score        perc_div.       perc_del.       perc_ins.       query_sequence  query_begin     query_end       query_remain    strand  matching_repeat repeat_class/family     repeat_begin  repeat_end       repeat_remain   ID
30291   4.5     0.2     0.4     Chr3    17485555        17489789        8669366 +       VANDAL12        DNA/Mutator     1       4200    (9966)  64678
34020   2.1     0.4     0.1     Chr3    17489775        17497540        8661615 +       VANDAL12        DNA/Mutator     3442    11860   (114)   64679_64680

So 64679 and 64680 were joined (see 64679_64680 in the ID column), but why wasn't 64678 joined with 64679_64680?
✅ Same family (VANDAL12)
✅ Same strand (+)
✅ Overlapping (17485555-17489789 with 17489775-17497540; 14 bp overlap). How large an overlap does this script tolerate? We don't think this is a substantial overlap.
✅ Divergence (4.5-2.1=2.4)

@baozg
Contributor

baozg commented Apr 2, 2024

For anyone interested in this merging behavior: the case I pasted here didn't merge because of the overlap in the repeat consensus coordinates (the last four columns). 1-4200 overlaps 3442-11860 by roughly 760 bp.
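The arithmetic behind this can be checked with a few lines of Python. This is only a worked calculation on the three VANDAL12 rows above; the helper names `closed_overlap` and `closed_gap` are hypothetical, and intervals are treated as 1-based and inclusive, so the genome overlap comes out as 15 positions (14 if the coordinates are simply subtracted).

```python
def closed_overlap(a, b):
    """Shared positions between two closed intervals (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def closed_gap(a, b):
    """Positions strictly between a and b when a ends before b starts (else 0)."""
    return max(0, b[0] - a[1] - 1)

# Genome and consensus intervals of the three rows, keyed by RepeatMasker ID.
genome = {"64678": (17485555, 17489789),
          "64679": (17489775, 17494536),
          "64680": (17494533, 17497540)}
consensus = {"64678": (1, 4200),
             "64679": (3442, 7944),
             "64680": (8849, 11860)}

genome_ov = closed_overlap(genome["64678"], genome["64679"])
consensus_ov = closed_overlap(consensus["64678"], consensus["64679"])
consensus_gap = closed_gap(consensus["64679"], consensus["64680"])

print(genome_ov)      # small overlap on the genome
print(consensus_ov)   # large overlap on the consensus (positions 3442-4200)
print(consensus_gap)  # consensus positions missing between 64679 and 64680
```

The first pair overlaps by 759 consensus positions, which is why it stays split, while the joined pair 64679_64680 skips 904 consensus positions, matching the "large deletion" case described earlier in the thread.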
