Manage annotation with joined coordinates #206

Merged
merged 77 commits into from
Jun 3, 2024

JeanMainguy (Member) commented Mar 28, 2024

PPanGGOLiN mishandled joined coordinates in input annotation files (GFF or GBFF): such annotations were ignored in GBFF files and parsed incorrectly in GFF files.

This PR fixes the issue by handling these annotations properly.

Implemented solution

  • Joined annotations are now handled when parsing the input files, so that each start and stop coordinate pair is retrieved.
  • The coordinates are stored in the coordinates attribute of gene objects. This attribute is a list of (start, stop) tuples.
  • To preserve the joined coordinates, a new table called joinedCoordinates has been added to the HDF5 file; it stores the multiple start and stop positions of joined genes.
  • pytest functions have been added to check the correctness of the implementation.
  • Circular RGPs are now handled properly, which addresses problems observed in the spot plot (see issue #124, "RGP drawn in spot figure is incorrect") and in the ProkSee output.
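
As an illustration of the first three points, here is a minimal sketch of how a GenBank-style joined location could be turned into a list of (start, stop) tuples and then flattened into one row per segment. This is not the code from this PR: the function names `parse_joined_location` and `to_table_rows`, and the (gene id, start, stop) row layout, are assumptions made for the example. It only recognises `start..stop` ranges and an outer `complement(...)`.

```python
import re
from typing import List, Tuple

# Hypothetical helpers for illustration only; not PPanGGOLiN's actual parser.
# Matches "start..stop", tolerating the partial-feature markers "<" and ">".
_RANGE = re.compile(r"[<>]?(\d+)\.\.[<>]?(\d+)")


def parse_joined_location(location: str) -> Tuple[List[Tuple[int, int]], bool]:
    """Return (coordinates, is_complement) for a GenBank-style location string.

    coordinates is a list of (start, stop) tuples, one per joined segment,
    in the order they appear in the location string.
    """
    location = location.strip()
    # Only an outer complement(...) is detected in this sketch.
    is_complement = location.startswith("complement(")
    coordinates = [(int(start), int(stop)) for start, stop in _RANGE.findall(location)]
    if not coordinates:
        raise ValueError(f"No start..stop range found in {location!r}")
    return coordinates, is_complement


def to_table_rows(gene_id: str, coordinates: List[Tuple[int, int]]) -> List[Tuple[str, int, int]]:
    """Flatten a gene's segments into one (gene_id, start, stop) row per segment.

    The row layout is an assumption, not the actual joinedCoordinates schema.
    """
    return [(gene_id, start, stop) for start, stop in coordinates]


# A gene that spans the origin of a circular contig is written as a join.
coords, is_complement = parse_joined_location("join(4500..5000,1..120)")
rows = to_table_rows("gene_1", coords)
```

A gene with a single segment yields a one-element list, so downstream code can treat joined and non-joined genes uniformly.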

jpjarnoux and others added 30 commits March 7, 2024 14:41
JeanMainguy marked this pull request as ready for review April 16, 2024 15:53
jpjarnoux self-requested a review April 23, 2024 15:36

jpjarnoux (Member) left a comment

Great work on the code. I would just like to improve some docstrings and variable names.

Resolved review threads:
ppanggolin/utils.py (outdated)
ppanggolin/utils.py (outdated)
ppanggolin/utils.py (outdated)
ppanggolin/utils.py
ppanggolin/genome.py (outdated)
ppanggolin/annotate/annotate.py
ppanggolin/annotate/annotate.py
ppanggolin/annotate/annotate.py
ppanggolin/annotate/annotate.py
ppanggolin/formats/readBinaries.py
jpjarnoux (Member) commented May 27, 2024

Tests on a few pangenomes, before and after this change:

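| Species | D. pigrum (before) | D. pigrum (after) | P. aeruginosa (before) | P. aeruginosa (after) | E. coli (before) | E. coli (after) |
|---|---|---|---|---|---|---|
| Genes | 59 999 | 59 999 | 4 933 371 | 4 939 509 | 14 932 814 | 15 017 229 |
| Genomes | 32 | 32 | 802 | 802 | 3 190 | 3 190 |
| Families | 4 123 | 4 123 | 32 196 | 32 279 | 45 002 | 45 077 |
| Edges | 5 742 | 5 742 | 59 301 | 60 524 | 131 142 | 139 257 |
| Persistent | 1 436 | 1 436 | 5 181 | 5 028 | 3 137 | 3 135 |
| Shell | 477 | 477 | 5 982 | 6 286 | 7 250 | 7 234 |
| Cloud | 2 210 | 2 210 | 21 033 | 20 965 | 34 676 | 34 708 |
| RGP | 806 | 806 | 33 894 | 34 148 | 236 867 | 236 824 |
| Spots | 97 | 97 | 915 | 829 | 2 037 | 1 721 |
| Modules | 122 | 122 | 1 420 | 1 422 | 2 219 | 2 219 |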

The difference in RGP count comes from a difference in clustering, which affects partitioning.

jpjarnoux (Member)

The following image shows the differences in RGPs on the GCF_016904235.1_ASM1690423v1_CDS_2283 genome, as displayed in ProkSee.
[Image: cmp_proksee_RGP]
RGP66 is removed because the ECK0501 gene family changes partition from shell to persistent.
[Image: RGP66]
RGPs 12 and 19 merge into RGP9 because the ECK1494 gene family changes partition from persistent to shell.
[Image: RGP9]

jpjarnoux merged commit 002931f into dev Jun 3, 2024
4 checks passed
jpjarnoux deleted the AnnotJoin branch June 3, 2024 09:27