Incorrect E. coli sequences being represented by PanGraph (large dataset) #68

Open
TheHarshShow opened this issue Feb 7, 2024 · 5 comments

@TheHarshShow

Hi there,

We want to report an issue with a PanGraph that we generated on a dataset representing 1000 E. coli sequences. We believe that 64 of these sequences are not represented correctly by the PanGraph.

Because the sequence lengths also appear to be wrong, we could verify the issue manually by computing the length of one of the mismatching sequences. We did this by summing the lengths of the consensus sequences of the blocks on its path, adding the lengths of the insertions in the sequence, and subtracting the lengths of the deletions on the path.
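
The length calculation described above can be sketched as follows. This is only an illustration under stated assumptions: the `PathStep` structure, its field names, and the `path_length` helper are hypothetical stand-ins for values extracted from the graph, not PanGraph's actual JSON schema or API, and the numbers are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PathStep:
    """One block occurrence on a genome's path (hypothetical structure)."""
    block_id: str
    insertion_lengths: List[int] = field(default_factory=list)
    deletion_lengths: List[int] = field(default_factory=list)


def path_length(consensus_lengths: Dict[str, int], steps: List[PathStep]) -> int:
    """Sum consensus lengths along the path, add insertions, subtract deletions."""
    total = 0
    for step in steps:
        total += consensus_lengths[step.block_id]
        total += sum(step.insertion_lengths)
        total -= sum(step.deletion_lengths)
    return total


# Toy example with made-up numbers (not taken from the E. coli graph).
consensus_lengths = {"A": 1000, "B": 500}
steps = [
    PathStep("A", insertion_lengths=[3], deletion_lengths=[10]),
    PathStep("B", deletion_lengths=[2]),
]
print(path_length(consensus_lengths, steps))  # 1000 + 3 - 10 + 500 - 2 = 1491
```

Comparing this value against the length of the corresponding record in the input FASTA gives the kind of check reported in this issue.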

The PanGraph gives a length of 4800017 bases for the sequence ‘NZ_AP019856.1’. However, its true length is 4800098 bases.

We have uploaded the three relevant files to the following folder: https://drive.google.com/drive/folders/1JAliSaWokYX2i5KaUjQiOPnCdL_uyZqG?usp=sharing

We believe the mismatching sequences are: NZ_AP019856.1, NZ_CP054407.1, NZ_CP010219.1, NZ_CP036202.1, NZ_CP014583.1, NZ_CP027587.1, NZ_CP027325.1, NZ_CP013029.1, NZ_CP027459.1, NZ_CP050865.1, NZ_CP050862.1, NZ_CP027534.1, NZ_CP014316.1, NZ_CP015085.1, NZ_CP018970.1, NZ_CP023826.1, NZ_CP032201.1, NZ_CP023844.1, NZ_CP015138.1, NZ_CP018983.1, NZ_CP018991.1, NZ_CP049077.2, NZ_CP010876.1, NZ_CP036245.1, NZ_CP049085.2, NZ_CP035476.1, NZ_CP035477.1, NZ_CP014522.1, NZ_CP014495.1, NZ_CP024720.1, NZ_CP024717.1, NZ_CP021207.1, NZ_CP019008.1, NZ_CP019020.1, NZ_CP035498.1, NZ_CP053245.1, NZ_CP037449.1, NZ_CP048304.1, NZ_CP048920.1, NZ_CP040456.1, NZ_CP024886.1, NZ_CP051700.1, NZ_CP030111.1, NZ_AP022650.1, NZ_CP053251.2, NZ_CP051688.1, NZ_CP033762.1, NZ_CP019273.1, NZ_AP017610.1, NZ_CP033850.1, NZ_CP019029.1, NZ_CP015834.1, NZ_CP009859.1, NZ_CP040919.1, NZ_CP023366.1, NZ_CP041300.1, NZ_CP033605.1, NZ_CP041452.1, NZ_CP041448.1, NZ_CP028166.1, NZ_AP021896.1, NZ_CP031833.1

Thanks,
Harsh

@mmolari
Collaborator

mmolari commented Feb 7, 2024

Hi Harsh,
thanks for flagging the issue! I'll look into it. Could you also tell me which version of PanGraph was used to generate the graph, and the exact command? Could you, by any chance, also reproduce the issue with a smaller dataset? That would greatly help in debugging.
Cheers!
Marco

@TheHarshShow
Author

TheHarshShow commented Feb 7, 2024

Hi Marco,

Thanks for the quick response. We used version 0.7.3, and the command was: pangraph build --circular --upper-case -a 200 -b 30 input.fa > output.json

We understand that this dataset is very big, and we can try looking for issues in other datasets. Before that, however, could you confirm that we have identified the issue correctly? It is possible that we are misinterpreting some part of the JSON file. Could you compute the length of the sequence NZ_AP019856.1 from the PanGraph JSON file that we provided and tell us whether you agree with our analysis? The length that we found from the PanGraph was 4800017 bases.

Thanks,
Harsh

@mmolari
Collaborator

mmolari commented Feb 8, 2024

Hi Harsh,
sorry for the delay. Today we're having some trouble with the university cluster, and the graph is too big for me to open on my laptop.

I checked the full sequence reconstruction for all isolates in the graph. It looks like 64/1000 isolates have minor problems in their sequence. In particular I agree that isolate NZ_AP019856.1 should be 4800098 bp long but it is 4800017 bp in the graph.

Here is a full list of the sequences containing small inconsistencies:
--> isolate 'NZ_CP054407.1' incorrectly reconstructed
length of graph seq: 4954286
length of ref:       4954362
--> isolate 'NZ_CP027534.1' incorrectly reconstructed
length of graph seq: 5022404
length of ref:       5022408
--> isolate 'NZ_CP014316.1' incorrectly reconstructed
length of graph seq: 5081057
length of ref:       5081061
--> isolate 'NZ_CP014522.1' incorrectly reconstructed
length of graph seq: 5033278
length of ref:       5033359
--> isolate 'NZ_CP051688.1' incorrectly reconstructed
length of graph seq: 5328979
length of ref:       5329017
--> isolate 'NZ_CP019008.1' incorrectly reconstructed
length of graph seq: 4926068
length of ref:       4926149
--> isolate 'NZ_CP027325.1' incorrectly reconstructed
length of graph seq: 5135671
length of ref:       5135675
--> isolate 'NZ_CP015085.1' incorrectly reconstructed
length of graph seq: 5289894
length of ref:       5289898
--> isolate 'NZ_CP053245.1' incorrectly reconstructed
length of graph seq: 4675458
length of ref:       4675501
--> isolate 'NZ_CP024886.1' incorrectly reconstructed
length of graph seq: 5036886
length of ref:       5036925
--> isolate 'NZ_CP019020.1' incorrectly reconstructed
length of graph seq: 4913178
length of ref:       4913259
--> isolate 'NZ_AP022650.1' incorrectly reconstructed
length of graph seq: 5075871
length of ref:       5075911
--> isolate 'NZ_CP033850.1' incorrectly reconstructed
length of graph seq: 5231412
length of ref:       5231450
--> isolate 'NZ_CP018970.1' incorrectly reconstructed
length of graph seq: 5259383
length of ref:       5259387
--> isolate 'NZ_CP051700.1' incorrectly reconstructed
length of graph seq: 5053498
length of ref:       5053537
--> isolate 'NZ_CP014495.1' incorrectly reconstructed
length of graph seq: 5061740
length of ref:       5061821
--> isolate 'NZ_CP040919.1' incorrectly reconstructed
length of graph seq: 5209400
length of ref:       5209476
--> isolate 'NZ_CP035476.1' incorrectly reconstructed
length of graph seq: 5018238
length of ref:       5018242
--> isolate 'NZ_CP032201.1' incorrectly reconstructed
length of graph seq: 5107211
length of ref:       5107215
--> isolate 'NZ_AP021896.1' incorrectly reconstructed
length of graph seq: 4574662
length of ref:       4574715
--> isolate 'NZ_CP015138.1' incorrectly reconstructed
length of graph seq: 5009896
length of ref:       5009900
--> isolate 'NZ_CP018983.1' incorrectly reconstructed
length of graph seq: 4947528
length of ref:       4947532
--> isolate 'NZ_AP019856.1' incorrectly reconstructed
length of graph seq: 4800017
length of ref:       4800098
--> isolate 'NZ_CP023826.1' incorrectly reconstructed
length of graph seq: 5129464
length of ref:       5129468
--> isolate 'NZ_CP040456.1' incorrectly reconstructed
length of graph seq: 5234111
length of ref:       5234468
--> isolate 'NZ_CP015834.1' incorrectly reconstructed
length of graph seq: 5176712
length of ref:       5176750
--> isolate 'NZ_CP010219.1' incorrectly reconstructed
length of graph seq: 5102478
length of ref:       5102554
--> isolate 'NZ_CP050865.1' incorrectly reconstructed
length of graph seq: 4899865
length of ref:       4899869
--> isolate 'NZ_CP018991.1' incorrectly reconstructed
length of graph seq: 5434741
length of ref:       5434745
--> isolate 'NZ_CP023366.1' incorrectly reconstructed
length of graph seq: 4986674
length of ref:       4986712
--> isolate 'NZ_CP027445.1' incorrectly reconstructed
length of graph seq: 5196101
length of ref:       5196105
--> isolate 'NZ_CP019273.1' incorrectly reconstructed
length of graph seq: 5050946
length of ref:       5050984
--> isolate 'NZ_CP027587.1' incorrectly reconstructed
length of graph seq: 5235556
length of ref:       5235560
--> isolate 'NZ_CP033605.1' incorrectly reconstructed
length of graph seq: 5569767
length of ref:       5569804
--> isolate 'NZ_CP036202.1' incorrectly reconstructed
length of graph seq: 4834237
length of ref:       4834354
--> isolate 'NZ_CP041300.1' incorrectly reconstructed
length of graph seq: 5083034
length of ref:       5083072
--> isolate 'NZ_CP010876.1' incorrectly reconstructed
length of graph seq: 5010880
length of ref:       5010884
--> isolate 'NZ_CP031833.1' incorrectly reconstructed
length of graph seq: 4854454
length of ref:       4854459
--> isolate 'NZ_CP023844.1' incorrectly reconstructed
length of graph seq: 5144476
length of ref:       5144480
--> isolate 'NZ_CP019029.1' incorrectly reconstructed
length of graph seq: 5262936
length of ref:       5262974
--> isolate 'NZ_CP048304.1' incorrectly reconstructed
length of graph seq: 4959702
length of ref:       4959978
--> isolate 'NZ_CP049077.2' incorrectly reconstructed
length of graph seq: 5295147
length of ref:       5295151
--> isolate 'NZ_CP036245.1' incorrectly reconstructed
length of graph seq: 5187765
length of ref:       5187769
--> isolate 'NZ_AP017610.1' incorrectly reconstructed
length of graph seq: 4920790
length of ref:       4920828
--> isolate 'NZ_CP035477.1' incorrectly reconstructed
length of graph seq: 5061269
length of ref:       5061273
--> isolate 'NZ_CP013029.1' incorrectly reconstructed
length of graph seq: 5202846
length of ref:       5202850
--> isolate 'NZ_CP027459.1' incorrectly reconstructed
length of graph seq: 5253708
length of ref:       5253712
--> isolate 'NZ_CP014583.1' incorrectly reconstructed
length of graph seq: 5193732
length of ref:       5193734
--> isolate 'NZ_CP048920.1' incorrectly reconstructed
length of graph seq: 5356778
length of ref:       5357129
--> isolate 'NZ_CP010183.1' incorrectly reconstructed
length of graph seq: 4940358
length of ref:       4940434
--> isolate 'NZ_CP009859.1' incorrectly reconstructed
length of graph seq: 5310473
length of ref:       5310511
--> isolate 'NZ_CP030111.1' incorrectly reconstructed
length of graph seq: 4939419
length of ref:       4939457
--> isolate 'NZ_CP049085.2' incorrectly reconstructed
length of graph seq: 5255745
length of ref:       5255749
--> isolate 'NZ_CP035498.1' incorrectly reconstructed
length of graph seq: 5349746
length of ref:       5349824
--> isolate 'NZ_CP033762.1' incorrectly reconstructed
length of graph seq: 4977685
length of ref:       4977723
--> isolate 'NZ_CP021207.1' incorrectly reconstructed
length of graph seq: 5013732
length of ref:       5013813
--> isolate 'NZ_CP050862.1' incorrectly reconstructed
length of graph seq: 4901394
length of ref:       4901398
--> isolate 'NZ_CP041452.1' incorrectly reconstructed
length of graph seq: 4662326
length of ref:       4662393
--> isolate 'NZ_CP041448.1' incorrectly reconstructed
length of graph seq: 4860466
length of ref:       4860533
--> isolate 'NZ_CP024720.1' incorrectly reconstructed
length of graph seq: 5196876
length of ref:       5196957
--> isolate 'NZ_CP028166.1' incorrectly reconstructed
length of graph seq: 4815061
length of ref:       4815114
--> isolate 'NZ_CP024717.1' incorrectly reconstructed
length of graph seq: 5196875
length of ref:       5196956
--> isolate 'NZ_CP037449.1' incorrectly reconstructed
length of graph seq: 5080445
length of ref:       5080721
--> isolate 'NZ_CP053251.2' incorrectly reconstructed
length of graph seq: 5121194
length of ref:       5121232

It looks like in these cases there are a few tens of bp of mismatches. I'll investigate this further, but it might take some time, since these inconsistencies seem to appear in complicated edge cases that only occur when graphs are big and complex enough. We're working on a more robust re-implementation of some of the core functions of pangraph that will hopefully remove all of these inconsistencies once and for all. I'll keep you posted.
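
The per-isolate differences in the report above can be tabulated with a short script. This is a sketch that assumes the exact three-line layout shown in this comment; the `length_gaps` helper and the `REPORT_ENTRY` pattern are names made up for illustration and are not part of PanGraph.

```python
import re

# Matches one three-line entry of the report format shown in this comment.
REPORT_ENTRY = re.compile(
    r"isolate '(?P<name>[^']+)' incorrectly reconstructed\s+"
    r"length of graph seq:\s*(?P<graph>\d+)\s+"
    r"length of ref:\s*(?P<ref>\d+)"
)


def length_gaps(report: str) -> dict:
    """Map each isolate name to (reference length - graph length) in bp."""
    return {
        m["name"]: int(m["ref"]) - int(m["graph"])
        for m in REPORT_ENTRY.finditer(report)
    }


# Two entries copied from the report above.
sample = """\
--> isolate 'NZ_AP019856.1' incorrectly reconstructed
length of graph seq: 4800017
length of ref:       4800098
--> isolate 'NZ_CP027534.1' incorrectly reconstructed
length of graph seq: 5022404
length of ref:       5022408
"""
print(length_gaps(sample))  # {'NZ_AP019856.1': 81, 'NZ_CP027534.1': 4}
```

A positive value means the sequence reconstructed from the graph is shorter than the reference.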

In the meantime thanks again for your feedback!

Marco

@TheHarshShow
Author

Hi Marco,

Thanks for confirming this issue. We will check other datasets for mismatches and let you know if we find any, in case that helps with debugging.

Thanks,
Harsh

@mmolari
Collaborator

mmolari commented Feb 10, 2024

In case it is useful here, we added the command line option --test to the build command. With this flag the program checks the consistency of the graph by verifying that the input genomes can be exactly reconstructed from the output graph, and fails if they cannot. If the build succeeds with this option, you can be sure that the graph is consistent. Note, however, that no graph is output if the consistency checks fail.
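
As a rough sketch, the flag could be wired into a scripted build like this. The options are the ones quoted earlier in this thread plus --test; the helper names `build_command` and `build_graph` are hypothetical. The sketch also assumes that `pangraph` is on the PATH, that --test can be placed before the input file, and that a failed check is signalled by a non-zero exit code, none of which is confirmed here.

```python
import subprocess
from typing import List


def build_command(input_fasta: str, test: bool = True) -> List[str]:
    """Assemble the build command from this thread, optionally with --test."""
    cmd = ["pangraph", "build", "--circular", "--upper-case", "-a", "200", "-b", "30"]
    if test:
        cmd.append("--test")  # option position is an assumption
    cmd.append(input_fasta)
    return cmd


def build_graph(input_fasta: str, output_json: str) -> bool:
    """Run the build, writing stdout to output_json.

    Returns True when the process exits with code 0. Treating a non-zero
    exit code as a failed consistency check is an assumption.
    """
    with open(output_json, "w") as out:
        result = subprocess.run(build_command(input_fasta), stdout=out)
    return result.returncode == 0


print(" ".join(build_command("input.fa")))
```

Because no graph is written when the checks fail, the output file should not be used whenever `build_graph` returns False.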

Thank you for all of the feedback!

Marco
