Mismatching SARS-CoV-2 Sequences #65

Open
TheHarshShow opened this issue Jan 3, 2024 · 3 comments

Comments

@TheHarshShow

Hi there,

I created a PanGraph of 200 SARS-CoV-2 sequences using FASTA sequences as input, and it seems that eleven of them are represented incorrectly in the JSON file. I have uploaded the data here. The original FASTA file is sars_200_orig.fa, the sequences as represented in the graph (extracted by me) are in sars_200_pangraph.fa, and the PanGraph JSON file is sars_200.json. The sequences that we believe do not match are:

England/BRBR-2B7C38D/2021|OV263009.1|2021-11-22
IMS-10178-CVDP-0E892CAB-4101-45AD-A5AB-82C23A77B85B|OX112182.1|2021-10-14
Denmark/DCGC-179132/2021|OW435830.1|2021-10-02
SouthAfrica/NHLS-UCT-GS-AD95/2021|OM739820.1|2021-08-30
IMS-10150-CVDP-7250DCF0-8B47-40DA-89AF-8E56669A8CB5|OU964784.1|2021-10-12
USA/CA-CDC-FG-175698/2021|OL666921.1|2021-11-18
Denmark/DCGC-196557/2021|OW446795.1|2021-10-24
Denmark/DCGC-151767/2021|OV830941.1|2021-08-12
USA/MA-CDCBI-CRSP_4TOCNN2I3HYX32WD/2021|MZ752955.1|2021-08-02
England/LOND-12FD57B/2021|OU391062.1|2021-05-23
RNA|OX380648.1|2022-10-22

Can you please look into it?
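
For anyone who wants to repeat the comparison, here is a minimal sketch of how the two FASTA files could be checked against each other. It is not the script actually used for this report: the function names (`parse_fasta`, `mismatching_ids`, `compare_files`) are hypothetical, the full header line is assumed to be the record identifier, and the comparison is assumed to be case-insensitive.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:]
            chunks = []
        elif name is not None:
            # Assumption: case differences are not meaningful, so normalize.
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


def mismatching_ids(original: Dict[str, str],
                    reconstructed: Dict[str, str]) -> List[str]:
    """Return headers whose sequence is missing or differs in `reconstructed`."""
    return sorted(
        name for name, seq in original.items()
        if reconstructed.get(name) != seq
    )


def compare_files(original_path: str, reconstructed_path: str) -> List[str]:
    """Convenience wrapper that reads both FASTA files from disk."""
    with open(original_path) as f_orig, open(reconstructed_path) as f_rec:
        return mismatching_ids(parse_fasta(f_orig), parse_fasta(f_rec))
```

Calling `compare_files("sars_200_orig.fa", "sars_200_pangraph.fa")` would list the headers that differ between the two files.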

Best,
Harsh

@mmolari
Collaborator

mmolari commented Jan 3, 2024

Hi Harsh,
Thank you once more for your feedback! I'll look into it and report back.
Have a nice start to the year!
Marco

@mmolari
Collaborator

mmolari commented Jan 3, 2024

Hi Harsh,
I did a preliminary test by building the graph with pangraph v.0.7.3 and the command:

pangraph build --test sars_200_orig.fa > graph.json

With this command I cannot reproduce the error: in the output graph, all sequences are correctly reconstructed. When you get the error, do you use any other specific flags to build the graph?
I also looked into the graph you shared, and I agree with you: exactly the sequences that you mentioned are incorrectly reconstructed.

PS: another thing worth mentioning is that some sequences contain long stretches of N nucleotides; in a few cases these cover 10%-30% of the sequence. PanGraph does not work well with that: it might cause excessive fragmentation of the graph or problems with alignment. If these sequences are not central to the analysis, you may want to filter them out.
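
A minimal sketch of such a filter, assuming the sequences are already loaded into a dictionary of header to sequence. The helper names and the 10% default cutoff are hypothetical choices for illustration, not something PanGraph prescribes; the right threshold depends on the analysis.

```python
import re
from typing import Dict


def n_fraction(seq: str) -> float:
    """Fraction of positions in `seq` that are N (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def longest_n_run(seq: str) -> int:
    """Length of the longest uninterrupted stretch of N in `seq`."""
    return max((len(m.group()) for m in re.finditer(r"N+", seq.upper())),
               default=0)


def filter_by_n_content(records: Dict[str, str],
                        max_fraction: float = 0.1) -> Dict[str, str]:
    """Keep only records whose N fraction is at most `max_fraction`.

    The 0.1 default is an assumed cutoff, chosen only for illustration.
    """
    return {
        name: seq for name, seq in records.items()
        if n_fraction(seq) <= max_fraction
    }
```

Reporting `longest_n_run` alongside the fraction can help decide whether a sequence has a few scattered ambiguous bases or one large masked region.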

@TheHarshShow
Author

Hi Marco,

Thanks a lot for looking into this issue. I will investigate it again with my team and let you know whether we did anything wrong. Also, thanks for pointing out the problem with sequences containing many N's. I believe this might have contributed to some inefficient PanGraphs representing 20k SARS-CoV-2 sequences. I'll investigate this too.

Thanks,
Harsh
