Specific annotation requirements for viral TPA submissions #187

Open
2 tasks
taltman opened this issue Jul 10, 2020 · 9 comments
@taltman
Collaborator

taltman commented Jul 10, 2020

From the Handbook:
https://www.ncbi.nlm.nih.gov/books/NBK53714/#gbankquickstart.i_have_viral_sequence_da

  • CDS feature(s) with product name(s), nucleotide locations, and amino acid translation(s) of all coding regions (showing start and stop codons, if present)
  • Gene symbol(s), if known
The information listed above should be applied to any virus submission.

If no coding region is present, provide another description of the sequence

If any of this information is not known, inform us at the time of your submission.

See an online example of viral sequence submission annotation.

Furthermore, the FASTA deflines should clearly indicate the primary sequence identifier:

>SEQ1 [org=coronavirus ABC123] [SRA=SRRXXXXXX1,SRRXXXXXX2]
ATGGTGTTTATAACACACACCTTAACCTACGACCTGGCAATCTTCTTGGCCACCTTAATAACGGCCTTTG
TAATTTACATAAAATGGGTGTACACATACTGGCAAAGAAAAGGTCTTGCTACAGAACCAACAGTCGTCCC
...
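
As a quick sanity check on deflines of this shape, here is a minimal Python sketch that pulls out the primary sequence identifier and the bracketed `[key=value]` modifiers. The function name `parse_defline` is hypothetical, and the sketch only assumes the layout shown in the example above (identifier first, then bracketed pairs); it is not NCBI's own validation logic. It tolerates whitespace after `>`.

```python
import re


def parse_defline(defline):
    """Split a FASTA defline into (sequence_id, modifiers).

    Hypothetical helper: assumes the identifier is the first token after '>'
    and that modifiers are written as [key=value].
    """
    if not defline.startswith(">"):
        raise ValueError("a FASTA defline must start with '>'")
    body = defline[1:].strip()
    # The identifier is the first whitespace-delimited token.
    seq_id = body.split(None, 1)[0] if body else ""
    if not seq_id or seq_id.startswith("["):
        raise ValueError("defline has no primary sequence identifier")
    # Collect every bracketed key=value pair as a modifier.
    pairs = re.findall(r"\[([^=\]]+)=([^\]]*)\]", body)
    modifiers = {key.strip(): value.strip() for key, value in pairs}
    return seq_id, modifiers


seq_id, mods = parse_defline(
    ">SEQ1 [org=coronavirus ABC123] [SRA=SRRXXXXXX1,SRRXXXXXX2]"
)
sra_runs = mods["SRA"].split(",")
```

A defline with no identifier before the first bracket raises an error, which is the case worth catching before submission.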

Double check that the files fulfill the following requirements:

https://www.ncbi.nlm.nih.gov/books/NBK53702/#gbankquickstart.can_you_give_me_stepbyst_1

https://www.ncbi.nlm.nih.gov/books/NBK53711/#gbankquickstart.what_do_you_mean_by_feat

@taltman taltman created this issue from a note in Serratus Annotation (To do) Jul 10, 2020
@taltman taltman self-assigned this Jul 10, 2020
@rcedgar
Collaborator

rcedgar commented Jul 10, 2020

"all coding regions (showing start and stop codons, if present)": finding start and stop codons is difficult with CoV; this is the main reason I gave up trying to do automated annotation myself. Finding a known gene is relatively easy with a local protein alignment (say, BLAST or an HMM), but extending the alignment out to the start or stop is hard unless the genome is very close to something that is already very well annotated, with at most a few SNPs in the gene. This is further complicated by frameshifts in some CDSs due to programmed ribosomal frameshifting. This is a very tricky genome to annotate.
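
To make the extension problem concrete, here is a naive Python sketch of the step described above: take an in-frame alignment hit and walk outward to the first downstream stop codon and to the most upstream in-frame ATG that is not preceded by a stop. The function `extend_to_orf` and its "longest ORF" rule are illustrative assumptions, not the method of any tool mentioned in this thread. It ignores frameshifts, non-ATG starts, leaky scanning, and the reverse strand, which is exactly why its output is hard to trust on a diverged genome.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extend_to_orf(genome, hit_start, hit_end):
    """Extend an in-frame hit [hit_start, hit_end) to a candidate ORF.

    Coordinates are 0-based, half-open, forward strand.
    Returns (start, end, has_start, has_stop).
    """
    genome = genome.upper()
    if (hit_end - hit_start) % 3 != 0:
        raise ValueError("hit length must be a multiple of 3")

    # Upstream: step back codon by codon until an in-frame stop,
    # remembering the most upstream ATG seen (longest-ORF heuristic).
    start, has_start = hit_start, False
    pos = hit_start
    while pos >= 0:
        codon = genome[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon == "ATG":
            start, has_start = pos, True
        pos -= 3

    # Downstream: step forward until the first in-frame stop codon.
    end, has_stop = hit_end, False
    pos = hit_end
    while pos + 3 <= len(genome):
        if genome[pos:pos + 3] in STOP_CODONS:
            end, has_stop = pos + 3, True
            break
        pos += 3
    else:
        end = pos  # ran off the sequence: the 3' end is partial

    return start, end, has_start, has_stop


# Toy sequence: TAA GCC ATG AAA CCC GGG TTT TAG CC, with a hit on CCCGGG.
toy = "TAAGCCATGAAACCCGGGTTTTAGCC"
orf = extend_to_orf(toy, 12, 18)
```

When `has_start` or `has_stop` comes back false, the feature would have to be reported as partial rather than guessed.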

@rcedgar
Collaborator

rcedgar commented Jul 10, 2020

Figuring out CDS and gene symbols is also tricky because of the polyprotein, which is cleaved into multiple genes. In these cases, both the polyprotein before cleavage and the genes after cleavage should be annotated (I think...). I'm assuming cleaved genes lack start and/or stop codons; instead they have a cleavage site, which should also be annotated. Not sure, I never fully figured out how these things are represented in GB records.

@rcedgar
Collaborator

rcedgar commented Jul 10, 2020

Here is a nice figure showing the complexity of an example CoV genome (SARS-CoV-1). Note the multiple levels of overlapping and nested ORFs and CDSs, with a frameshift in one of the most important genes (RdRp). The figure shows ~14 cleavage sites which must be identified. When I saw stuff like this, I figured it would be impossible to automate annotation unless there was an existing CoV-specific tool. Now I suspect that such a tool is impossible anyway because there is too much variation in genome structure. CoV-2 has suspected leaky scanning towards the 3' end which was not present in CoV-1 AFAIK, to add one more complication to getting the translations.

https://viralzone.expasy.org/30

[image: SARS-CoV-1 genome organization]
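
For the frameshifted gene, a feature location of the form `join(start..slip,slip..end)` re-reads one base, so the coding sequence cannot be taken as a single slice. A minimal Python sketch under that assumption follows; the function name and the toy coordinates are hypothetical and do not come from any real record.

```python
def join_with_minus1_frameshift(genome, start, slip, end):
    """Build the coding sequence for join(start..slip, slip..end).

    Coordinates are 1-based and inclusive, as in a feature table.
    The base at position `slip` is read twice (a -1 frameshift).
    """
    if not (1 <= start <= slip <= end <= len(genome)):
        raise ValueError("coordinates out of order or out of range")
    first = genome[start - 1:slip]   # start..slip
    second = genome[slip - 1:end]    # slip..end, re-reading base `slip`
    return first + second


# Toy sequence, not real viral data.
toy_genome = "ATGGCCTTTCGACGGTGA"
cds = join_with_minus1_frameshift(toy_genome, 1, 9, 17)
```

The joined sequence is one base longer than the genomic span it covers, which is a simple check that the slip position was applied.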

@ababaian
Owner

We would need to find protease cut sites specifically to go sub-ORF. I think ORF1a and ORF1b would be a good starting place in this respect, and the protease cut sites, which should be conserved (I hope), would give us the info on RdRP etc. This data is not annotated well in GenBank records as it is, so examples are hard to find.
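
If cut sites were known in polyprotein residue coordinates, turning them into nucleotide ranges for the cleavage products is simple arithmetic. The Python sketch below shows it for the easy case of a single contiguous CDS with no frameshift; `mature_peptide_coords` and its arguments are hypothetical names. INSDC feature tables have a `mat_peptide` feature key for cleavage products, which is where ranges like these would go.

```python
def mature_peptide_coords(cds_start, n_residues, cleavage_after):
    """Return 1-based inclusive nucleotide (start, end) per cleavage product.

    cds_start:      1-based position of the first base of the polyprotein CDS
    n_residues:     polyprotein length in amino acids (stop codon excluded)
    cleavage_after: 1-based residue indices after which the protease cuts
    Assumes one contiguous CDS with no frameshift (an illustrative simplification).
    """
    cuts = sorted(set(cleavage_after))
    if cuts and (cuts[0] < 1 or cuts[-1] >= n_residues):
        raise ValueError("cleavage sites must fall inside the polyprotein")
    bounds = [0] + cuts + [n_residues]
    coords = []
    for first, last in zip(bounds, bounds[1:]):
        # This product spans residues first+1 .. last.
        nt_start = cds_start + 3 * first
        nt_end = cds_start + 3 * last - 1
        coords.append((nt_start, nt_end))
    return coords


# Toy example: a 10-residue polyprotein starting at base 101, cut after residues 3 and 7.
products = mature_peptide_coords(101, 10, [3, 7])
```

None of the products except the first begins with a start codon, and none carries the stop codon, since the stop is excluded from `n_residues`.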

@rcedgar
Collaborator

rcedgar commented Jul 10, 2020

Is cleavage well enough understood to know how well conserved the sites are, or whether approximate sequence conservation necessarily implies cleavage? If the genome has diverged 1%, 2%, ... 5%, ... 10% from a genome with a known cleavage site, when do you, or do you not, believe the site is conserved?

@taltman
Collaborator Author

taltman commented Jul 10, 2020

@rcedgar I've posted screenshots of the VADR annotation to Slack a few times. We're getting comparable annotations to the standard NCBI annotations for the SARS-CoV-2 reference sequence. It won't be as accurate as someone hand-annotating, but it's the best we can do in a high-throughput fashion.

@taltman
Collaborator Author

taltman commented Jul 10, 2020

Especially for distantly-related CoVs.

@rcedgar
Collaborator

rcedgar commented Jul 10, 2020

Sure, I saw the screenshots and it looks like the best-known genes are in roughly the right place, but I don't know if GenBank will find this acceptable. I don't see how any automated method can reliably meet the requirements per their documentation -- we don't know the start, stop, cleavage sites etc. etc. and we don't have reliable translations for any of the genes AFAICS.

@taltman
Collaborator Author

taltman commented Jul 10, 2020

This all might be academic because GenBank might show us the door.

VADR is a tool used by NCBI for evaluating viral genome submissions, so I think we have a better chance that it will produce the content that they are looking for. I don't think the annotation has to be flawless to pass muster; just demonstrate "due diligence" in trying to do a reasonable job.
