Specific annotation requirements for viral TPA submissions #187

taltman · 2020-07-10T10:12:45Z

From the Handbook:
https://www.ncbi.nlm.nih.gov/books/NBK53714/#gbankquickstart.i_have_viral_sequence_da

CDS feature(s) with product name(s), nucleotide locations, and amino acid translation(s) of all coding regions (showing start and stop codons, if present)
Gene symbol(s), if known

The information listed above should be applied to any virus submission.

If no coding region is present, provide another description of the sequence

If any of this information is not known, inform us at the time of your submission.

See an online example of viral sequence submission annotation.

Furthermore, the FASTA deflines should clearly indicate the primary sequence identifier:

>SEQ1 [org=coronavirus ABC123] [SRA=SRRXXXXXX1,SRRXXXXXX2]
ATGGTGTTTATAACACACACCTTAACCTACGACCTGGCAATCTTCTTGGCCACCTTAATAACGGCCTTTG
TAATTTACATAAAATGGGTGTACACATACTGGCAAAGAAAAGGTCTTGCTACAGAACCAACAGTCGTCCC
...
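
A quick way to sanity-check deflines before submission is to pull out the primary identifier and the bracketed modifiers. This is only a minimal sketch: `parse_defline` is a hypothetical helper (not part of any NCBI tool), and it assumes the identifier is the first whitespace-delimited token and that modifiers use the `[name=value]` form shown in the example above.

```python
import re

def parse_defline(defline):
    """Split a FASTA defline into (sequence ID, {modifier: value})."""
    if not defline.startswith(">"):
        raise ValueError("defline must start with '>'")
    body = defline[1:].strip()
    if not body:
        raise ValueError("defline has no sequence ID")
    # Assumption: the primary identifier is the first whitespace-delimited token.
    seq_id = body.split()[0]
    # Collect bracketed [name=value] pairs into a dictionary.
    pairs = re.findall(r"\[([^=\]]+)=([^\]]*)\]", body)
    modifiers = {name.strip(): value.strip() for name, value in pairs}
    return seq_id, modifiers

seq_id, mods = parse_defline(">SEQ1 [org=coronavirus ABC123] [SRA=SRRXXXXXX1,SRRXXXXXX2]")
print(seq_id, mods)
```

Splitting `mods["SRA"]` on commas then gives the individual run accessions.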

Double check that the files fulfill the following requirements:

https://www.ncbi.nlm.nih.gov/books/NBK53702/#gbankquickstart.can_you_give_me_stepbyst_1

https://www.ncbi.nlm.nih.gov/books/NBK53711/#gbankquickstart.what_do_you_mean_by_feat


rcedgar · 2020-07-10T13:54:04Z

"all coding regions (showing start and stop codons, if present)" Finding start and stop codons is difficult with Cov, this is the main reason I gave up trying to do automated annotation myself. Finding a known gene is relatively easy with a local protein alignment (say, BLAST or an HMM), but extending the alignment out to the start or stop is hard unless the genome is very close to something which is already very well annotated -- at most a few SNPs in the gene. This is further complicated by frameshifts in some CDSs due to polymerase slippage. This is a very tricky genome to annotate.

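To make the extension problem concrete, here is a minimal sketch of the naive approach: starting from an in-frame alignment hit, walk downstream to the first stop codon and upstream collecting every ATG until an in-frame stop. The function name, the toy sequence, and the 0-based forward-strand coordinates are all assumptions for illustration; this is not what BLAST, an HMM search, or VADR does, and it ignores frameshifts entirely.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def extend_hit(genome, hit_start):
    """Return (candidate_starts, end) for the reading frame of an aligned hit.

    hit_start is a 0-based forward-strand position on a codon boundary of
    the hit. end is the 0-based exclusive end of the first in-frame stop
    codon at or downstream of hit_start, or None if no stop is found.
    """
    end = None
    for i in range(hit_start, len(genome) - 2, 3):
        if genome[i:i + 3] in STOP_CODONS:
            end = i + 3
            break

    # Walk upstream codon by codon until an in-frame stop or the sequence
    # edge, collecting every ATG as a possible start.
    candidate_starts = []
    i = hit_start
    while i >= 0 and genome[i:i + 3] not in STOP_CODONS:
        if genome[i:i + 3] == "ATG":
            candidate_starts.append(i)
        i -= 3
    return sorted(candidate_starts), end

# Toy sequence (made up): stop, ATG, AAA, ATG, then the "hit" at GGG.
toy = "CC" + "TAA" + "ATG" + "AAA" + "ATG" + "GGG" + "CCC" + "TGA" + "TT"
starts, end = extend_hit(toy, 14)
print(starts, end)
```

Even on this toy sequence the upstream walk returns two in-frame ATGs, and nothing in the sequence alone says which one is the real start. That ambiguity is the hard part described above.
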
rcedgar · 2020-07-10T13:57:59Z

Figuring out CDS and gene symbols is also tricky because of the polyprotein, which is cleaved into multiple mature proteins. In these cases, both the polyprotein before cleavage and the products after cleavage should be annotated (I think...). I'm assuming the cleaved products lack start and/or stop codons and instead have a cleavage site, which should also be annotated; not sure, I never fully figured out how these things are represented in GB records.
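
On representation: as far as I understand the INSDC feature table, a frameshifted polyprotein is written as a single CDS with a `join(...)` location and a `/ribosomal_slippage` qualifier, and the cleavage products are separate `mat_peptide` features. The sketch below shows how such a `join` location maps to nucleotides. `extract_join` is a hypothetical helper, the genome is a made-up toy string, and only simple forward-strand `a..b` spans are handled.

```python
import re

def extract_join(genome, location):
    """Return the nucleotides covered by a 'join(a..b,c..d)' location.

    Coordinates are 1-based and inclusive, forward strand only.
    """
    spans = re.findall(r"(\d+)\.\.(\d+)", location)
    return "".join(genome[int(a) - 1:int(b)] for a, b in spans)

# Toy genome (made up for illustration), not a real coronavirus sequence.
genome = "ATGAAACCCGGGTAA"

# The second span starts on the last base of the first span: that base is
# read twice, which is how a -1 slippage shows up in the location string.
cds = extract_join(genome, "join(1..6,6..11)")
print(cds, len(cds) % 3)
```

The joined sequence is a whole number of codons even though neither span lines up with the other's frame on the genome, which is why a plain start-to-stop ORF scan cannot recover this kind of CDS.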

rcedgar · 2020-07-10T16:48:43Z

Here is a nice figure showing the complexity of an example Cov genome (SARS-CoV-1). Note the multiple levels of overlapping and nested ORFs and CDSs with a frameshift in one of the most important genes (RdRp). The figure shows ~14 cleavage sites which must be identified. When I saw stuff like this, I figured it would be impossible to automate annotation unless there was an existing Cov-specific tool. Now I suspect that such a tool is impossible anyway because there is too much variation in genome structure. Cov-2 has suspected leaky scanning towards the 3' end which was not present in Cov-1 AFAIK, to add one more complication to getting the translations.

https://viralzone.expasy.org/30

ababaian · 2020-07-10T17:23:13Z

We would specifically need to find protease cut sites to go sub-ORF. I think ORF1a and ORF1b would be a good starting place in this respect, and protease cut sites, which should be conserved (I hope), would give us the info on RdRP etc. This data is not annotated well in GenBank records as it is, so examples are hard to find.

rcedgar · 2020-07-10T17:30:30Z

Is cleavage well enough understood to know how well conserved the sites are, or whether approximate sequence conservation necessarily implies cleavage? If the genome has diverged 1%, 2%, ... 5% ... 10% from a genome with a known cleavage site, at what point do you stop believing the site is conserved?

taltman · 2020-07-10T18:21:44Z

@rcedgar I've posted screenshots of the VADR annotation to Slack a few times. We're getting comparable annotations to the standard NCBI annotations for the SARS-CoV-2 reference sequence. It won't be as accurate as someone hand-annotating, but it's the best we can do in a high-throughput fashion.

taltman · 2020-07-10T18:22:01Z

Especially for distantly-related CoVs.

rcedgar · 2020-07-10T18:28:32Z

Sure, I saw the screenshots and it looks like the best-known genes are in roughly the right place, but I don't know if GenBank will find this acceptable. I don't see how any automated method can reliably meet the requirements per their documentation -- we don't know the start, stop, cleavage sites etc. etc. and we don't have reliable translations for any of the genes AFAICS.

taltman · 2020-07-10T18:31:58Z

This all might be academic because GenBank might show us the door.

VADR is a tool used by NCBI for evaluating viral genome submissions, so I think we have a better chance that it will produce the content that they are looking for. I don't think the annotation has to be flawless to pass muster; just demonstrate "due diligence" in trying to do a reasonable job.

taltman created this issue from a note in Serratus Annotation (To do) Jul 10, 2020

taltman self-assigned this Jul 10, 2020

ababaian mentioned this issue Dec 9, 2020

Create a checklist for GenBank submission -- begin to automate for high throughput #106

Open
