How similar do inputs need to be? #8

mtisza1 · 2020-02-09T18:25:06Z

Do you have a sense of how similar an input should be to the nearest available reference in NT?
Is it more like 99% ANI or 90% ANI?
I couldn't figure it out from the code or the paper. Sorry if I missed something.

Thanks,

Mike

rcs333 · 2020-02-09T23:17:23Z

Good to hear from you Mike!

I’ve never done hard testing like that, so I don’t have exact numbers for you. The hard requirements are 1) the reference needs to have exactly the same names and number of genes as what you’re trying to annotate and 2) the start codons for the reference ORFs need to align to real start codons on your input - so if your reference start codons are aligning to gap sequences in your input, it may not work. It’s really not about ANI at all once you’ve found a starting and stopping spot for each ORF. The code is completely agnostic to what’s in the middle of ORFs as long as it can find working start and stop codons in the same linear order as your reference. And there’s actually some flexibility in finding working start/stop codons, i.e. they don’t have to align exactly all of the time.
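To make requirement 2 concrete, here is a minimal sketch of the kind of check described above. It is not VAPiD's actual code: the function names, the `ref_starts` mapping, and the choice to accept only `ATG` as a start codon are assumptions made for illustration. It takes a pairwise alignment (two equal-length strings with `-` for gaps) and reports, for each reference gene, whether the reference start position lands on a start codon, a gap, or something else in the input. It is also stricter than the behaviour described above, since it allows no wiggle room around the aligned position.

```python
# Illustrative sketch only; names and rules here are assumptions, not VAPiD's API.
START_CODONS = {"ATG"}  # assumption: only the canonical start codon is accepted


def ref_pos_to_column(aligned_ref, ref_pos):
    """Map a 0-based ungapped reference position to its alignment column."""
    seen = -1
    for col, base in enumerate(aligned_ref):
        if base != "-":
            seen += 1
            if seen == ref_pos:
                return col
    raise ValueError("reference position is past the end of the sequence")


def check_start_codons(aligned_ref, aligned_query, ref_starts):
    """Classify each reference gene start as 'ok', 'gap', or 'no_start'.

    ref_starts maps gene name -> 0-based start position in the ungapped reference.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")
    results = {}
    for gene, ref_pos in ref_starts.items():
        col = ref_pos_to_column(aligned_ref, ref_pos)
        codon = aligned_query[col:col + 3].upper()
        if "-" in codon:
            results[gene] = "gap"       # reference start aligns to a gap
        elif codon in START_CODONS:
            results[gene] = "ok"        # a real start codon at the aligned spot
        else:
            results[gene] = "no_start"  # bases present, but not a start codon
    return results


if __name__ == "__main__":
    # Toy alignment: geneA's start is conserved, geneB's start falls in a gap.
    ref = "CCATGAAATAGCCATGCCCTAA"
    qry = "CCATGAGATAGCC---CCCTAA"
    print(check_start_codons(ref, qry, {"geneA": 2, "geneB": 13}))
```

In the toy alignment the input differs from the reference inside geneA, but its start codon is intact, so the check passes for that gene regardless of identity in the ORF body.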

AFAIK most viral species that I was annotating were at least 80% ANI to the references. The big exception to this is the enteroviruses which can be absolutely bonkers in terms of % ANI - but enteroviruses just have one giant ORF so that’s incredibly easy to annotate automatically.

In general, if you’re trying to annotate a virus that has the same genome organization and is similar enough to what you want to use as a reference that it’s sane to actually call it a reference, it should work. I can pretty much promise that it should work for anything above 90% ANI. If it doesn’t, you’ll have to manually annotate your first sample, but then you can use that as the reference for all the rest!

Let me know if this doesn’t answer your questions or if you have any others. I’m happy to answer any questions or otherwise help out with whatever you’re trying to do with this code! :)

Ryan

mtisza1 · 2020-02-10T20:26:50Z

Thanks Ryan, good to hear from you as well!

I'm developing code for annotation of more divergent viruses from metagenomes. Inevitably, I run across viruses highly similar to those already in GenBank. I think my annotations are good, but in many cases using VAPiD would be better for well-described viruses and related strains, as the annotations include biological (not just computational) insight. So I largely want to know when to direct people (including myself) to VAPiD.

I'm also thinking that VAPiD would be useful for annotating strains during new virus outbreaks such as the one we have on our hands at the moment.

rcs333 · 2020-02-12T08:00:51Z

Great! Yes, I agree completely. VAPiD is basically designed for exactly what you’re describing: when you’ve sequenced many samples of a well-described virus, this is a good tool to consider for annotating all your sequences. However, with highly divergent or completely novel viruses, VAPiD will have completely undefined behavior.

Yeah, I was thinking the same thing! I need to make sure the ribosomal slippage is handled correctly for 2019-nCoV and will update the README when I’m sure it’ll work.
