How similar do inputs need to be? #8

Open
mtisza1 opened this issue Feb 9, 2020 · 3 comments

Comments

@mtisza1

mtisza1 commented Feb 9, 2020

Do you have a sense of how similar an input should be to the nearest available reference in NT?
Is it more like 99% ANI or 90% ANI?
I couldn't figure it out from the code or the paper. Sorry if I missed something.

Thanks,

Mike

@rcs333
Owner

rcs333 commented Feb 9, 2020

Good to hear from you Mike!

I’ve never done hard testing like that so I don’t have exact numbers for you. The hard requirements are 1) the reference needs to have exactly the same gene names and number of genes as what you’re trying to annotate and 2) the start codons for the reference ORFs need to align to real start codons on your input - so if your reference start codons are aligning to gap sequences in your input it may not work. It’s really not about ANI at all once you’ve found a starting and stopping spot for each ORF. The code is completely agnostic to what’s in the middle of ORFs as long as it can find working start-stop codons in the same linear order as your reference. And there’s actually some flexibility in finding working start/stop codons - i.e. they don’t have to align exactly all of the time.
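To make the start-codon requirement concrete, here is a minimal sketch of the idea, not VAPiD's actual code. It assumes you already have a gapped pairwise alignment of reference and input as two equal-length strings, and all function names (`map_ref_to_query`, `find_orf`, `transfer_orfs`) are hypothetical. It requires an exact `ATG` at the mapped position, so it does not reproduce the flexibility mentioned above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons


def map_ref_to_query(ref_aln, qry_aln, ref_pos):
    """Map a 0-based ungapped reference position to the 0-based ungapped
    input position in the same alignment column.
    Returns None if the input has a gap in that column."""
    ref_i = -1
    qry_i = -1
    for r, q in zip(ref_aln, qry_aln):
        if r != "-":
            ref_i += 1
        if q != "-":
            qry_i += 1
        if r != "-" and ref_i == ref_pos:
            return None if q == "-" else qry_i
    return None


def find_orf(query, start):
    """Return (start, end) for an ORF beginning at `start`, with `end`
    exclusive and including the stop codon, or None if there is no ATG
    at `start` or no in-frame stop codon downstream."""
    if query[start:start + 3] != "ATG":
        return None
    for i in range(start + 3, len(query) - 2, 3):
        if query[i:i + 3] in STOP_CODONS:
            return (start, i + 3)
    return None


def transfer_orfs(ref_aln, qry_aln, ref_starts):
    """For each named reference ORF start, try to locate an ORF at the
    aligned position in the input. The ORF interior is never compared."""
    query = qry_aln.replace("-", "")
    result = {}
    for name, ref_start in ref_starts.items():
        q_start = map_ref_to_query(ref_aln, qry_aln, ref_start)
        result[name] = None if q_start is None else find_orf(query, q_start)
    return result


if __name__ == "__main__":
    # Interior codons differ completely, but start and stop still line up.
    print(transfer_orfs("CCATGAAATTTTAACC", "CCATGCCCGGGTGACC", {"geneA": 2}))
    # The reference start codon aligns to a gap in the input.
    print(transfer_orfs("CCATGAAATAACC", "CC---AAATAACC", {"geneA": 2}))
```

In the first call the ORF is found even though nothing between the start and stop codons matches; in the second the lookup fails because the reference start codon lands on a gap.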

AFAIK most viral species that I was annotating were at least 80% ANI to the references. The big exception to this is the enteroviruses which can be absolutely bonkers in terms of % ANI - but enteroviruses just have one giant ORF so that’s incredibly easy to annotate automatically.

In general, if you’re trying to annotate a virus that has the same genome organization and is similar enough to your intended reference that it’s sane to actually call it a reference, it should work. I can pretty much promise that it should work for anything above 90% ANI. If it doesn’t, then you’ll have to manually annotate your first sample, but then you can use that as the reference for all the rest!
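If you want a quick number to compare against that 90% figure, a rough sketch like the one below computes percent identity over the aligned, ungapped columns of a pairwise alignment. This is only a crude stand-in for ANI, which is normally computed with dedicated tools; the function name `percent_identity` and the choice to skip gapped columns are assumptions for illustration.

```python
def percent_identity(ref_aln, qry_aln):
    """Percent of matching bases over alignment columns where neither
    sequence has a gap. Inputs are equal-length gapped strings."""
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for r, q in zip(ref_aln.upper(), qry_aln.upper()):
        if r == "-" or q == "-":
            continue  # gapped columns are skipped, not counted as mismatches
        compared += 1
        if r == q:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))
```

Because gapped columns are skipped, a genome with large insertions or deletions can still score high here, so treat the result as a rough guide only.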

Let me know if this doesn’t answer your questions or if you have any others. I’m happy to answer any questions or otherwise help out with whatever you’re trying to do with this code! :)

Ryan

@mtisza1
Author

mtisza1 commented Feb 10, 2020

Thanks Ryan, good to hear from you as well!

I'm developing code for annotation of more divergent viruses from metagenomes. Inevitably, I run across viruses highly similar to those already in GenBank. I think my annotations are good, but in many cases using VAPiD would be better for well-described viruses and related strains, as the annotations include biological (not just computational) insight. So I largely want to know when to direct people (including myself) to VAPiD.

I'm also thinking that VAPiD would be useful for annotating strains during new virus outbreaks such as the one we have on our hands at the moment.

@rcs333
Owner

rcs333 commented Feb 12, 2020

Great! Yes, I agree completely. VAPiD is basically designed for exactly what you’re describing: when you’ve sequenced many samples of a well-described virus, it’s a good tool to consider for annotating all your sequences. However, with highly divergent or completely novel viruses, VAPiD will have completely undefined behavior.

Yeah, I was thinking the same thing! I need to make sure the ribosomal slippage is handled correctly for 2019-nCoV and will update the readme when I’m sure it’ll work.
