Develop Nido HMM library for genome annotations #227
Comments
We can probably start from eggNOG or something else initially. Are we interested in ORFs, mature proteins, or domains within the proteins? What level of resolution should we aim for?
Resolution = domain structure figure.
Nothing overlapping. Not mature proteins because we can't predict them in highly diverged genomes. So the answer is probably close to whatever Pfam considers to be a "domain".
We can extract relevant HMMs from https://github.com/EBI-Metagenomics/emg-viral-pipeline. The description is: "VIRify’s taxonomic classification relies on the detection of taxon-specific profile hidden Markov models (HMMs), built upon a set of 22,014 orthologous protein domains and referred to as ViPhOGs".
I recommend the following:
This should give us a good balance of coverage and working with high-quality reference data.
Tested, now documented too: Currently integrating into DARTH, to make it easy for @rchikhi to run it on our divergent assemblies.
Implemented this in Darth on a separate branch, so that this doesn't interfere with the 'production' usage of Darth till this point. @rchikhi, please see new branch
Reassigning to @rchikhi to test as part of generating the genome structure plots.
Of note, the above structure plot had a bug. When Also, the longest hit is plotted regardless of whether or not it covers >= 75% of the profile (my definition of a complete hit). I can change that on request.
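To make the "complete hit" definition concrete, here is a minimal Python sketch. It is not the Darth code: the `Hit` record, its field names, and the `complete_only` switch are hypothetical, and it assumes profile coverage is measured from 1-based inclusive HMM coordinates of the kind reported in a domain table.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one profile hit (field names are assumptions)."""
    profile: str
    hmm_from: int      # 1-based, inclusive start on the profile
    hmm_to: int        # 1-based, inclusive end on the profile
    profile_len: int   # total length of the profile HMM
    seq_from: int      # 1-based, inclusive start on the genome/protein
    seq_to: int        # 1-based, inclusive end on the genome/protein


def profile_coverage(hit: Hit) -> float:
    """Fraction of the profile spanned by the hit."""
    return (hit.hmm_to - hit.hmm_from + 1) / hit.profile_len


def is_complete(hit: Hit, min_cov: float = 0.75) -> bool:
    """A hit is 'complete' if it covers >= min_cov of the profile."""
    return profile_coverage(hit) >= min_cov


def longest_hit(hits: Iterable[Hit], complete_only: bool = False,
                min_cov: float = 0.75) -> Optional[Hit]:
    """Pick the longest hit on the sequence, optionally among complete hits only."""
    candidates = [h for h in hits if not complete_only or is_complete(h, min_cov)]
    return max(candidates, key=lambda h: h.seq_to - h.seq_from + 1, default=None)
```

With `complete_only=False` this mirrors the current plotting behaviour (longest hit wins regardless of coverage); setting it to `True` would be the change offered above.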
Now, here is a genome structure plot where I plot two things: the SARS-CoV-2 Pfam HMM annotations (same plot as above) and the Nido-HMM annotations (red domains). Some Nido hits were inside SARS-CoV-2 hits. I have removed any Nido hit that is contained in a longer Nido or SARS-CoV-2 hit (+/- 5% of the domain length). On request, I can also remove any SARS-CoV-2 hit that's inside a Nido hit.
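The containment filter described above could look roughly like the following Python sketch. This is an illustration, not the plotting script itself: the `Domain` tuple and function names are hypothetical, and it assumes the "+/- 5%" tolerance is taken relative to the length of the inner (shorter) hit.

```python
from typing import List, NamedTuple


class Domain(NamedTuple):
    """Hypothetical annotation record; 'source' is 'nido' or 'sars2'."""
    source: str
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def length(d: Domain) -> int:
    return d.end - d.start + 1


def is_contained(inner: Domain, outer: Domain, tol_frac: float = 0.05) -> bool:
    """True if 'inner' lies within a strictly longer 'outer', allowing a slack
    of tol_frac * len(inner) at each end (assumed reading of '+/- 5%')."""
    tol = tol_frac * length(inner)
    return (
        length(outer) > length(inner)
        and inner.start >= outer.start - tol
        and inner.end <= outer.end + tol
    )


def drop_contained_nido(domains: List[Domain], tol_frac: float = 0.05) -> List[Domain]:
    """Remove Nido hits contained in a longer hit from either library.
    SARS-CoV-2 hits are always kept, matching the behaviour described above."""
    kept = []
    for d in domains:
        if d.source == "nido" and any(
            is_contained(d, other, tol_frac) for other in domains if other is not d
        ):
            continue
        kept.append(d)
    return kept
```

Removing SARS-CoV-2 hits nested inside Nido hits would only need the `d.source == "nido"` condition relaxed.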
I took PsNV and scanned over the whole Pfam using conservative
Worth including?
Can you apply these models to the other Epsy group and see if we get hits?
In progress ;)
Please search the genome against the full Pfam library using
Sure. The point here is that these matches were found even with such strict thresholding.
It's good to know the high-confidence hits, true. But we want to know all of the hits in this scenario.
I'm running the search myself; we can compare notes. 👍
Here's the list against all of Pfam at max sensitivity. Some obvious false positives.
For reference, here's the max sensitivity hits against the Nido-associated Pfams:
A good starting point is Pfam_SARS. I reviewed it a few weeks ago, and IIRC it has pretty good coverage of Betacov genes, but doesn't include some of the genes towards the 3' end found in other genera. For genome structure, we ideally want to include all genes known to occur in Nido. It takes a fair amount of manual labor to go through the literature and online resources, make a list of genes, and match them up with Pfam HMMs, which is not entirely trivial due to inconsistent terminology. If there are known proteins which do not have a corresponding HMM, I can build HMMs from those as needed. I'm going to assign @ababaian for now since he's probably best qualified to understand the nomenclatures.