Expand the set of Nidovirales-associated Pfam HMMs for contig identification #231

Open
taltman opened this issue Aug 7, 2020 · 2 comments

Collaborator

taltman commented Aug 7, 2020

@asl @rcedgar

To help coronaSPAdes identify CoV-associated contigs from the full assembly graph, we need to expand our set of target HMMs.

Versions:

  1. Pfam-SARS model HMMs
  2. (current) All HMMs from curated Nidovirales proteins in UniProt, which is a superset of Pfam-SARS

The next version could run the full Pfam HMM library against the following sequences:

  1. All curated CoV genomes in GenBank & RefSeq (e.g., cov3ma)
  2. cov3ma + other Nidovirales genomes
  3. All HMMs from Nidovirales proteins in UniProt, whether SwissProt (curated) or TrEMBL (uncurated)

Also, we should figure out whether to run hmmsearch with maximum sensitivity (--max -E 0.01) or to be conservative and use --cut_ga.
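
To make the two hmmsearch options concrete, here is a minimal sketch of how both runs could be set up and their hits compared. It assumes the genomes have already been translated into protein sequences. The helper names and the file names (`Pfam-A.hmm`, `proteins.faa`) are hypothetical; `--max`, `-E`, `--cut_ga` and `--tblout` are standard hmmsearch options.

```python
import shlex

def hmmsearch_cmd(hmm_db, seqs, tblout, mode):
    """Build an hmmsearch command line for one of the two modes discussed."""
    if mode == "sensitive":
        # Turn off the heuristic filters and report hits up to E-value 0.01.
        opts = ["--max", "-E", "0.01"]
    elif mode == "conservative":
        # Use the per-model gathering thresholds stored in the Pfam HMMs.
        opts = ["--cut_ga"]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return ["hmmsearch", *opts, "--tblout", tblout, hmm_db, seqs]

def parse_tblout(text):
    """Return the set of (target sequence, model) pairs from --tblout output."""
    hits = set()
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        hits.add((fields[0], fields[2]))  # column 1: target, column 3: query model
    return hits

def extra_hits(sensitive_text, conservative_text):
    """Hits found only by the sensitive run, i.e. what --cut_ga would drop."""
    return parse_tblout(sensitive_text) - parse_tblout(conservative_text)

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for mode in ("sensitive", "conservative"):
        cmd = hmmsearch_cmd("Pfam-A.hmm", "proteins.faa", f"{mode}.tbl", mode)
        print(shlex.join(cmd))
```

Looking at `extra_hits` on a few known CoV genomes would show whether the additional sensitive-mode hits are real weak matches or mostly noise.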

asl commented Aug 7, 2020

I would probably not extend the set of HMMs so rapidly and would instead go for a more "iterative" approach:

  1. Take all Epsy assemblies
  2. We expect that there should be a missed part that contains a (quite weak) match to the Spike_torovirin model
  3. Take the assembly graphs and try to find the subgraph with the missed match. Check whether there is some evidence for a poly-A tail (e.g. a high-coverage tail in the case of trimmed assemblies)
  4. Extract the missed part and try to improve the Spike model, probably also looking for other more or less conservative matches
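
As a rough illustration of the poly-A check in step 3, the sketch below looks for a homopolymer run at a contig end. It only covers the untrimmed case where the tail is still present in the sequence; the coverage-based evidence mentioned for trimmed assemblies is not modelled. The function names and the `min_run=12` threshold are assumptions made for this example, not values from coronaSPAdes.

```python
def terminal_run(seq, base):
    """Length of the uninterrupted run of `base` at the end of `seq`."""
    seq = seq.upper()
    return len(seq) - len(seq.rstrip(base))

def polya_evidence(seq, min_run=12):
    """Report which contig end, if any, looks like a poly-A tail.

    A contig assembled in reverse-complement orientation shows the tail
    as a leading poly-T run instead of a trailing poly-A run.
    min_run is an arbitrary illustrative threshold.
    """
    if terminal_run(seq, "A") >= min_run:
        return "3prime_polyA"
    # Reversing the string turns a leading T run into a trailing one.
    if terminal_run(seq[::-1], "T") >= min_run:
        return "5prime_polyT"
    return None

if __name__ == "__main__":
    print(polya_evidence("ACGTTGCA" + "A" * 20))
    print(polya_evidence("ACGTTGCA"))
```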

Note that item 1 above (set of CoV genomes vs Pfam) has effectively been done already: see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6295324/ where a set of 93 HMMs is available. We just need to take the newer versions of them, if available (that set was based on Pfam release 31 and we're at 33 these days).
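
One way to pick up the newer versions is to match on the Pfam accession while ignoring the version suffix. The sketch below does this in plain Python over the text of a concatenated HMMER3 profile file. It assumes each record ends with a `//` line and carries an `ACC` header line, as Pfam HMM files do. The function name and the example accessions are hypothetical. In practice `hmmfetch -f` with a key file of model names may be simpler.

```python
def select_hmms(hmm_text, wanted_accessions):
    """Yield HMM records whose accession matches, ignoring the version suffix.

    hmm_text: contents of a concatenated HMMER3 profile file.
    wanted_accessions: accessions from the older release, e.g. "PF00001.20".
    """
    wanted = {acc.split(".")[0] for acc in wanted_accessions}
    record = []
    for line in hmm_text.splitlines(keepends=True):
        record.append(line)
        if line.strip() == "//":  # end of one HMM record
            acc = next(
                (l.split()[1] for l in record if l.startswith("ACC ")), None
            )
            if acc is not None and acc.split(".")[0] in wanted:
                yield "".join(record)
            record = []

if __name__ == "__main__":
    # Tiny made-up example: two truncated records, one of which is wanted.
    demo = (
        "HMMER3/f [3.3 | Nov 2019]\nNAME  ModelA\nACC   PF00001.23\n//\n"
        "HMMER3/f [3.3 | Nov 2019]\nNAME  ModelB\nACC   PF00002.5\n//\n"
    )
    for rec in select_hmms(demo, ["PF00001.20"]):
        print(rec, end="")
```

Models that were renamed or retired between releases would be silently skipped here, so comparing the number of records returned against the 93 expected is a useful sanity check.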

asl commented Aug 7, 2020

Also, we already have #227 here...
