Develop Nido HMM library for genome annotations #227

Open
rcedgar opened this issue Jul 25, 2020 · 21 comments
@rcedgar
Collaborator

rcedgar commented Jul 25, 2020

A good starting point is Pfam_SARS. I reviewed it a few weeks ago and, IIRC, it has pretty good coverage of Betacov genes but doesn't include some of the genes towards the 3' end found in other genera. For genome structure, we ideally want to include all genes known to occur in Nido. It takes a fair amount of manual labor to go through the literature and online resources, make a list of genes, and match them up with Pfam HMMs, which is not entirely trivial due to inconsistent terminology. If there are known proteins that do not have a corresponding HMM, I can build HMMs from those as needed. I'm going to assign @ababaian for now since he's probably best qualified to understand the nomenclature.

@ababaian
Owner

We can probably start from eggNOG or something else initially. Are we interested in ORFs, mature proteins, or domains within the proteins? What level of resolution should we aim for?

@rcedgar
Collaborator Author

rcedgar commented Jul 25, 2020

Resolution = domain structure figure.

@rcedgar
Collaborator Author

rcedgar commented Jul 25, 2020

Nothing overlapping. Not mature proteins, because we can't predict them in highly diverged genomes. So the answer is probably close to whatever Pfam considers to be a "domain".

@asl

asl commented Jul 25, 2020

We can extract relevant HMMs from https://github.com/EBI-Metagenomics/emg-viral-pipeline. The description is: "VIRify’s taxonomic classification relies on the detection of taxon-specific profile hidden Markov models (HMMs), built upon a set of 22,014 orthologous protein domains and referred to as ViPhOGs".

@taltman
Collaborator

taltman commented Jul 27, 2020

I recommend the following:

  • Pull out all curated Nidovirales proteins from UniProt (i.e. the ones from SwissProt), which is about 540 entries
  • Scrape out the Pfam annotations curated on those entries
  • Create a Pfam library consisting of the union of these Pfam entries from SwissProt and the CoV-specific Pfam set we've been working with to date

This should give us a good balance of coverage and working with high-quality reference data.
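
As a rough illustration of the three steps above, here is a minimal Python sketch. It assumes the curated entries are available in UniProtKB flat-file format, where Pfam cross-references appear on `DR   Pfam; ...` lines. The function names, the example entries, and the `PF9000x` accessions are placeholders made up for the illustration, not real Nidovirales annotations.

```python
import re

# Matches a Pfam cross-reference line of a UniProtKB flat-file entry,
# e.g. "DR   Pfam; PF90001; EXAMPLE_DOM_A; 1."
PFAM_DR = re.compile(r"^DR\s+Pfam;\s*(PF\d{5});", re.MULTILINE)


def pfam_accessions(flat_file_text):
    """Return the set of Pfam accessions cross-referenced in the text."""
    return {match.group(1) for match in PFAM_DR.finditer(flat_file_text)}


def build_library(swissprot_text, cov_accessions):
    """Union of the Swiss-Prot-derived Pfam set and an existing Pfam set."""
    return sorted(pfam_accessions(swissprot_text) | set(cov_accessions))


# Placeholder input: two made-up entries with made-up accessions.
example_entries = """\
ID   EXAMPLE_1   Reviewed;   100 AA.
DR   Pfam; PF90001; EXAMPLE_DOM_A; 1.
DR   InterPro; IPR000000; Example.
//
ID   EXAMPLE_2   Reviewed;   200 AA.
DR   Pfam; PF90002; EXAMPLE_DOM_B; 1.
DR   Pfam; PF90001; EXAMPLE_DOM_A; 1.
//
"""
example_cov_set = ["PF90002", "PF90003"]  # stands in for the CoV-specific set

library = build_library(example_entries, example_cov_set)
print(library)
```

Using sets means a Pfam entry that appears in several proteins, or in both sources, ends up in the library only once.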

@taltman taltman assigned taltman and unassigned ababaian Jul 28, 2020
@taltman
Collaborator

taltman commented Jul 28, 2020

Tested, now documented too:
https://github.com/ababaian/serratus/wiki/Creating-taxon-specific-slices-of-the-Pfam-HMM-database

Currently integrating into DARTH, to make it easy for @rchikhi to run it on our divergent assemblies.

@taltman
Collaborator

taltman commented Jul 28, 2020

Implemented this in Darth on a separate branch, so that it doesn't interfere with the 'production' usage of Darth to date. @rchikhi, please see the new branch nido-pfam. I've tested it on SRR5234495, and it works fine. Please feel free to use this set-up for the other divergent assemblies that we want to annotate for the genome structure plot.

@taltman
Collaborator

taltman commented Jul 28, 2020

Reassigning to @rchikhi to test as part of generating the genome structure plots.

@taltman taltman assigned rchikhi and unassigned taltman Jul 28, 2020
@rchikhi
Collaborator

rchikhi commented Aug 1, 2020

For reference, here is the original genome structure plot with the SARS-CoV-2 Pfam HMMs, using the --cut_ga option of HMMER3.
image

@rchikhi
Collaborator

rchikhi commented Aug 1, 2020

Of note, the above structure plot had a bug. When hmmsearch reported multiple hits for a domain, I had kept only the rightmost one. Now, I keep only the longest hit.

Also, the longest hit is plotted regardless of whether or not it covers >= 75% of the profile (my definition of a complete hit). I can change that on request.
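
A minimal Python sketch of the selection rule described above, for readers who want to reproduce it. It is not the plotting script itself: the `Hit` class, the field names, and the example coordinates are invented for the illustration, and it assumes "longest" means the longest span on the sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    domain: str    # profile (domain) name
    seq_from: int  # hit start on the sequence, 1-based inclusive
    seq_to: int    # hit end on the sequence, inclusive
    hmm_from: int  # first profile position covered
    hmm_to: int    # last profile position covered
    hmm_len: int   # total profile length

    @property
    def length(self):
        return self.seq_to - self.seq_from + 1

    @property
    def profile_coverage(self):
        return (self.hmm_to - self.hmm_from + 1) / self.hmm_len


def longest_hit_per_domain(hits):
    """Keep one hit per domain: the longest one (not the rightmost)."""
    best = {}
    for hit in hits:
        current = best.get(hit.domain)
        if current is None or hit.length > current.length:
            best[hit.domain] = hit
    return best


def is_complete(hit, min_coverage=0.75):
    """A hit counts as complete if it covers >= 75% of the profile."""
    return hit.profile_coverage >= min_coverage


# Made-up coordinates: domA has two hits, and the rightmost one is shorter.
hits = [
    Hit("domA", 100, 200, 1, 90, 100),
    Hit("domA", 500, 560, 30, 90, 100),
    Hit("domB", 300, 350, 10, 40, 100),
]
best = longest_hit_per_domain(hits)
for name, hit in best.items():
    print(name, hit.seq_from, hit.seq_to, is_complete(hit))
```

With this layout, plotting only complete hits would be a one-line filter on `is_complete`.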

@rchikhi
Collaborator

rchikhi commented Aug 1, 2020

Here is a different plot made by fixing the bug, and also substituting --cut_ga with --max -E 0.01. (Epsy accessions were reordered arbitrarily, sorry about that).

Observe that more domains (e.g. in 3' Epsy) are found, and some (e.g. viral helicase) are more consistently found.

image

@rchikhi
Collaborator

rchikhi commented Aug 1, 2020

Now, here is a genome structure plot where I plot two things: the SARS-CoV-2 Pfam HMM hits (same plot as above) and the Nido-HMM annotations (red domains).

Some Nido hits were inside SARS-CoV-2 hits. I have removed any Nido hit that is included in a longer Nido or SARS-CoV-2 hit (+/- 5% of the domain length). On request, I can also remove any SARS-CoV-2 hit that's inside a Nido hit.

image
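
The containment filter described in this comment could look roughly like the Python sketch below. It is an interpretation rather than the actual script: hits are plain `(start, end)` tuples, the function names are made up, and the "+/- 5%" slack is assumed to be 5% of the length of the hit being tested.

```python
def is_contained(inner, outer, slack_fraction=0.05):
    """True if `inner` lies within a longer `outer`, allowing some slack.

    Hits are (start, end) tuples, 1-based and inclusive. The slack is
    assumed to be a fraction of the inner hit's length.
    """
    inner_len = inner[1] - inner[0] + 1
    outer_len = outer[1] - outer[0] + 1
    if outer_len <= inner_len:
        return False  # only a strictly longer hit can absorb another one
    slack = slack_fraction * inner_len
    return inner[0] >= outer[0] - slack and inner[1] <= outer[1] + slack


def filter_nido_hits(nido_hits, sars_hits):
    """Drop Nido hits that sit inside a longer Nido or SARS-CoV-2 hit."""
    kept = []
    for i, hit in enumerate(nido_hits):
        others = nido_hits[:i] + nido_hits[i + 1:] + sars_hits
        if not any(is_contained(hit, other) for other in others):
            kept.append(hit)
    return kept


# Made-up coordinates for illustration.
sars = [(97, 400)]
nido = [(95, 199), (395, 500), (1000, 1500), (1100, 1200)]
kept = filter_nido_hits(nido, sars)
print(kept)
```

In this example `(95, 199)` is dropped even though it starts two positions before the SARS-CoV-2 hit, because that overhang is within the slack, while `(395, 500)` is kept because it extends far past the end of that hit.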

@asl

asl commented Aug 4, 2020

I took PsNV and scanned over the whole Pfam using conservative --cut_ga. It looks like there are quite significant matches to 2 models that are not in Nido HMM:

#                                                                            --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord   
# target name        accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
AAA_30               PF13604.7    192 MK611985.1_2         -          12217   5.6e-11   42.5   0.0   1   1   2.3e-14   8.5e-11   41.9   0.0    21   130  5575  5704  5568  5707 0.85 AAA domain
LAP1C                PF05609.13   456 MK611985.1_3         -          12217   8.7e-80  269.1   0.1   1   1   1.6e-83   1.4e-79  268.3   0.1   238   455  7651  7869  7614  7870 0.94 Lamina-associated polypeptide 1C (LAP1C)

Worth including?
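
For anyone who wants to work with rows like the two above programmatically, here is a small Python sketch that parses HMMER's per-domain table format (22 whitespace-separated columns followed by a free-text description). The `DomainHit` type and `parse_domtblout` function are names made up for this example. The coverage calculation assumes that the target column holds the Pfam model, as the rows in this comment suggest, so that `tlen` is the profile length.

```python
from typing import NamedTuple


class DomainHit(NamedTuple):
    target: str
    accession: str
    tlen: int
    query: str
    full_evalue: float
    i_evalue: float
    dom_score: float
    hmm_from: int
    hmm_to: int
    ali_from: int
    ali_to: int
    description: str


def parse_domtblout(text):
    """Parse per-domain table rows, skipping comment and blank lines."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # 22 fixed columns; whatever follows is the description.
        f = line.split(maxsplit=22)
        hits.append(DomainHit(
            target=f[0], accession=f[1], tlen=int(f[2]), query=f[3],
            full_evalue=float(f[6]), i_evalue=float(f[12]),
            dom_score=float(f[13]),
            hmm_from=int(f[15]), hmm_to=int(f[16]),
            ali_from=int(f[17]), ali_to=int(f[18]),
            description=f[22] if len(f) > 22 else "",
        ))
    return hits


# The two rows reported in this comment, whitespace condensed.
rows = """\
AAA_30 PF13604.7 192 MK611985.1_2 - 12217 5.6e-11 42.5 0.0 1 1 2.3e-14 8.5e-11 41.9 0.0 21 130 5575 5704 5568 5707 0.85 AAA domain
LAP1C PF05609.13 456 MK611985.1_3 - 12217 8.7e-80 269.1 0.1 1 1 1.6e-83 1.4e-79 268.3 0.1 238 455 7651 7869 7614 7870 0.94 Lamina-associated polypeptide 1C (LAP1C)
"""
hits = parse_domtblout(rows)
for h in hits:
    coverage = (h.hmm_to - h.hmm_from + 1) / h.tlen
    print(h.target, h.i_evalue, h.ali_from, h.ali_to, round(coverage, 2))
```

Under that assumption, the fraction of each profile covered by the match can be read directly from the `hmm coord` columns and `tlen`.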

@ababaian
Owner

ababaian commented Aug 4, 2020

Can you apply these models to the other Epsy group and see if we get hits?

@asl

asl commented Aug 4, 2020

In progress ;)

@taltman
Collaborator

taltman commented Aug 4, 2020

Please search the genome against the full Pfam library using --max -E 0.01 instead of --cut_ga. These are very distant proteins, and the HMMER3 heuristics will miss many of them.
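
To make the two settings concrete, here is a small Python helper that builds the corresponding command lines. The flags `--cut_ga`, `--max`, `-E`, and `--domtblout` are standard HMMER3 options; the helper itself, its name, and the file names are hypothetical and only serve the illustration.

```python
def hmmer_command(hmm_db, proteins, domtblout,
                  mode="sensitive", program="hmmsearch"):
    """Build an HMMER3 argument list for one of the two threshold modes.

    "gathering": use the curated Pfam gathering thresholds (--cut_ga).
    "sensitive": turn off the heuristic filters (--max) and report hits
                 with an E-value of at most 0.01 (-E 0.01).
    """
    if mode == "gathering":
        threshold = ["--cut_ga"]
    elif mode == "sensitive":
        threshold = ["--max", "-E", "0.01"]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [program, *threshold, "--domtblout", domtblout, hmm_db, proteins]


# Hypothetical file names.
strict = hmmer_command("Pfam-A.hmm", "orfs.faa", "strict.domtblout",
                       mode="gathering")
sensitive = hmmer_command("Pfam-A.hmm", "orfs.faa", "sensitive.domtblout")
print(" ".join(strict))
print(" ".join(sensitive))
# To actually run a search: subprocess.run(sensitive, check=True)
```

Writing both runs to per-domain tables makes it easy to compare the high-confidence hits with the max-sensitivity ones afterwards.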

@asl

asl commented Aug 4, 2020

Sure. The point here is that these matches were found even with such strict thresholding.

@taltman
Collaborator

taltman commented Aug 4, 2020

It's good to know the high-confidence hits, true. But we want to know all of the hits in this scenario.

@taltman
Collaborator

taltman commented Aug 4, 2020

I'm running the search myself; we can compare notes. 👍

@taltman
Collaborator

taltman commented Aug 4, 2020

Here's the list against all of Pfam at max sensitivity. Some obvious false positives.

120_Rick_ant
AAA_11
AAA_12
AAA_16
AAA_19
AAA_22
AAA_30
AAA_5
ACC_epsilon
AIF_C
ANP
ARL6IP6
Adeno_E3
Adeno_E3_CR1
Albumin_I
Antimicrobial_7
BTG
BTK
BamHI
BrkDBD
C1_1
CALCOCO1
CBFB_NFYA
CDC24_OB1
CENP-Q
COG2
CcmD
CnrY
CoV_Methyltr_1
CoV_Methyltr_2
CoV_NSP10
CoV_NSP15_C
CoV_NSP8
CoV_RPol_N
CoV_S1
Crescentin
CwsA
DAZAP2
DCP1
DGC
DNAP_B_exo_N
DNA_mis_repair
DNA_pol_delta_4
DUF1406
DUF1482
DUF1673
DUF1689
DUF1779
DUF1981
DUF2052
DUF2075
DUF2304
DUF2316
DUF2959
DUF316
DUF3180
DUF3433
DUF3458
DUF3483
DUF3754
DUF3894
DUF3955
DUF4271
DUF4306
DUF4485
DUF4580
DUF4682
DUF4795
DUF4998
DUF5395
DUF5461
DUF5557
DUF5665
DUF599
DUF782
DUF859
Defensin_3
Dimerisation2
E_raikovi_mat
EndoU_bacteria
Endonuclea_NS_2
EphA2_TM
EzrA
FAM150
Fer4_18
GatB_N
GlutR_N
Glyco_hydro_14
HAD_SAK_1
HAP1_N
HNH
HTH_37
HU-CCDC81_bac_2
Helicase_RecD
Hemagglutinin
Herpes_UL52
Herpes_ori_bp
Hs1pro-1_N
Hus1
IceA2
Img2
Inhibitor_I9
Integrase_Zn
Kei1
Keratin_matx
LAP1C
LEAP-2
LIFR_D2
LPD11
LRR19-TM
LRRC37
Lambda_CIII
LcrV
MFS_3
MG1
MTCP1
Multi-haem_cyto
NCD3G
NLRC4_HD2
NUP
Na_K-ATPase
OSK
OSTbeta
OTU
Orthoreo_P10
P53_C
PWWP
Peptidase_C30
Pesticin
Phage_Capsid_P3
Phosphatase
PhrC_PhrF
Plasmodium_Vir
Pox_A6
Pox_P4A
RANK_CRD_2
RE_HaeIII
RIG-I_C-RD
RINT1_TIP1
RPA_interact_C
Rad33
Rb_C
RcsF
RdRP_1
RdRP_4
ResIII
Ribosomal_S17_N
SCO1-SenC
SIX1_SD
SOAR
SPRY
Sec62
Seryl_tRNA_N
Spc7
Spike_torovirin
SpoVS
Steroid_dh
Syntaxin
TIG_plexin
TNFR_c6
Toxin_2
Transposase_28
TyrRSs_C
US2
UvrB
UvrD_C_2
V-ATPase_H_N
VWA_N2
Viral_helicase1
WD40_like
XhoI
YpmT
YrhC
ZapB
Zn-C2H2_12
zf-C2HC
zf-C3HC4_5
zf-CCHC_3
zf-CGNR
zf-UDP
zf_ZIC

@taltman
Collaborator

taltman commented Aug 4, 2020

For reference, here are the max-sensitivity hits against the Nido-associated Pfams:

AAA_12
CoV_Methyltr_1
CoV_Methyltr_2
CoV_NSP10
CoV_NSP15_C
CoV_NSP8
CoV_RPol_N
CoV_S1
Peptidase_C30
RdRP_1
Spike_torovirin
Viral_helicase1
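
A quick way to compare the two lists is a set difference. The Python sketch below uses the 12 Nido-associated names exactly as listed here, plus only a short excerpt of the full-Pfam list (four extra names taken from it), so the output is illustrative rather than the complete difference. The helper name is made up.

```python
def names(block):
    """Turn a block of one-name-per-line text into a set of names."""
    return {line.strip() for line in block.splitlines() if line.strip()}


nido_hits = names("""
AAA_12
CoV_Methyltr_1
CoV_Methyltr_2
CoV_NSP10
CoV_NSP15_C
CoV_NSP8
CoV_RPol_N
CoV_S1
Peptidase_C30
RdRP_1
Spike_torovirin
Viral_helicase1
""")

# Excerpt only: the Nido names plus four others from the full-Pfam list.
full_pfam_excerpt = nido_hits | names("""
AAA_30
Helicase_RecD
LAP1C
RdRP_4
""")

# Models hit in the full-Pfam search but absent from the Nido library.
only_in_full = sorted(full_pfam_excerpt - nido_hits)
print(only_in_full)
```

Running the same difference on the complete list would give the candidate models to review for inclusion, after weeding out the obvious false positives.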
