Skip to content

NCBI-Codeathons/Domain_HMM_Boundaries

Repository files navigation

Detecting protein domains in metagenomic data

AwesomeLogo

We will be cite-able soon! Please check back for updates on the NCBI Virus Discovery codeathon paper submission.

Rationale

Metagenomic data can be difficult to interpret. Which types of bacteria and viruses are present? Are specific sequences present that might be indicative of pathogenicity? One way to address these questions is to assess the protein domain content of a metagenome. Our tools allow users to perform a comprehensive annotation of domains within their metagenomic assembly (using 6-frame translation RPS-BLAST), allowing follow-up on specific occurrences of domains of interest. We also provide a set of domain matches found in metagenomics assemblies, that can be used as a "gold-standard" (or at least a silver one:-) for future experiments. Additionally, we provide an alternative version of our pipeline that we began testing (but did not deploy on our full dataset) that uses MASH on translated reads to perform a more quick & rough scan of the domain content of their metagenomic data.

Potential uses of these results include:

  • Filtering your metagenome to just the contigs that contain a particular virus of interest
  • Summarizing the gene/protein content of your metagenome
  • Estimating taxonomy from domains or mapping your metagenome's domains onto known taxonomy (for more details see the Taxonomy Domain Integration repo)

Schematic description

Workflow

How to use this pipeline

This pipeline takes the following as inputs:

(1) Query sequences, which should be contigs of a metagenome assembly*,

and also

(2) Domain models, which represent existing domain models - e.g., from CDD, PFAM, POGs/PVOGs, etc., in PSSM format.

We provide tools for the user to perform Domain search using 6-frame translation of Reverse Position-Specific BLAST (RPS-BLAST) (sometimes unofficially referred to as "RPS-tBLASTn"), or a non-optimized implementation of Mash.

*There are several ways to generate a metagenomic assembly; we built the one for our use case with SKESA.

Output

If using RPS-BLAST, a tab delimited file will be generated with added information on the source SRR sample:

Example table:

contig_id CDD pident length mismatch gapopen qstart qend sstart send evalue bitscore SRR contig_id_only contig_length
Contig_321_123.726:1.12237 CDD:222861 38.528 231 131 3 5068 5739 1 227 1.04e-47 162.0 SRR4451607 Contig_321_123.726 12237
NC_019445.1_3:1.15349 CDD:283078 91.379 58 5 0 14004 14177 1 58 3.46e-27 97.9 SRR4451607 NC_019445.1_3 15349
Contig_17_145.65:1.11778 CDD:165469 47.489 219 106 4 9292 8645 99 311 6.62e-73 238.0 SRR4451607 Contig_17_145.65 11778
Contig_256_126.089:1.6592 CDD:164995 47.674 172 84 4 1614 2120 1 169 7.67e-54 177.0 SRR4451607 Contig_256_126.089 6592
Contig_30_137.493:1.78458 CDD:222896 28.342 374 232 13 34344 33238 17 359 2.93e-43 158.0 SRR4451607 Contig_30_137.493 78458
Contig_11_130.746:1.25303 CDD:222896 23.82 445 272 19 19072 17810 8 409 7.54e-34 129.0 SRR4451607 Contig_11_130.746 25303

If using Mash, a file will be generated containing the following information:

  • Domain ID
  • Query sequence
  • Estimated distance
  • P-value
  • Number of sketches in the query found in number of sketches for the domain.

Initial use case

We used the following data to assess the runtime, scalability, and accuracy of this pipeline:

(1) Query sequences are from assembled contigs found in Codeathon1. For RPS-BLAST, we used the translated ORFs from all 3,000+ datasets, while for MASH, we only used a subset of 700 of them.

(2) Domain models are from CDD.

Parallelization is built into the pipeline in a hard-coded form. We initially parallelized across 64 nodes on 10 cloud instances.

Example outputs

AwesomeLogo AwesomeLogo