Genome Alignments

Sam Minot edited this page Mar 10, 2021 · 4 revisions

Background

One way to gain understanding from the results of a metagenomic analysis is to compare the assembled sequence information against a reference database of microbial genomes. Because each genome in that external database is thought to correspond to a single organism, its longer contiguous sequences can be used to visualize the spatial organization of genetic elements which may have assembled de novo only into smaller fragments. To process those alignments quickly for geneshot results, a user may run the Annotation of Microbial Genomes by Microbiome Association (AMGMA) pipeline.

Concepts

Containment

One of the key outputs of AMGMA is a summary of which CAGs contain genes that align to which genomes. A term used frequently in this analysis is 'containment', which refers to the proportion of genes from two sets which are found in both. For a given CAG and genome, three proportions can be described: (1) the proportion of genes from the CAG which also align to the genome, (2) the proportion of genes aligning to the genome which also belong to the CAG, and (3) the 'containment' of the CAG/genome pair, which is the proportion of genes in the union of both sets which are also in the intersection of those sets.
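The three proportions described above can be sketched with plain Python sets. This is a minimal illustration that follows the wording of the paragraph; the function name, the dictionary keys, and the toy gene names are hypothetical, and the pipeline's own calculation of its output columns may differ.

```python
def overlap_summary(cag_genes, genome_genes):
    """Summarize the overlap between the genes in a CAG and the genes aligning to a genome."""
    cag_genes, genome_genes = set(cag_genes), set(genome_genes)
    shared = cag_genes & genome_genes  # genes found in both sets
    union = cag_genes | genome_genes   # genes found in either set
    return {
        "n_shared": len(shared),
        # Proportion of the CAG's genes which also align to the genome
        "cag_prop": len(shared) / len(cag_genes) if cag_genes else 0.0,
        # Proportion of the genome-aligned genes which also belong to the CAG
        "genome_prop": len(shared) / len(genome_genes) if genome_genes else 0.0,
        # Proportion of the union which is also in the intersection
        "union_prop": len(shared) / len(union) if union else 0.0,
    }


# Toy example (hypothetical gene names)
summary = overlap_summary(
    {"gene_a", "gene_b", "gene_c", "gene_d"},
    {"gene_c", "gene_d", "gene_e"},
)
```

Here two of the four CAG genes align to the genome, two of the three genome-aligned genes belong to the CAG, and two of the five genes in the union are shared.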

Abundance

After aligning the gene catalog generated by geneshot against a collection of reference genomes, AMGMA estimates the relative abundance of the genes which align to each genome as the aggregate proportion of gene copies from each specimen which align to that genome. In this way, each genome is assigned an 'abundance' value for each specimen in the experiment.
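As a rough sketch of that aggregation, assuming the per-gene proportions for each specimen are already available as nested dictionaries (the input layout, function name, and toy values below are all hypothetical, not the pipeline's internal representation):

```python
def genome_abundance(gene_props, genome_genes):
    """Sum per-gene proportions over the genes aligning to each genome.

    gene_props:   {specimen: {gene: proportion of gene copies}}
    genome_genes: {genome: iterable of gene names aligning to it}
    Returns       {genome: {specimen: aggregate proportion}}
    """
    result = {}
    for genome, genes in genome_genes.items():
        genes = list(genes)  # allow any iterable, reused for every specimen
        result[genome] = {
            specimen: sum(props.get(gene, 0.0) for gene in genes)
            for specimen, props in gene_props.items()
        }
    return result


# Toy data (hypothetical values)
gene_props = {
    "specimen_1": {"gene_a": 0.5, "gene_b": 0.25, "gene_c": 0.25},
    "specimen_2": {"gene_a": 0.0, "gene_b": 0.75, "gene_c": 0.25},
}
genome_genes = {"genome_x": ["gene_a", "gene_b"], "genome_y": ["gene_c"]}
abund = genome_abundance(gene_props, genome_genes)
```

Genes missing from a specimen are treated as having zero abundance in that specimen.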

Association

Using the abundance values for each genome, it is possible to estimate the association between the relative abundance of the organisms containing that group of genes and the parameters of any experimental design. As implemented in AMGMA, the same formula used to describe the experimental design in a set of geneshot outputs is applied to the AMGMA results, and corncob generates the same set of estimated coefficients for each experimental parameter.

Indexing

Because each genome name can be quite long, an integer index is created for each genome and used to refer to it in many of the outputs. The integer index can be mapped back to the input genome using the /genomes/manifest table.
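A minimal sketch of that lookup with pandas is shown below. It assumes the manifest has the `index`, `id`, and `name` fields shown in the example output, and that the HDF5 table can be read with `pandas.read_hdf` (which requires PyTables); the helper names and the toy genome names are hypothetical.

```python
import pandas as pd


def load_manifest(hdf_path):
    # Assumption: the table was written in a layout pandas can read directly
    return pd.read_hdf(hdf_path, "/genomes/manifest")


def add_genome_names(results, manifest, ix_col="genome_ix"):
    """Attach the genome name to any table which refers to genomes by integer index."""
    # The integer index may be stored as a column or as the DataFrame index
    if "index" in manifest.columns:
        manifest = manifest.set_index("index")
    return results.assign(genome_name=results[ix_col].map(manifest["name"]))


# Toy tables (hypothetical values) standing in for the real outputs
manifest = pd.DataFrame(
    {"index": [0, 1], "id": ["genome_a", "genome_b"], "name": ["Genome A", "Genome B"]}
)
results = pd.DataFrame({"genome_ix": [1, 0, 1], "estimate": [-4.5, -27.0, -30.0]})
named = add_genome_names(results, manifest)
```

With the real output file, `manifest` would come from `load_manifest(...)` rather than being built by hand.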

Contigs

In addition to the external genomes, AMGMA will align the gene catalog against the set of long contigs (above a given size threshold) generated de novo by geneshot. For this reason, users of AMGMA will see results for contig sequences (in which every contig name starts with the name of the specimen it was assembled from) in addition to the genomes which were input.
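Because contig names start with the specimen name, contig results can be separated from input genomes with a simple prefix check. The sketch below assumes the list of specimen names is known; the function name and the `reference_genome_1` identifier are hypothetical.

```python
def split_contigs_and_genomes(ids, specimen_names):
    """Separate de novo contig IDs from input genome IDs by specimen-name prefix."""
    prefixes = tuple(specimen_names)  # str.startswith accepts a tuple of prefixes
    contigs = [i for i in ids if i.startswith(prefixes)]
    genomes = [i for i in ids if not i.startswith(prefixes)]
    return contigs, genomes


ids = [
    "ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225",
    "reference_genome_1",  # hypothetical input genome ID
]
contigs, genomes = split_contigs_and_genomes(ids, ["ERR1204060", "ERR1203923"])
```

Note that an input genome whose ID happens to begin with a specimen name would be misclassified by this check.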

Output Files

The output of AMGMA consists of three files:

  • *.hdf5: Alignment information in HDF5 format (easily accessed with Python)
  • *.rdb: Alignment information in RDB format (used for visualization)
  • *.annotations.hdf5: Additional annotations for each genome (optional)

The output tables in the HDF5 file are as follows, with examples shown from a small dataset in which the experimental parameters are a series of different bacterial species labels:

Genome Manifest (/genomes/manifest)

Identifies the name and ID of each genome in the analysis.

index id name
0 ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225 ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225
1 ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383 ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383
2 ERR1204060__GENE__k99_103__flag=1__multi=73.0000__len=25356 ERR1204060__GENE__k99_103__flag=1__multi=73.0000__len=25356
3 ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999 ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999
4 ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678 ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678
5 ERR1204060__GENE__k99_110__flag=1__multi=69.0000__len=59237 ERR1204060__GENE__k99_110__flag=1__multi=69.0000__len=59237

Estimated Associations (/stats/genome/corncob)

Displays the estimated association of each genome (genome_ix) with each parameter of the experimental design.

| genome_ix | parameter | estimate | p_value | std_error | q_value | neg_log10_qvalue | wald |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 158 | speciesClostridium scindens | -27 | 1 | 5e+04 | 1 | 7.4e-09 | -0.00054 |
| 158 | speciesClostridium symbiosum | -27 | 1 | 4.2e+04 | 1 | 7.4e-09 | -0.00064 |
| 158 | speciesEubacterium rectale | -27 | 1 | 4.6e+04 | 1 | 7.4e-09 | -0.00059 |
| 158 | speciesRuminococcus gnavus | -4.5 | 4.7e-05 | 0.83 | 0.00025 | 3.6 | -5.4 |
| 158 | speciesRuminococcus torques | -27 | 1 | 3.8e+04 | 1 | 7.4e-09 | -0.00071 |
| 159 | speciesClostridium scindens | -30 | 1 | 1.2e+05 | 1 | 7.4e-09 | -0.00025 |
| 159 | speciesClostridium symbiosum | -29 | 1 | 9.8e+04 | 1 | 7.4e-09 | -0.0003 |
| 159 | speciesEubacterium rectale | -30 | 1 | 1.1e+05 | 1 | 7.4e-09 | -0.00028 |
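One way to work with this table is to keep only the genome/parameter pairs which pass a q-value threshold. The sketch below rebuilds a few of the example rows by hand; the function name and the 0.05 threshold are illustrative choices, not part of AMGMA.

```python
import pandas as pd

# A few rows copied from the example table above
corncob = pd.DataFrame(
    [
        (158, "speciesClostridium scindens", -27, 1, 5e4, 1),
        (158, "speciesRuminococcus gnavus", -4.5, 4.7e-05, 0.83, 0.00025),
        (159, "speciesClostridium scindens", -30, 1, 1.2e5, 1),
    ],
    columns=["genome_ix", "parameter", "estimate", "p_value", "std_error", "q_value"],
)


def significant_associations(table, alpha=0.05):
    """Keep rows with q_value at or below alpha, most significant first."""
    hits = table[table["q_value"] <= alpha]
    return hits.sort_values("q_value").reset_index(drop=True)


hits = significant_associations(corncob)
```

In this subset only the association between genome 158 and speciesRuminococcus gnavus passes the threshold.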

Detailed Alignments (/genomes/detail/<CONTIG ID>)

Displays the alignment of a gene catalog against a single genome (identified with the id from the manifest).

index contig gene pident contig_start contig_end contig_len genome_id CAG
2126 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 gene_534aa489_677aa 1e+02 22220 20190 31515 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 1
2127 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 gene_0476f716_627aa 1e+02 16501 14621 31515 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 1
2128 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 gene_c435bcdb_475aa 93 18598 20022 31515 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 1
2129 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 gene_5faa9628_467aa 1e+02 23588 24988 31515 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 1
2130 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 gene_bf5e732b_418aa 1e+02 25005 26258 31515 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 1
2131 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 gene_2632a2fa_422aa 1e+02 5344 4079 31515 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 1
2133 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 gene_5b1f5a24_414aa 1e+02 17017 18258 31515 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 1
2134 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 gene_8cd358e9_411aa 1e+02 28314 27082 31515 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 1
2135 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 gene_96932e78_362aa 1e+02 3119 2034 31515 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 1
2136 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 gene_8b2599b7_284aa 1e+02 7014 6163 31515 ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 1
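In the example rows, some alignments have contig_start greater than contig_end. A small helper can normalize such rows, under two assumptions which are not stated in this page: that start greater than end indicates an alignment on the reverse strand, and that coordinates are 1-based and inclusive. The function name and returned keys are hypothetical.

```python
def describe_alignment(contig_start, contig_end):
    """Normalize one alignment row into strand, left/right coordinates, and span.

    Assumptions: start > end means reverse strand; coordinates are 1-based, inclusive.
    """
    strand = "+" if contig_start <= contig_end else "-"
    left, right = sorted((contig_start, contig_end))
    return {"strand": strand, "left": left, "right": right, "span": right - left + 1}


# Coordinates taken from the first and third example rows above
first = describe_alignment(22220, 20190)
third = describe_alignment(18598, 20022)
```

Under those assumptions the first row spans 2031 bases and the third spans 1425 bases, which matches three times the amino acid lengths in the gene names (677aa and 475aa).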

Containment (/genomes/cags/containment)

Describes the degree of overlap between CAG assignment of genes and the alignment of those genes against each genome.

genome CAG n_genes containment genome_prop genome_bases cag_prop
ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225 8 11 0.67 0.67 29631 0.004
ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383 8 50 0.9 0.9 45415 0.018
ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383 1 1 0.032 0.032 1602 0.00025
ERR1204060__GENE__k99_103__flag=1__multi=73.0000__len=25356 8 25 0.8 0.8 20211 0.009
ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999 8 43 0.8 0.8 31905 0.016
ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999 7 1 0.043 0.043 1704 0.00034
ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678 8 22 0.82 0.82 16131 0.008
ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678 0 1 0.061 0.061 1191 0.00024
ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678 4 1 0.061 0.061 1191 0.00029
ERR1204060__GENE__k99_110__flag=1__multi=69.0000__len=59237 8 51 0.85 0.85 50276 0.018
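A common follow-up question is which CAG best matches each genome. A minimal pandas sketch, using a few of the example rows above and a hypothetical helper name, selects the row with the highest containment value per genome:

```python
import pandas as pd

# A few rows copied from the example table above (subset of columns)
containment = pd.DataFrame(
    [
        ("ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383", 8, 50, 0.9),
        ("ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383", 1, 1, 0.032),
        ("ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999", 8, 43, 0.8),
        ("ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999", 7, 1, 0.043),
    ],
    columns=["genome", "CAG", "n_genes", "containment"],
)


def top_cag_per_genome(table):
    """Return, for each genome, the row with the highest containment value."""
    best_rows = table.groupby("genome")["containment"].idxmax()
    return table.loc[best_rows].set_index("genome")


top = top_cag_per_genome(containment)
```

For both genomes in this subset the best-matching CAG is CAG 8.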

Genome Annotations (/genomes/annotations/<GENOME ID>)

contig type start end orientation annotation
NC_012781.1 gene 1 1362 + ID=gene-EUBREC_RS00010;Name=dnaA;gbkey=Gene;gene=dnaA;gene_biotype=protein_coding;locus_tag=EUBREC_RS00010;old_locus_tag=EUBREC_0001
NC_012781.1 CDS 1 1362 + ID=cds-WP_012740936.1;Parent=gene-EUBREC_RS00010;Dbxref=Genbank:WP_012740936.1;Name=WP_012740936.1;gbkey=CDS;gene=dnaA;inference=COORDINATES: similar to AA sequence:RefSeq:WP_012740936.1;locus_tag=EUBREC_RS00010;product=chromosomal replication initiator protein DnaA;protein_id=WP_012740936.1;transl_table=11
NC_012781.1 gene 1648 2760 + ID=gene-EUBREC_RS00015;Name=EUBREC_RS00015;gbkey=Gene;gene_biotype=protein_coding;locus_tag=EUBREC_RS00015;old_locus_tag=EUBREC_0002
NC_012781.1 CDS 1648 2760 + ID=cds-WP_012740937.1;Parent=gene-EUBREC_RS00015;Dbxref=Genbank:WP_012740937.1;Name=WP_012740937.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_015517736.1;locus_tag=EUBREC_RS00015;product=DNA polymerase III subunit beta;protein_id=WP_012740937.1;transl_table=11
NC_012781.1 gene 2769 2984 + ID=gene-EUBREC_RS00020;Name=EUBREC_RS00020;gbkey=Gene;gene_biotype=protein_coding;locus_tag=EUBREC_RS00020;old_locus_tag=EUBREC_0003
NC_012781.1 CDS 2769 2984 + ID=cds-WP_012740938.1;Parent=gene-EUBREC_RS00020;Dbxref=Genbank:WP_012740938.1;Name=WP_012740938.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_012740938.1;locus_tag=EUBREC_RS00020;product=RNA-binding S4 domain-containing protein;protein_id=WP_012740938.1;transl_table=11
NC_012781.1 gene 2984 4072 + ID=gene-EUBREC_RS00025;Name=recF;gbkey=Gene;gene=recF;gene_biotype=protein_coding;locus_tag=EUBREC_RS00025;old_locus_tag=EUBREC_0004
NC_012781.1 CDS 2984 4072 + ID=cds-WP_012740939.1;Parent=gene-EUBREC_RS00025;Dbxref=Genbank:WP_012740939.1;Name=WP_012740939.1;gbkey=CDS;gene=recF;inference=COORDINATES: similar to AA sequence:RefSeq:WP_012740939.1;locus_tag=EUBREC_RS00025;product=DNA replication/repair protein RecF;protein_id=WP_012740939.1;transl_table=11
NC_012781.1 gene 4065 6002 + ID=gene-EUBREC_RS00030;Name=gyrB;gbkey=Gene;gene=gyrB;gene_biotype=protein_coding;locus_tag=EUBREC_RS00030;old_locus_tag=EUBREC_0005
NC_012781.1 CDS 4065 6002 + ID=cds-WP_012740940.1;Parent=gene-EUBREC_RS00030;Dbxref=Genbank:WP_012740940.1;Name=WP_012740940.1;gbkey=CDS;gene=gyrB;inference=COORDINATES: similar to AA sequence:RefSeq:WP_006857737.1;locus_tag=EUBREC_RS00030;product=DNA topoisomerase (ATP-hydrolyzing) subunit B;protein_id=WP_012740940.1;transl_table=11
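The annotation column in the example holds semicolon-delimited key=value pairs, which can be split into a dictionary for easier filtering. This is a minimal sketch with a hypothetical function name; it does not decode any escaped characters which may appear in other annotation files.

```python
def parse_attributes(annotation):
    """Split a 'key=value;key=value' annotation string into a dictionary."""
    fields = {}
    for item in annotation.split(";"):
        if not item:
            continue  # skip empty items, e.g. from a trailing semicolon
        key, _, value = item.partition("=")  # split on the first '=' only
        fields[key] = value
    return fields


# Annotation copied from the first example row above
attrs = parse_attributes(
    "ID=gene-EUBREC_RS00010;Name=dnaA;gbkey=Gene;gene=dnaA;"
    "gene_biotype=protein_coding;locus_tag=EUBREC_RS00010;old_locus_tag=EUBREC_0001"
)
```

After parsing, a field such as the gene name is available as `attrs["Name"]`.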

Abundances (/genome/abund/raw/<SPECIMEN>)

Lists the abundance of each genome (acc) in a given specimen.

abund acc
0 ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225
1.7e-05 ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383
0 ERR1204060__GENE__k99_103__flag=1__multi=73.0000__len=25356
5.1e-05 ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999
0 ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678
0 ERR1204060__GENE__k99_110__flag=1__multi=69.0000__len=59237
0 ERR1204060__GENE__k99_112__flag=1__multi=80.0000__len=17323
8.3e-05 ERR1204060__GENE__k99_113__flag=1__multi=93.0000__len=37406
0.00013 ERR1204060__GENE__k99_114__flag=1__multi=74.0000__len=73606
0 ERR1204060__GENE__k99_115__flag=1__multi=61.0000__len=20828
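Since there is one abundance table per specimen, it can be convenient to combine them into a single genome-by-specimen matrix. The sketch below assumes each table has already been loaded as a pandas DataFrame with the `abund` and `acc` columns shown above; the function name, specimen labels, and toy values are hypothetical.

```python
import pandas as pd


def abundance_matrix(tables):
    """Combine per-specimen tables into one genome-by-specimen matrix.

    tables: {specimen: DataFrame with 'abund' and 'acc' columns}
    """
    columns = {
        specimen: table.set_index("acc")["abund"] for specimen, table in tables.items()
    }
    # Genomes absent from a specimen's table are filled in as zero abundance
    return pd.DataFrame(columns).fillna(0.0)


# Toy tables (hypothetical values)
tables = {
    "specimen_1": pd.DataFrame({"abund": [0.0, 1.7e-05], "acc": ["contig_a", "contig_b"]}),
    "specimen_2": pd.DataFrame({"abund": [5.1e-05], "acc": ["contig_b"]}),
}
matrix = abundance_matrix(tables)
```

Rows of the resulting matrix are genomes and columns are specimens.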