
What Is A Gene?

Sam Minot edited this page Feb 5, 2020 · 2 revisions

We talk a lot about "genes" when we talk about Geneshot; after all, it's in the name! However, the term hides a fair amount of complexity that is worth explaining.

At its core, Geneshot detects and quantifies protein-coding nucleotide sequences in whole-genome shotgun (WGS) datasets. Such a sequence is present in a microbiome sample because a microbe in that sample encodes a gene in its genome. In the context of a single genome it makes a lot of sense to talk about a gene, because we have a firm concept of what a gene in a genome is. The tricky bit comes when you compare two genomes and ask the question, "Are these two highly similar protein-coding sequences actually the same gene or different genes?"

The biologically rigorous answer to the question, "What makes two sequences the same gene?" involves a deep understanding of the function of the gene in both biological systems, and of whether the functions of one are essentially the same as those of the other. This concern with gene homology goes to the core of modern molecular biology and is a good stand-in for the full complexity of understanding that we have built up over the last hundred years as a worldwide community of researchers. But that level of understanding can't be automated (yet), and so we had to make some assumptions in order to get Geneshot off the ground.

Operational Definitions

For the purposes of Geneshot, we talk about protein-coding nucleotide sequences in terms of 'alleles' and 'genes'. An 'allele' is a protein-coding nucleotide sequence which has been detected in a single biological sample by de novo metagenomic assembly. We are fairly confident that each allele represents a single, homogeneous sequence that can be found in multiple cells from that biological sample.

When comparing across multiple samples, we group 'alleles' into 'genes' by their amino acid similarity (by default, 90% amino acid identity and 50% overlap). In other words, a 'gene' here is an operational definition: a stringent similarity threshold applied to alleles, which allows us to proceed with analysis. There may well be multiple 'genes' in this analysis which perform essentially the same function but which are less than 90% identical at the amino acid level, and which are therefore treated as different entities (for the purpose of this analysis).
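To make the grouping step concrete, here is a minimal Python sketch of how alleles could be collapsed into operational 'genes' using the default thresholds. It is an illustration only, not Geneshot's actual implementation: the `Hit` record, the `group_alleles` function, the allele names, and the alignment values are all hypothetical, and it assumes that pairwise amino acid alignments have already been computed, that "overlap" means the fraction of the sequence covered by the alignment, and that grouping is single-linkage (any passing alignment joins two alleles). A real clustering tool may use a different strategy, such as comparing each allele to a cluster representative.

```python
from dataclasses import dataclass

# Default thresholds quoted in the text above.
MIN_IDENTITY = 0.90  # fraction of identical amino acids in the alignment
MIN_OVERLAP = 0.50   # fraction of the sequence covered (assumed meaning of "overlap")


@dataclass(frozen=True)
class Hit:
    """One hypothetical pairwise amino acid alignment between two alleles."""
    allele_a: str
    allele_b: str
    identity: float
    overlap: float


def group_alleles(alleles, hits, min_identity=MIN_IDENTITY, min_overlap=MIN_OVERLAP):
    """Group alleles into operational 'genes' by single-linkage over passing hits."""
    # Union-find: every allele starts out as its own group.
    parent = {allele: allele for allele in alleles}

    def find(x):
        # Walk up to the group root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for hit in hits:
        # Only alignments meeting both thresholds join two alleles.
        if hit.identity >= min_identity and hit.overlap >= min_overlap:
            root_a, root_b = find(hit.allele_a), find(hit.allele_b)
            if root_a != root_b:
                parent[root_b] = root_a

    genes = {}
    for allele in alleles:
        genes.setdefault(find(allele), []).append(allele)
    return sorted(sorted(members) for members in genes.values())


# Made-up alleles from three samples, with made-up alignment results.
alleles = ["sampleA_1", "sampleB_7", "sampleC_3", "sampleC_9"]
hits = [
    Hit("sampleA_1", "sampleB_7", identity=0.97, overlap=0.95),
    Hit("sampleB_7", "sampleC_3", identity=0.92, overlap=0.60),
    Hit("sampleA_1", "sampleC_9", identity=0.85, overlap=0.99),  # below 90% identity
]
genes = group_alleles(alleles, hits)
print(genes)
```

In this toy input, `sampleC_9` aligns to `sampleA_1` at only 85% identity, so it ends up as its own 'gene' even if it performs the same function. Lowering `min_identity` to 0.80 would merge all four alleles into one group, which shows how directly the operational definition depends on the chosen threshold.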

This is not a perfect solution, but we think it lets us move forward with the analysis, and we expect to learn much more in the future about what the concept of the 'gene' really means for microbiome research.