Summarizing genes by functional annotation

Sam Minot edited this page Jan 13, 2021 · 1 revision

For some projects it is helpful to extract a simplified summary of the proportion of gene copies in each specimen that carry a given functional annotation. These outputs may eventually be included in the base geneshot output; until then, we have provided a small utility to generate the summary files.

The entrypoint for this utility is gene_abund.nf, which can be run as follows:

    nextflow run Golob-Minot/geneshot/gene_abund.nf <ARGUMENTS>
    
    Options:
      --results_hdf         Location for results.hdf5 generated by geneshot
      --details_hdf         Location for details.hdf5 generated by geneshot
      --genes_fasta         Location for input 'genes.fasta.gz'
      --output_folder       Location for output files
      --output_prefix       Prefix for output files
      --query               Query string to use to subset eggNOG gene descriptions

The utility extracts every gene whose eggNOG description contains the --query string. For example, --query "Pectate lyase" will match the annotation Pectate lyase as well as Pectate lyase superfamily protein. When multiple annotations match the query string, each annotation is reported separately.
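The matching behavior described above can be sketched in a few lines of Python. This is only an illustration, not the code used by gene_abund.nf: it assumes a plain, case-sensitive substring test, and the function name and the example descriptions are hypothetical.

```python
# Illustrative sketch only: assumes --query is applied as a plain,
# case-sensitive substring test. The real utility may differ.
def matching_annotations(descriptions, query):
    """Return the distinct descriptions containing `query`, in first-seen order."""
    seen = []
    for desc in descriptions:
        if query in desc and desc not in seen:
            seen.append(desc)
    return seen


# Hypothetical eggNOG descriptions, one per gene
descriptions = [
    "Pectate lyase",
    "Pectate lyase superfamily protein",
    "Glycosyl hydrolase family 28",
    "Pectate lyase",
]

matches = matching_annotations(descriptions, "Pectate lyase")
print(matches)
# ['Pectate lyase', 'Pectate lyase superfamily protein']
```

Both annotations are kept as separate entries, which mirrors how each matching annotation is reported on its own rather than merged under the query string.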

The outputs generated by this utility are:

  • $OUTPUT_PREFIX.genes.csv: A CSV table with the annotations for all genes that match the --query string, including each gene's length, the CAG it is assigned to, and its taxonomic annotation
  • $OUTPUT_PREFIX.genes.fasta.gz: Amino acid sequences for all identified genes in FASTA format
  • $OUTPUT_PREFIX.long.csv.gz: A long-format CSV table listing the abundance of every gene across every specimen, including the depth of sequencing, number of reads aligned, coverage, etc.
  • $OUTPUT_PREFIX.manifest.csv: The manifest supplied by the user for this dataset, included to provide all required specimen annotations
  • $OUTPUT_PREFIX.wide.csv.gz: A wide-format CSV with a single row per specimen and a single column per eggNOG annotation. The value in each cell is the proportion of gene copies from the specimen which were given that annotation. In this wide-format summary, all genes with the same eggNOG annotation are combined to provide a single estimate
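To make the relationship between the long and wide tables concrete, here is a minimal Python sketch of collapsing per-gene values into one value per specimen and annotation. It assumes that "combined" means summing the per-gene proportions; the function name, the tuple layout, and the specimen names and values are all hypothetical and are not taken from the actual output files.

```python
from collections import defaultdict


# Illustrative sketch only: assumes genes sharing an annotation are
# combined by summing their per-specimen proportions.
def long_to_wide(rows):
    """Collapse (specimen, annotation, proportion) rows into nested dicts."""
    wide = defaultdict(lambda: defaultdict(float))
    for specimen, annotation, proportion in rows:
        wide[specimen][annotation] += proportion
    # Convert to plain dicts: one entry per specimen, one key per annotation
    return {specimen: dict(cols) for specimen, cols in wide.items()}


# Hypothetical long-format rows, one per gene per specimen
long_rows = [
    ("specimenA", "Pectate lyase", 0.25),
    ("specimenA", "Pectate lyase", 0.125),
    ("specimenA", "Pectate lyase superfamily protein", 0.5),
    ("specimenB", "Pectate lyase", 0.0625),
]

wide = long_to_wide(long_rows)
print(wide["specimenA"]["Pectate lyase"])
```

In this sketch the two "Pectate lyase" genes in specimenA contribute to a single cell, while "Pectate lyase superfamily protein" keeps its own column.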