CELEBRIMBOR

Core ELEment Bias Removal In Metagenome Binned ORthologs

A pipeline written in Snakemake to automatically generate pangenomes from metagenome assembled genomes (MAGs).

Dependencies:

  • Snakemake
  • MMseqs2
  • Bakta
  • Biopython
  • CheckM
  • Pandas
  • Rust toolchain
  • Panaroo

NOTE: Conda is used to call different environments and dependencies (see Snakemake file).

To install:

Install the required packages using conda/mamba:

git clone git@github.com:bacpop/CELEBRIMBOR.git
cd CELEBRIMBOR
mamba env create -f environment.yml
mamba activate celebrimbor

Download the required bakta database file:

bakta_db download --output /path/to/database

You can also use the light bakta database if using a suitable version of bakta:

bakta_db download --output /path/to/database --type light

Install cgt (this installs the cgt_bacpop executable in the ./bin directory):

cargo install cgt_bacpop --root .

Or to build from source:

git clone https://github.com/bacpop/cgt.git
cd cgt
cargo install --path "."

Running inside a container

If you have trouble with the installation above, you can use the CELEBRIMBOR Docker container instead. The whole pipeline is included in the container, so this route suits you if you are comfortable running commands inside Docker containers and mounting external files. Pull the image by running:

docker pull samhorsfield96/celebrimbor:main

To run within the container, use the command below, replacing <path to output dir> and <path to fasta dir> with absolute paths and changing other parameters as required:

docker run -v <path to output dir>:/output -v <path to fasta dir>:/data samhorsfield96/celebrimbor:main snakemake --cores 4 --config genome_fasta=/data output_dir=/output bakta_db=bakta_db/db-light cgt_exe=cgt_bacpop cgt_breaks=0.05,0.95 cgt_error=0.05 clustering_method=panaroo panaroo_stringency=moderate

Note: ensure that clustering_method and panaroo_stringency parameters are not in quotes.
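The two requirements above (absolute paths for the mounts, unquoted config values) can be sketched as a small Python helper that assembles the same command as an argument list. This is only an illustration: docker_command is a hypothetical name and not part of CELEBRIMBOR, and the image name, mount points and parameters are copied from the command shown above.

```python
from pathlib import Path


def docker_command(output_dir, fasta_dir, cores=4, **config):
    """Build the docker run command as a list of arguments.

    docker_command is a hypothetical helper, not part of CELEBRIMBOR.
    """
    # Docker bind mounts need absolute host paths, so resolve both.
    out = Path(output_dir).resolve()
    data = Path(fasta_dir).resolve()

    # Inside the container the mounts are always /data and /output.
    settings = {"genome_fasta": "/data", "output_dir": "/output"}
    settings.update(config)

    return [
        "docker", "run",
        "-v", f"{out}:/output",
        "-v", f"{data}:/data",
        "samhorsfield96/celebrimbor:main",
        "snakemake", "--cores", str(cores), "--config",
        # Each key=value pair is a separate argument with no quote characters.
        *[f"{key}={value}" for key, value in settings.items()],
    ]


cmd = docker_command(
    "results", "mags",
    bakta_db="bakta_db/db-light",
    cgt_exe="cgt_bacpop",
    cgt_breaks="0.05,0.95",
    cgt_error=0.05,
    clustering_method="panaroo",
    panaroo_stringency="moderate",
)
print(" ".join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would run it without going through a shell, so no quoting is involved.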

Quick start:

Update config.yaml to specify workflow and directory paths.

  • core: gene frequency cutoff for core genes; any gene above this frequency is annotated as a core gene.
  • output_dir: path to output directory. Does not need to exist prior to running.
  • genome_fasta: path to directory containing fasta files (must have .fasta extension).
  • bakta_db: path to bakta db downloaded above.
  • cgt_exe: path to cgt executable.
  • cgt_breaks: frequency cutoffs for rare and core genes, e.g. 0.1,0.9, meaning genes predicted at <0.1 frequency will be rare, 0.1<=x<0.9 will be middle and >=0.9 will be core.
  • cgt_error: sets false assignment rate of gene to particular frequency compartment.
  • clustering_method: choice of either mmseqs2 (for speed) or panaroo (for accuracy).
  • panaroo_stringency: Stringency of Panaroo quality control measures. One of strict, moderate or sensitive.
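Before launching a long run, it can help to sanity-check these values. The following is a minimal sketch in Python, not part of the pipeline: the function names are hypothetical, the allowed values are taken from the list above, and checking panaroo_stringency only when clustering_method is panaroo is an assumption.

```python
from pathlib import Path

# Allowed values, taken from the parameter descriptions above.
CLUSTERING_METHODS = {"mmseqs2", "panaroo"}
PANAROO_STRINGENCIES = {"strict", "moderate", "sensitive"}


def parse_breaks(value):
    """Parse a cgt_breaks value such as "0.05,0.95" into (rare, core)."""
    parts = [float(part) for part in str(value).split(",")]
    if len(parts) != 2:
        raise ValueError("expected exactly two comma-separated values")
    rare, core = parts
    if not 0.0 <= rare < core <= 1.0:
        raise ValueError("expected 0 <= rare < core <= 1")
    return rare, core


def fasta_files(directory):
    """List the files in directory that have the required .fasta extension."""
    return sorted(Path(directory).glob("*.fasta"))


def check_config(config):
    """Return a list of problems found in a config dict (empty if none)."""
    problems = []
    try:
        parse_breaks(config.get("cgt_breaks", ""))
    except ValueError as err:
        problems.append(f"cgt_breaks: {err}")
    method = config.get("clustering_method")
    if method not in CLUSTERING_METHODS:
        problems.append("clustering_method must be mmseqs2 or panaroo")
    # Assumption: the stringency setting only matters when Panaroo is used.
    if method == "panaroo" and config.get("panaroo_stringency") not in PANAROO_STRINGENCIES:
        problems.append("panaroo_stringency must be strict, moderate or sensitive")
    return problems


example = {
    "cgt_breaks": "0.05,0.95",
    "clustering_method": "panaroo",
    "panaroo_stringency": "moderate",
}
print(check_config(example))  # prints []
```

Files with other extensions (for example .fa or .fna) are not returned by fasta_files, which mirrors the .fasta requirement stated for genome_fasta.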

Run snakemake (must be in same directory as Snakemake file):

snakemake --cores <cores>

To test running of the workflow, download this repository, replace path/to with actual paths, and run:

snakemake --cores 1 --config genome_fasta=test/fasta output_dir=test_output bakta_db=path/to/bakta_db/db-light cgt_exe=path/to/cgt_bacpop cgt_breaks=0.05,0.95 cgt_error=0.05 clustering_method=panaroo panaroo_stringency=moderate

This test directory contains simulated MAGs from Kallonen et al.

The output directory test_output will contain:

  • annotated directory, containing gene annotations from bakta.
  • mmseqs2 or panaroo directory, containing gene clusters from mmseqs2 or Panaroo respectively.
  • presence_absence_matrix.txt, a tab-separated file describing the presence/absence of genes (rows) in each genome (columns).
  • pangenome_summary.tsv, a tab-separated file detailing gene annotations, frequencies and pre-adjustment frequency compartments in the pangenome.
  • checkm_out.tsv, a summary file generated by CheckM describing genome completeness and contamination.
  • cgt_output.txt, a summary file detailing the observed frequency and adjusted frequency compartment of each gene in the pangenome.
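To illustrate how the presence/absence matrix relates to the pre-adjustment compartments, the sketch below computes observed gene frequencies and applies the cgt_breaks rule described in the configuration section. It rests on assumptions about the file layout that are not confirmed here: a header row of genome names, the gene ID in the first column, and cells where an empty value or 0 means absent. The function names are hypothetical.

```python
import csv
import io


def gene_frequencies(handle):
    """Map gene ID -> fraction of genomes in which the gene is present."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    n_genomes = len(header) - 1  # assumption: first column holds the gene ID
    if n_genomes < 1:
        raise ValueError("matrix has no genome columns")
    freqs = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, cells = row[0], row[1:]
        # Assumption: an empty cell or "0" means the gene is absent.
        present = sum(1 for cell in cells if cell.strip() not in ("", "0"))
        freqs[gene] = present / n_genomes
    return freqs


def compartment(freq, rare_break, core_break):
    """Apply the cgt_breaks rule: <rare is rare, >=core is core, else middle."""
    if freq < rare_break:
        return "rare"
    if freq < core_break:
        return "middle"
    return "core"


# Tiny made-up matrix: three genes (rows) across four genomes (columns).
demo = (
    "gene\tg1\tg2\tg3\tg4\n"
    "geneA\t1\t1\t1\t1\n"
    "geneB\t1\t0\t0\t0\n"
    "geneC\t1\t1\t0\t0\n"
)
freqs = gene_frequencies(io.StringIO(demo))
labels = {gene: compartment(f, 0.3, 0.9) for gene, f in freqs.items()}
print(labels)  # {'geneA': 'core', 'geneB': 'rare', 'geneC': 'middle'}
```

For a real run, pass an open file handle for presence_absence_matrix.txt instead of the io.StringIO demo. These labels use observed frequencies only; the adjusted compartments reported in cgt_output.txt also account for genome completeness.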

Overview of workflow

This workflow annotates genes in metagenome-assembled genomes (MAGs) and uses a probabilistic model to assign each gene to a gene frequency compartment based on its observed frequency and the completeness of the genomes.

  1. Predict genes in all FASTA files in given directory using bakta
  2. Cluster genes using mmseqs2 or Panaroo and generate a gene presence/absence matrix
  3. Generate a pangenome summary of observed gene frequencies
  4. Calculate genome completeness using CheckM
  5. Probabilistically assign each gene family as core|middle|rare using cgt
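To see why genome completeness enters the final step, consider a deliberately simplified toy model. It is not the model implemented in cgt: it assumes each genome independently retains any given gene with probability equal to its completeness, expressed as a fraction between 0 and 1. Under that assumption, a gene's expected observed frequency is its true frequency multiplied by the mean completeness.

```python
def expected_observed_frequency(true_freq, completeness):
    """Expected observed frequency of a gene under the toy model.

    Toy assumption (not cgt's model): a gene that is truly present in a
    genome is detected with probability equal to that genome's completeness.
    """
    if not completeness:
        raise ValueError("need at least one completeness value")
    mean_completeness = sum(completeness) / len(completeness)
    return true_freq * mean_completeness


# Four hypothetical MAGs with completeness given as fractions (0 to 1).
completeness = [0.9, 0.8, 0.7, 0.6]

# A gene truly present in every genome (true frequency 1.0).
observed = expected_observed_frequency(1.0, completeness)
print(round(observed, 2))  # 0.75
```

With cgt_breaks=0.05,0.95, a gene observed at a frequency of about 0.75 would fall in the middle compartment even though it is present in every underlying genome. This is the kind of bias the completeness-aware assignment is meant to correct.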

Citations

When using CELEBRIMBOR, please cite:
