pesica: S. Infantis pESI Cluster Analysis

Work in progress

Pesica is a collection of tools to detect and extract pESI plasmid contigs from Salmonella Infantis genomes, and to assign them to clusters based on presence/absence of genes in an existing pangenome. There are three steps in the pipeline:

Screening: Detect the presence of marker genes in the input genomes that indicate the presence of the pESI plasmid
Extraction: If pESI is detected in the previous step, extract the plasmid contigs from the input genomes and annotate them
Ordination: Given a plasmid genome, insert it into an existing pangenome and cluster it into that pangenome.

Setup

Pesica is a collection of snakemake workflows collected together with a command line interface. It is required to use conda (and recommended to use mamba) to use the pipeline in order to manage the various virtual environments.

Create the pesica conda environment: conda/mamba env create -f env.yml
Install the light and full dbs from bakta
- See the database download section of the Bakta README for instructions
- If you already have an existing bakta database, you can skip this step
Edit workflow/config.yml. For basic uses of the workflow, you'll only need to edit the following:
- input_directory is the directory that contains the genomes that you would like to run the pipeline on.
  - For now, the pipeline has only been tested on gzipped files (ie, those ending in gz, not tar.gz). It should work for uncompressed fastas, but this hasn't been tested.
- run_name can be whatever name you choose to help organize the workflow products inside the outputs directory.
- Edit the light_db and full_db paths in the bakta section with the path where you downloaded the bakta directories to. The light_db and full_db values should be absolute paths ending in db-light and db.
  - Example shared/databases/db-light or shared/databases/db
- You can leave the other values alone.
Activate the virtual environment: conda activate pesica
Run the setup script: python pesica setup
- This will extract the included blastdb, download the required bakta databases, and set up the virtual environments used in the rest of the pipeline

Example usage

After setting the config values and activating the pesica virtual environment as described above, you should be able to run pesica using:

python pesica <subcommand> <options>

Here, subcommand is one of screen, extract, or ordinate (Note: for now, only screen and extract is implemented). Each subcommand (and pesica itself) has a help page that you can access with --help or -h:

python pesica --help
# To get help on the screen command:
python pesica screen -h

Screening the input genomes for the pESI plasmid

To get started, run the screen command in dry-run mode by doing:

python pesica screen --dry-run 
# Alternatively:
python pesica screen -n

This will run the screen workflow in dry-run mode, which will show you the result of running the pipeline on the inputs without actually running it. If this exits without error, you can then run the pipeline by repeating the command without the dry run flag:

python pesica screen

Assuming everything completes correctly, you should see an output directory outputs/<run_name>/screen, where run_name is the name that was set in the config.yml file. This directory contains the following files:

<genome_name>.bakta - A directory containing bakta outputs for each input genome. In particular, it contains <genome_name>.tsv, the gene annotations from bakta that will be parsed downstream
pesi_presence.tsv - a tsv with a row for each input genome, with columns:
- genome - the name of the genome
- has_repA - whether or not the pESI repA gene was detected in the genome
- pesi_gene_count - the number of pESI marker genes detected in the genome
- has_pesi - whether or not the genome is predicted to contain pESI. This value is True if pESI repA was detected in the genome and there were at least 2 pESI marker genes detected in the genome
genomes_with_pesi.txt - a list of the genomes that were predicted to contain pESI

Extracting pESI plasmid genomes

After running this first step, we can now extract the plasmid genomes from the input genomes where pESI was detected with the extract subcommand. We first check that all the required files are present with a dry-run:

python pesica extract --dry-run

After checking for errors, we can run it with:

python pesica extract

This will output several files that are detailed in the extract subcommand details, but the most important ones for the next steps are:

outputs/<run_name>/pesi_genomes - a directory containing the plasmid genomes extracted from the input genomes
outputs/<run_name>/pesi_annotations - directories containing the bakta annotations for the extracted plasmid genomes, which will be used in the ordinate step.

Assigning the extracted pESI genome to a cluster

To be implemented.

Notes

Since these are just snakemake workflows with a basic CLI, the pipeline needs to be run from inside this directory.
If you are familiar with snakemake, you can skip the CLI and run the workflow directly with your own commands and options. The commands should look something like snakemake --use-conda --snakefile workflow/rules/<subcommand>.smk, but you can add whatever options you want besides that.
The screen step isn't too resource intensive, but the extract step can be. It is recommended to run it with more cores, such as through an interactive session (eg, salloc -n 32). Job submission to slurm will be added later.

Workflow details

Screen

Searches input sequences for pESI contigs.

Extract

For those sequences found to contain pESI, the plasmid-only contigs will be extracted

Ordinate

After extracting the pESI plasmid DNA, ordinate them into the existing clusters found in the paper

Pull pESI out
Annotate pESI with bakta
Add pESI into existing pangenome
- This creates a gene presence absence vector
Figure out n nearest neighbors in pangenome space
In tSNE space, average the coordinates of those nearest neighbors
After putting it into tSNE space, assign cluster number based on tSNE nearest neighbors

Programs and tools

Snakemake - workflow management
Click - command line interface
Bakta - gene annotation
BLAST - pESI gene search
Panaroo - pangenome insertion and clustering

Upcoming features

Documentation site
HPC/Slurm support

FAQ

I get "OSError: [Errno 39] Directory not empty: 'envs'" when I run the pipeline

This error appears to mostly occur when a dry-run is attempted (ie, snakemake -np). This error should not impact the actual runs of the pipeline.

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
resources		resources
workflow		workflow
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
env.yml		env.yml
pesica		pesica

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

resources

resources

workflow

workflow

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

env.yml

env.yml

pesica

pesica

Repository files navigation

pesica: S. Infantis pESI Cluster Analysis

Contents

Setup

Example usage

Screening the input genomes for the pESI plasmid

Extracting pESI plasmid genomes

Assigning the extracted pESI genome to a cluster

Notes

Workflow details

Screen

Extract

Ordinate

Programs and tools

Upcoming features

FAQ

I get "OSError: [Errno 39] Directory not empty: 'envs'" when I run the pipeline

About

Releases

Packages

Languages

License

pommevilla/pesica

Folders and files

Latest commit

History

Repository files navigation

pesica: S. Infantis pESI Cluster Analysis

Contents

Setup

Example usage

Screening the input genomes for the pESI plasmid

Extracting pESI plasmid genomes

Assigning the extracted pESI genome to a cluster

Notes

Workflow details

Screen

Extract

Ordinate

Programs and tools

Upcoming features

FAQ

I get "OSError: [Errno 39] Directory not empty: 'envs'" when I run the pipeline

About

Topics

Resources

License

Stars

Watchers

Forks

Languages