jEUKebox: Simulating metatranscriptomes for environmental eukaryotic microbes

Pronounced jukebox

Protistan microbial communities: the forgotten validation in metatranscriptome assessment

Protistan communities are diverse. They are also not generally the focus of assessment and validation procedures for 'omics techniques. Bacteria lead the way as part of the "pack" of microbes that need to be assembled, identified, and otherwise validated. This is because of the relative simplicity (speaking loosely) of bacterial genomes and transcriptomes, but also because of the existing research that already supports prokaryoticgenetic assessment. This is particularly poignant due to the role of bacteria in human disease and health, which makes them a priority for study.

However, metatranscriptomes (among other tools) are increasingly being applied to eukaryote-dominated environmental communities. Though validation has been performed using other genetic tools, e.g. PCR approaches, or through microscopic counts, there is an urgent need for general validation of omics tools for assembling and annotating eukaryote-dominated metatranscriptomes.

For this reason, we have developed this companion tool to the eukrhythmic microbial eukaryote metatranscriptome analysis pipeline!

How to run the pipeline

In order to run the pipeline, you need to have installed:

Snakemake > 7
mamba

The remainder of the dependent packages required will be downloaded for you during execution.

jEUKebox is written in Snakemake. The workflow can be executed just like an ordinary Snakemake workflow. The Snakefile is named jEUKebox, and lists all of the outputs expected from running the pipeline, calls upon methods written in the rules folder, which handles the execution of individual software tools needed to create the output files.

To run jEUKebox, one would execute:

snakemake --cores <number-of-cores> -s jEUKebox

This would run the pipeline locally. In order to execute the pipeline on a high-performance computing cluster, users can modify the file in submit/snake-submit.sh. This is designed for a SLURM HPC manager, but can be adapted to meet the needs of other systems.

The configuration file

The configuration file allows the user to customize the pipeline to suit the particular collection of organisms of interest and their abundance in the communities.

transcriptome_selections: files can be specified individually in list form; if this field is included in the configuration file, "protein_files" should have the same length.
protein_selections: analogous protein files for the transcriptomes in the first configuration file field. Alternatively, both of these fields can be circumvented by providing a directory.
community_spec: a CSV file containing important specification information for each of the generated communities. See the section below for more detailed information on how to set up this file.
outputdir: the base directory where the output of the processing pipeline should be stored.

The `community_spec` file in the configuration

This file is the way to control the desired community composition of the generated assemblies. Note that this does not control the raw reads that might come from each of the contigs (its representation in the sample), which is done based on an exponential distribution for the various transcripts which were simulated as having been sequenced, weighted by the abundance of each organism.

There is a default csv file included in the files directory of the repository, but this can be created manually using the following column fields:

Community - the number of each community to be included (should be distinct).
NumberOrganisms - the number of distinct organisms to be included in each community, from the nucleotide and protein files as specified in the configuration.
ProportionCommunity - a list specifying the proportion of contigs within its community that each organism should represent in the simulated assembly.
NumberHighSimilarity - a number (less than total NumberOrganisms for each community row) of the organisms within the community that should have high ANI similarity. The most similar organisms per ANI similarity will be chosen first.
GroupsHighSimilarity - the number of groups with high similarity that are expected to be found. This allows for the case in which you wish to have multiple high-similarity groups within your data, but want those two groups to be relatively unrelated to one another. The NumberHighSimilarity value should be divisible by the GroupsHighSimilarity value, else the total number of organism in the community will be reduced to accommodate even divisibility into the group size.

In the future, we intend to create a mode where users can just explicitly specify the output community file, such that the intermediate step is bypassed. If this feature is of interest to you, please submit an issue on the Issues tab describing your use case, and we'll get back to you as soon as we can!

Note: if there is no preference for either or both of NumberHighSimilarity or GroupsHighSimilarity, they can be set to -1 and will be ignored.

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
envs		envs
files		files
rules		rules
scripts		scripts
submit		submit
LICENSE		LICENSE
README.md		README.md
cluster.yaml		cluster.yaml
config.yaml		config.yaml
config_commA.yaml		config_commA.yaml
jEUKebox		jEUKebox

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

envs

envs

files

files

rules

rules

scripts

scripts

submit

submit

LICENSE

LICENSE

README.md

README.md

cluster.yaml

cluster.yaml

config.yaml

config.yaml

config_commA.yaml

config_commA.yaml

jEUKebox

jEUKebox

Repository files navigation

jEUKebox: Simulating metatranscriptomes for environmental eukaryotic microbes

Protistan microbial communities: the forgotten validation in metatranscriptome assessment

How to run the pipeline

The configuration file

The `community_spec` file in the configuration

About

Releases

Packages

Languages

License

AlexanderLabWHOI/jEUKebox

Folders and files

Latest commit

History

Repository files navigation

jEUKebox: Simulating metatranscriptomes for environmental eukaryotic microbes

Protistan microbial communities: the forgotten validation in metatranscriptome assessment

How to run the pipeline

The configuration file

The community_spec file in the configuration

About

Resources

License

Stars

Watchers

Forks

Languages

The `community_spec` file in the configuration