draft/outline: Loading/finding software #28

Open
ctb opened this issue Jul 3, 2023 · 0 comments

Loading and finding software to execute in shell: blocks

snakemake workflows typically run scripts and 3rd party programs via shell: blocks, but in order to run programs, snakemake needs to know how and where to find them, which may mean loading or activating software environments. In this chapter we'll cover the conceptual and practical issues involved!

The two main approaches you can use are these: either you can run snakemake in an environment that already contains all the software you want, or you can activate/deactivate software on a rule-by-rule basis. You can also mix and match these two approaches freely!

Why use "software environments"?

To cover: your default PATH / LD_LIBRARY_PATH, and how software install basically works; the modules and conda systems; how to examine and/or test software for loading with which, type, and dummy/smoke-test execution. Also discuss why isolated software environments are

(TODO: incorporate some of the intro discussion from this lesson on conda in this more detailed introduction for this chapter)
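As a rough sketch of the "examine and smoke-test" idea above: the snippet below asks the shell where it would find a program and then runs it once with harmless arguments. The helper name `check_tool` is hypothetical (it is not part of snakemake), and `ls` is only a stand-in for whatever program you actually care about. It uses `command -v` and `type`; `which` gives similar information but is a separate program that is not installed everywhere.

```shell
# check_tool: hypothetical helper, not part of snakemake.
# Usage: check_tool PROGRAM [harmless arguments for a smoke test...]
check_tool() {
    tool="$1"
    shift
    # `command -v` prints what the shell would run for this name, if anything.
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: not found on PATH" >&2
        return 1
    fi
    echo "$tool resolves to: $(command -v "$tool")"
    # `type` also says whether the name is a file, builtin, alias, or function.
    type "$tool"
    # Smoke test: run the program once and look only at the exit status.
    if "$tool" "$@" >/dev/null 2>&1; then
        echo "$tool: smoke test passed"
    else
        echo "$tool: smoke test FAILED" >&2
        return 1
    fi
}

# `ls /` is a stand-in; for real software something like `mytool --version`
# (hypothetical) is a typical smoke test.
check_tool ls /
```

Running the same check before and after loading a module or activating a conda environment is a quick way to see what the activation actually changed.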

Before you get started

First things first: you'll need to be able to install whatever software you're trying to use! Snakemake (with only one exception noted below) will not magically solve software installation problems for you. So, first, make sure you can install and use whatever software you're trying to run.

A first approach: running snakemake in your default environment.

shell: commands inherit the environment that snakemake runs in, so if you have installed or activated software in the shell running snakemake, then the commands run by snakemake should run in that same environment.

Or, to put it more simply: if you can run your software yourself, snakemake should be able to run it too!
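Here is a minimal sketch of that inheritance, without snakemake itself: a child `bash` process stands in for a shell: block, and a throwaway script called `mytool` (a hypothetical name) stands in for software you have installed or activated. The assumption is that snakemake launches shell: commands as child processes of the shell you ran it from, as described above.

```shell
# Stand-in for "installing/activating software": create a tiny program
# in a temporary directory and put that directory on PATH.
demo_dir="$(mktemp -d)"
printf '#!/bin/sh\necho "hello from mytool"\n' > "$demo_dir/mytool"
chmod +x "$demo_dir/mytool"
export PATH="$demo_dir:$PATH"

# A child shell inherits the exported PATH, so it can find mytool
# without doing any setup of its own.
child_output="$(bash -c 'mytool')"
echo "$child_output"
```

If the child shell could not find `mytool`, the usual suspects would be a PATH that was changed but not exported, or a startup file that resets it.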

This makes testing straightforward and is a good default situation.

It also means that there is no "wrong" way to install software. As long as you can get it installed and running at your shell prompt, you can use it in snakemake! So if, for example, some software has to be installed in a custom manner in your account, that's fine - just install it and get it to the point where you can use it outside of snakemake, and then you should be able to use it with snakemake, too.

(CTB: what are exceptions here? Maybe if your .bashrc resets your PATH? Check/test this.)

However, you can also have software that is only used for specific commands. Read on!

An alternate approach: rule-specific environments

An alternative to having all the software available in your default shell is to use rule-specific environments. Before we tell you how, let's discuss why you might need or want to do this!

There are two main situations where rule-specific software environments are useful. The first is where you need to use software that is incompatible with software used in other rules. For example, if you need a specific version of R or Python for a particular step, but want to use a different version for other steps, you will need to switch between multiple installations. The other situation where rule-specific environments are useful is when you want to "pin" the version of software you're using for reproducibility purposes.

In both these cases you will have at least one rule that is adjusting your PATH/LD_LIBRARY_PATH to run a specific piece (or suite) of software.

Conveniently, this is also very easy to do with snakemake! The short version is this: place the necessary activation or loading commands needed to use the software in front of the actual software commands in a shell: block, and go on your merry way!

A slightly more involved explanation

Each set of instructions in a shell: block runs in a single bash shell environment via a subshell. So you can run module load or conda activate at the beginning of your shell commands in order to load your software. The subshell is terminated at the end of the shell: block, so any changes to your PATH and LD_LIBRARY_PATH are discarded and do not "leak" into other rules; this also means you don't need to do any cleanup (unloading or deactivation).
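The isolation described above can be sketched outside of snakemake. In this example a `bash -c` call stands in for one shell: block, and the directory `/opt/hypothetical-tool/bin` is a made-up path standing in for whatever a module or conda activation would add. It only illustrates the subshell behavior; it does not show snakemake's own internals.

```shell
path_before="$PATH"

# Stand-in for one shell: block: modify PATH, then "run the software"
# (here we just print the PATH the software would see).
inside="$(bash -c 'export PATH="/opt/hypothetical-tool/bin:$PATH"; echo "$PATH"')"

# Back in the parent: the change made inside the child shell is gone.
path_after="$PATH"

echo "inside the block: $inside"
echo "after the block:  $path_after"
```

In a real rule the first command in the block would be the activation itself, followed by the actual program. One practical caveat: in a non-interactive shell, `conda activate` often needs conda's shell hook to be loaded first (for example with `eval "$(conda shell.bash hook)"`), so test the exact activation line at a plain bash prompt before relying on it.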

(CTB: check that bash is actually the default! Can it be overridden? how about on a case by case basis?)

(CTB: talk about docker/singularity containers here, too).

conda: blocks and --use-conda

snakemake also supports rule-specific environments where it will manage the loading and unloading of conda environments, along with the one-time creation of the environments. It does this via a conda environment file referenced in a conda: block, and the accompanying command line option --use-conda.

(CTB: more!)

A happy medium: mixing the two approaches

You can also mix and match the solutions above! A simple strategy would be to use an enclosing environment that contains most of the software used by a snakemake workflow, and then implement rule-specific environments for the subset of rules that need them.

So, you can have some rules that don't activate or load anything, but use the default enclosing environment; some rules that use rule-specific conda environments or docker containers; and some rules that do custom environment modification.

Special notes for scripts and scripting languages (R and Python)

You can also use specific versions of R and Python, with specific R/Python packages installed.

(Add more motivation here)

Our recommended way to do this is to install the version of the R or Python interpreter you need in, e.g., a conda environment, and then install whatever packages or libraries you need into that same environment. They will then be usable from that environment while remaining properly isolated from other system or environment installs.

CTB: use #! /usr/bin/env python or #! /usr/bin/env Rscript at the top of scripts.

Talk about environment specific one-time install/setup/config stuff.

More advanced stuff

how conda works/complete stack of shared libraries.
(also discuss/mention wrappers, as a more advanced topic)
