
Reviews #1

Open
1 task
johanneskoester opened this issue May 22, 2017 · 13 comments

Comments

@johanneskoester
Contributor

Please post below to request a review.

TODO

  • some example workflow
@jrderuiter

jrderuiter commented Jun 9, 2017

I wrote an example RNA-seq pipeline for generating gene expression counts here: https://github.com/jrderuiter/snakemake-rnaseq-star-featurecounts. Would be interested to hear your thoughts/comments concerning the structure/code. I've used wrappers where available, though wrappers for the other tools should be easy to implement.

@johanneskoester
Contributor Author

This looks very good! Apart from some very minor things, it is exactly what a perfect Snakemake workflow should look like. Do you intend to move it over here?

  1. At least in one file, there is a superfluous path.join.
  2. We should indeed aim to have a wrapper for all common tools.
  3. Rules that use a wrapper don't need to be included in the conda environment. When running Snakemake with --use-conda, wrappers automatically get their own env. For rules that are too special for a wrapper, you can use the conda directive. See here.
  4. I am not sure yet what the canonical way to configure the fastq files should be. If the paths to the fastq files were stored in the samples.tsv, they could have non-standard names or live anywhere on the system. The upside of requiring a fixed relative path, as you have it, is that everything can automatically be made sustainable and portable via snakemake --archive and Zenodo, see here. However, for publication, one might rather want to point to fastqs via a URL to files in SRA. It would be nice to support a simple transition between local and SRA, e.g. via the config file.
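Point 3 might look roughly like the following sketch; the rule names, file paths, and wrapper version are invented for illustration, not taken from the workflow under review:

```python
# Hypothetical Snakefile fragment: a wrapper-based rule brings its own
# conda environment under --use-conda, while a rule that is too special
# for a wrapper points at its own environment via the conda directive.

rule samtools_sort:
    input:
        "mapped/{sample}.bam"
    output:
        "sorted/{sample}.bam"
    wrapper:
        # the wrapper ships its own env, created automatically with --use-conda
        "0.17.4/bio/samtools/sort"

rule special_step:
    input:
        "sorted/{sample}.bam"
    output:
        "stats/{sample}.txt"
    conda:
        # rule-specific environment for a tool without a wrapper
        "envs/special.yaml"
    shell:
        "my_tool {input} > {output}"
```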

@jrderuiter

jrderuiter commented Jun 12, 2017

Thanks! Sure, moving here would be fine. I'll have a look at your comments (especially point 4; I agree it would be ideal if this also accepted URLs and such).
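One way the local-vs-SRA transition from point 4 could work is sketched below; the column names, table layout, and accession pattern are assumptions for illustration, not part of any existing workflow:

```python
# Hypothetical sketch: samples.tsv lists either a local fastq path or an
# SRA accession, and a small helper decides how each sample's reads
# should be fetched. Uses only the standard library.
import csv
import io
import re

# Made-up example table; in a real workflow this would be read from disk.
SAMPLES_TSV = """\
sample\tfastq
A\tdata/raw/A.fastq.gz
B\tSRR1234567
"""

def resolve_fastq(entry):
    """Return ('sra', accession) for SRA-style accessions, else ('local', path)."""
    if re.fullmatch(r"[SED]RR\d+", entry):
        return ("sra", entry)
    return ("local", entry)

sources = {
    row["sample"]: resolve_fastq(row["fastq"])
    for row in csv.DictReader(io.StringIO(SAMPLES_TSV), delimiter="\t")
}
print(sources)
```

A rule's input function could then dispatch on the first element of the tuple, downloading from SRA when needed and using the local path otherwise.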

One other thing I was thinking about is the flexibility of the output paths. For most of my projects, I use the cookiecutter data science template (https://github.com/drivendata/cookiecutter-data-science) as a starting point, which I think provides a nice default project structure. One nice thing about this template is that it splits interim and 'final' processed files into two separate directories (data/interim and data/processed). I like this a lot, because it distinguishes interim steps (whose data can be discarded) from the final processed outputs.

In my own pipelines, I therefore typically distinguish between the interim and processed directories (on which I base all rule paths, and which are configurable via the config file/command line). For example, my snakemake calls typically look something like this:

	snakemake -s pipelines/exome.snake \
	    --configfile ./configs/exome.yml \
	    --config samples=data/raw/samples.exome.txt \
	             interim_dir=data/interim/exome \
	             processed_dir=data/processed/exome \
	             qc_dir=qc/exome \
	             log_dir=logs/exome

I think it would be great if we could include some flexibility in the output paths, but I'm unsure of the best way to do so. To some extent this can be done at runtime by changing the workdir for Snakemake, but that does not allow us to distinguish between interim and processed output directories. I also don't want to introduce too much extra complexity into the pipelines.

What are your thoughts on this?

@jrderuiter

Maybe a stupid question, but how are software versions managed with wrappers and conda? Are versions of tools hardcoded in the wrapper, or can I specify which version of the tool should be used?

@johanneskoester
Contributor Author

By default, the wrapper uses a hardcoded conda environment (see here). By this, we ensure stability. If this needs to be overridden, the user can specify a custom conda environment within the rule that uses the wrapper. This will always take precedence over the one that comes with the wrapper.
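The override described above might look like this in a rule; the wrapper version, paths, and environment file are placeholders:

```python
rule bwa_mem:
    input:
        reads=["reads/{sample}.1.fastq", "reads/{sample}.2.fastq"]
    output:
        "mapped/{sample}.bam"
    conda:
        # custom env, pinning a specific tool version; takes precedence
        # over the environment shipped with the wrapper
        "envs/bwa-pinned.yaml"
    wrapper:
        "0.17.4/bio/bwa/mem"
```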

Regarding the directories:

While I also like the clean distinction, I think the data-science approach of interim and final directories is a bit too inflexible for Snakemake. With support for temp files, you don't really need to distinguish output files like that in Snakemake: interim files can simply be declared as temporary. Also, users of our workflows might have different ideas of what counts as interim. Sometimes a BAM is needed for further processing or is even a result of its own; sometimes it can be discarded. Since the idea is that people check out the repo and possibly modify the workflow, we should probably use neutral directory names that describe the workflow step rather than how the file will be used.
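The temp-file mechanism mentioned above, sketched with invented rule names and paths: the intermediate BAM is deleted automatically once every rule consuming it has finished.

```python
rule map_reads:
    input:
        "reads/{sample}.fastq"
    output:
        # interim file: removed after all downstream rules have used it
        temp("mapped/{sample}.bam")
    shell:
        "bwa mem ref.fa {input} | samtools view -b - > {output}"

rule call_variants:
    input:
        "mapped/{sample}.bam"
    output:
        # final result: kept
        "calls/{sample}.vcf"
    shell:
        "my_caller {input} > {output}"
```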

@johanneskoester
Contributor Author

@jrderuiter is your workflow ready for moving here? It does not need to be perfect, we can mark it as WIP.

@johnne

johnne commented May 23, 2020

I've been developing a workflow for metagenomic projects located here: https://github.com/NBISweden/nbis-meta. It supports paired- and single-end data and includes preprocessing, read-based classification, assembly and binning as well as functional and taxonomic annotations. Any input is appreciated. Thanks!

@SilasK

SilasK commented Aug 19, 2020

Hello, I would like to contribute my workflow to the snakemake-workflows.
Could you add me to the team?

@johanneskoester
Contributor Author

johanneskoester commented Nov 25, 2020

Hey @johnne and @SilasK, sorry for being so late on this. I completely missed the notifications. Very nice pipelines! Both of them could be included here relatively quickly, I think.

@johnne:

  • I'd vote for writing down file paths without the opj helper. It limits readability compared to plain strings.
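The suggested change, with made-up paths (opj here stands for os.path.join):

```python
# before: the helper hides the path structure
output:
    opj(config["results_dir"], "assembly", "{sample}", "final_contigs.fa")

# after: a plain string shows the full path at a glance
output:
    config["results_dir"] + "/assembly/{sample}/final_contigs.fa"
```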

@SilasK

  • I'd suggest to move all conda packages from dependencies.yml into rule-specific environments. This improves provenance information in the reports and makes it easier to discover which software is used in which step.
  • Please consider annotating final results with the report() flag, so that a self-contained report can be created from your pipeline.
  • Please move testing from Travis to GitHub Actions. This makes it more consistent with the other workflows here, and also more future-proof.
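The first two suggestions combined in one hypothetical rule (file names, caption, and category are invented):

```python
rule plot_taxonomy:
    input:
        "results/taxonomy/{sample}.tsv"
    output:
        # report() marks the file for inclusion in the self-contained report
        report("results/plots/{sample}.taxonomy.svg",
               caption="report/taxonomy.rst",
               category="Taxonomy")
    conda:
        # rule-specific env instead of one big dependencies.yml
        "envs/plotting.yaml"
    script:
        "scripts/plot_taxonomy.py"
```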

Both

  • If you haven't yet, please run snakemake --lint on your pipelines.

@johnne

johnne commented Nov 25, 2020

Thanks @johanneskoester!
Yes, the opj code is an old leftover that I've been putting off removing for a long time, but I completely agree. I'll get to work on that.

I've run linting on the workflow, which did help a lot (thanks for that functionality!), but I'm afraid the workflow doesn't conform 100% to the linting yet. Mostly it's missing log directives for rules using the script: directive. I'll work on updating that.
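The lint fix in question might look like this (names invented); inside the script, the declared log file is then available as snakemake.log[0]:

```python
rule summarize:
    input:
        "results/{sample}.tsv"
    output:
        "summaries/{sample}.txt"
    log:
        # satisfies the linter; the script can write to snakemake.log[0]
        "logs/summarize/{sample}.log"
    script:
        "scripts/summarize.py"
```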

@nikostr

nikostr commented Apr 30, 2021

Hey! I've tried my hand at setting up a DeepVariant-GLnexus workflow. Would love to have it as a part of this repo. It has unit and integration tests. Genome fetching, mapping, etc. are done similarly to the dna-seq-gatk-variant-calling workflow. I'm planning on moving this to the copier template system as well.

With the latest update to the GATK workflow (fixing paths etc.), this workflow could in turn probably also be significantly refactored to draw more on the GATK one.

@LeilyR

LeilyR commented Jan 31, 2022

Hi @johanneskoester, I would like to add our pipeline to the Snakemake catalog. I am not entirely certain whether it meets your criteria, though. For example, the Snakefile is not directly under the workflow folder; the reason is that the repository provides more than a single pipeline and is a collection of several. Do you have any feedback on how to get it added there?

@Sagityq

Sagityq commented Jan 31, 2022 via email


7 participants