Gailinpease/snakemake proof of concept #283

Draft · wants to merge 1 commit into development

Conversation

gailin-p (Collaborator)

This PR demonstrates how we might use Snakemake as a workflow manager by implementing a snakemake rule for the final step in the pipeline (calculating consumed emissions).

Why snakemake?

  • In active development, with an active user community
  • Open source and free to use (so any OGE user can use it)
  • Supports running both locally and in the cloud using the same rules (though configuring it to run on AWS would require some work)
  • By default, decides which jobs need to be run based on file-change timestamps (for both source code and data). This means that as we make changes to the pipeline, Snakemake will detect which parts of the pipeline need to be re-run, and we can avoid re-computing expensive but rarely-changed tasks (e.g., 930 cleaning) without having to hardcode that logic.

How does it work?

  1. Define rules that describe which inputs a step takes, what outputs it produces, and what to run to produce those outputs. In this PR I use a Python script, but a job can also be a command-line script, a Jupyter notebook (including one with a custom conda env), or in-line Python. (A sketch of a rule follows the screenshots below.)
  2. On the command line, ask Snakemake to produce a result file: snakemake -n data/results/base/2019/carbon_accounting (-n signals a dry run, so Snakemake won't actually produce anything).
  3. Snakemake tells you which rules it's running and why. Currently there's only one rule, but with more rules, Snakemake will detect dependencies between input and output files and decide which jobs to run based on updates to input data or code.
    [Screenshot: Snakemake dry-run output]

If we instead ask for data/results/small/2019/carbon_accounting, Snakemake will detect, based on the rule's input, that we need to use the downloaded 930 file instead of the outputs 930 file, replicating the logic currently in data_pipeline.py:
[Screenshot: Snakemake dry-run output for the small run]
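
For concreteness, here is a minimal sketch of what such a rule could look like. The wildcard names, input paths, and script name below are illustrative assumptions, not the actual contents of workflow/Snakefile in this PR:

```python
# workflow/Snakefile (sketch only; paths, wildcards, and script name are illustrative)

rule carbon_accounting:
    input:
        # Choose the 930 input based on the {run} wildcard, mirroring the
        # base-vs-small logic described above (hypothetical file names).
        eia930=lambda wc: (
            f"data/downloads/eia930/eia930_{wc.year}.csv"
            if wc.run == "small"
            else f"data/outputs/{wc.run}/{wc.year}/eia930_cleaned.csv"
        ),
        plant_data="data/results/{run}/{year}/plant_data.csv",
    output:
        "data/results/{run}/{year}/carbon_accounting",
    # The script reads the declared inputs, calculates consumed emissions,
    # and writes the declared output.
    script:
        "scripts/calculate_consumed_emissions.py"
```

Requesting a concrete file resolves the wildcards: snakemake -n data/results/base/2019/carbon_accounting dry-runs the rule with run=base and year=2019, while snakemake --cores 1 data/results/small/2019/carbon_accounting would actually run it (recent Snakemake versions require a --cores argument).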

Advantages of snakemake over data_pipeline.py

  • New rules (for example, for validation tasks) can be added without adding computational time to the main pipeline, since Snakemake will detect that they're not needed to compute a requested target like results/2019
  • Parts of the pipeline can be run separately without manually copying code into a Jupyter notebook
  • Intermediate outputs between jobs are captured as input and output files (this could also eventually support validation and/or manual inspection of those intermediate outputs)
  • Dependencies between steps are captured
  • The workflow is captured in the workflow folder, which will eventually make it easier to package src separately

Disadvantages

  • Snakemake will likely not be able to detect code changes affecting individual jobs, since most jobs depend on code in data_pipeline.py. You can tell Snakemake to consider only the timestamps of input and output files using --rerun-trigger mtime
  • Because of how Snakemake wildcards work, we'll need to put results and outputs in "results/base/{year}" if we also want to use the rules to generate "results/small/{year}"
  • Snakemake will not play nicely with skip_outputs (implemented in this PR as a config option; see config.yaml), since Snakemake deletes a job's output files before running it. skip_outputs should be less necessary, though, if Snakemake helps us avoid re-running the entire pipeline

@gailin-p (Collaborator, Author)

@miloknowles this may also be interesting to you -- this is a proof of concept for one option for moving away from a monolithic script.

@gailin-p changed the base branch from main to development on February 20, 2023
@miloknowles (Contributor)

This is awesome! Would love to see this implemented in OGE and used in other projects too.

@grgmiller (Collaborator)

This is really interesting! Thanks for looking into this. I agree that this sounds really exciting/promising.

A couple of reactions/questions:

Snakemake will likely not be able to detect code changes affecting individual jobs since most jobs depend on code in data_pipeline.py.

What would be an example of this? What is the definition of a "job"?

Because of how snakemake wildcards work, we'll need to put results and outputs in "results/base/{year}" if we also want to use the rules to generate "results/small/{year}"
Snakemake will not play nicely with skip_outputs (implemented in this PR as a config option, see config.yaml), since it deletes the output files for a job before running it. I think skip_outputs should be less necessary though if snakemake helps us avoid running the entire pipeline

I think with snakemake implemented, we could probably get rid of both the --small option and the --skip_outputs options for the reasons you mentioned.

Other questions

  1. What implications would implementing this have for other OGE users who want to run the OGE pipeline locally on their computer (vs us running on a server where we have the room to store all of the inputs and outputs)? Would they still be able to do so?
  2. What implications would this have for the readability of our code? Is there a new snakemake format/language that would have to be learned to understand what is going on?
  3. What would the data pipeline end up looking like? Would we be breaking off chunks into separate scripts?

@gailin-p (Collaborator, Author)

gailin-p commented Feb 23, 2023

Code changes and re-running

A job is one piece of the pipeline that we decide is conceptually independent and choose to separate into a Snakemake rule. In this PR, that's step 18 in the pipeline: calculating consumed emissions. The biggest practical difference from our current pipeline steps is that a job must read and write files (it can't pass in-memory outputs to, or receive them from, another job).
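
To sketch what that looks like in practice: when a rule uses Snakemake's script: directive, the script gets a snakemake object exposing the rule's declared inputs and outputs. The file names and structure below are hypothetical, not the code in this PR:

```python
# scripts/calculate_consumed_emissions.py (sketch; names are hypothetical)
import pandas as pd

# The `snakemake` object is injected by Snakemake when this script is run
# from a rule's `script:` directive; it is not imported.
eia930 = pd.read_csv(snakemake.input.eia930)
plant_data = pd.read_csv(snakemake.input.plant_data)

# ...call the existing consumed-emissions logic here...
consumed = plant_data  # placeholder for the real calculation

# A job must write its result to the file(s) declared in the rule's output.
consumed.to_csv(snakemake.output[0], index=False)
```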

Because each job reads its input data from files, if there have been no changes to the data in those files and no changes to the code that the job runs, we can be confident that the job would produce the same output file it created last time. Snakemake detects that (and does the same check for all the jobs that produced this job's inputs) and doesn't re-run it.

I'm not sure what algorithm Snakemake uses to detect code changes. I suspect it's just file timestamps, so all jobs that import data_cleaning.py would be triggered by any change to data_cleaning.py, instead of just the job using the function that changed. We could improve this by splitting data_cleaning.py into smaller files (which will be good for packaging anyway, and for helping people new to the project find the code they need).

Other questions

What implications would implementing this have for other OGE users who want to run the OGE pipeline locally on their computer (vs us running on a server where we have the room to store all of the inputs and outputs)? Would they still be able to do so?

Yes, this should be fine. Most computers have far more disk space than memory, so writing files currently held in memory to disk will be fine, though it would probably slow down the first run (later runs would be faster because many jobs would not have to re-run). Also, using Snakemake means that users won't have to write, e.g., plant-level files if they only want BA-level outputs, as long as we make those jobs independent.

What implications would this have for the readability of our code? Is there a new snakemake format/language that would have to be learned to understand what is going on?

Snakemake rules are written in a Python-based language. I think it's simpler than data_pipeline.py, because each job's dependencies are declared explicitly, while in data_pipeline.py you have to carefully track a variable through the code to understand where it's used and which steps change it. You can look at workflow/Snakefile in this PR to see what it looks like.

What would the data pipeline end up looking like? Would we be breaking off chunks into separate scripts?

The logic would be split between workflow/Snakefile, which would have the name of each job and its dependencies (this declaration is called a "rule"), and a set of scripts, each of which would hold the logic for a single job. For very simple jobs (load a file, run a single function, write a file), logic can be written directly in the Snakefile rule, but it's not good practice to hold a lot of logic in a rule.
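
For example, a trivial job could live entirely in the Snakefile as a run: block (the rule name, paths, and column below are made up for illustration):

```python
rule ba_totals:
    input:
        "data/results/{run}/{year}/plant_data.csv",
    output:
        "data/results/{run}/{year}/ba_totals.csv",
    run:
        # Inside a `run:` block, `input` and `output` refer to this rule's files.
        import pandas as pd
        df = pd.read_csv(input[0])
        df.groupby("ba_code").sum(numeric_only=True).to_csv(output[0])
```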
