Gailinpease/snakemake proof of concept #283
base: development
Conversation
@miloknowles this may also be interesting to you -- this is a proof of concept for one option for moving away from a monolithic script.
This is awesome! Would love to see this implemented in OGE and used in other projects too.
This is really interesting! Thanks for looking into this. I agree that this sounds really exciting/promising. A couple of reactions/questions:
What would be an example of this? What is the definition of a "job"?
I think with snakemake implemented, we could probably get rid of both the --small and --skip_outputs options for the reasons you mentioned.

Other questions
Code changes and re-running

A job is one piece of the pipeline that we decide is conceptually independent and choose to separate into a snakemake rule. In this PR, that is step 18 in the pipeline: calculating consumed emissions. The biggest practical difference from our current pipeline steps is that a job must read and write files (it can't pass in-memory outputs to or receive them from another job). Because each job reads its input data from files, if there have been no changes to the data in those files and no changes to the code that the job runs, we can be confident that the job would produce the same output file that it created last time. Snakemake detects that (and also does the same check for all the jobs that produced dependencies of this job) and doesn't re-run. I'm not sure what algorithm snakemake uses to detect code changes. I suspect it's just file timestamps, so all jobs that import data_cleaning.py would be triggered by any change to data_cleaning.py, instead of just the job using the function that changed. We could improve this by splitting data_cleaning.py into smaller files (which will be good for packaging anyway, and for helping people new to the project find the code they need).

Other questions
Yes, this should be fine. Most computers have far more disk space than memory, so writing files currently held in memory to disk will be fine, though it would probably slow down the first run (later runs would be faster because many jobs would not have to re-run). Also, using snakemake means that users won't have to write, e.g., plant-level files if they only want BA-level outputs, as long as we make those jobs independent.
Snakemake is a python-based language. I think it's simpler than data_pipeline.py, because it declares dependencies for each job, while in data_pipeline.py you have to carefully track a variable through the code to understand where it's used and which steps change it. You can look at workflow/Snakefile in this PR to see what it looks like.
The logic would be split between workflow/Snakefile, which would have the name of each job and its dependencies (this declaration is called a "rule"), and a set of scripts, each of which would hold the logic for a single job. For very simple jobs (load a file, run a single function, write a file), logic can be written directly in the Snakefile rule, but it's not good practice to hold a lot of logic in a rule.
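For illustration, a rule pairing a job's declared inputs and outputs with a script might look roughly like the sketch below. The paths, rule name, and script name here are hypothetical, not necessarily what this PR uses -- see workflow/Snakefile in the PR for the real thing.

```snakemake
# Hypothetical sketch of a rule for the consumed-emissions job.
# The rule declares the job's file inputs and outputs, and delegates
# the actual logic to a separate script.
rule consumed_emissions:
    input:
        "data/outputs/{run}/{year}/eia930.csv",
    output:
        "data/results/{run}/{year}/carbon_accounting",
    script:
        "scripts/consumed_emissions.py"
```

Inside the script, snakemake exposes the declared paths via the `snakemake.input` and `snakemake.output` objects, so the script itself stays free of path logic.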
This PR demonstrates how we might use Snakemake as a workflow manager by implementing a snakemake rule for the final step in the pipeline (calculating consumed emissions).
Why snakemake?
How does it work?
snakemake -n data/results/base/2019/carbon_accounting
(-n signals a dry run, so snakemake won't actually produce anything.) If we instead ask for data/results/small/2019/carbon_accounting, snakemake will detect, based on the input declared for the rule, that we need to use the downloaded 930 file instead of the outputs 930 file, replicating the logic currently in data_pipeline.py.

Advantages of snakemake over data_pipeline.py

- You can re-run only part of the pipeline by asking for a partial target, results/2019, for example
- Each rule declares its input and output files (this could also eventually support validation and/or manual inspection of intermediate output files)
- Pipeline logic moves into the workflow folder, which will eventually support separately packaging src
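The base-vs-small input selection described above is the kind of thing snakemake handles with an input function: the rule's input is computed from the wildcards in the requested output path. A hypothetical sketch (the file paths and names here are made up for illustration, not taken from this PR):

```snakemake
def eia930_file(wildcards):
    # Hypothetical: a "small" run reads the raw downloaded 930 file,
    # any other run reads the cleaned outputs 930 file.
    if wildcards.run == "small":
        return f"data/downloads/eia930/{wildcards.year}.csv"
    return f"data/outputs/{wildcards.run}/{wildcards.year}/eia930.csv"

rule carbon_accounting:
    input:
        eia930_file,
    output:
        "data/results/{run}/{year}/carbon_accounting",
    script:
        "scripts/carbon_accounting.py"
```

Requesting data/results/small/2019/carbon_accounting binds wildcards.run to "small", so the input function picks the downloaded file; any other run value picks the outputs file.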
Disadvantages

- --rerun-trigger mtime
- Not compatible with skip_outputs (implemented in this PR as a config option, see config.yaml), since it deletes the output files for a job before running it. I think skip_outputs should be less necessary though if snakemake helps us avoid re-running the entire pipeline