Use Paramspace to automate the file naming scheme based on wildcards #40

Open
wants to merge 89 commits into
base: main


kelly-sovacool
Member

@kelly-sovacool kelly-sovacool commented Jan 28, 2023

Paramspace is a Snakemake class that builds a wildcard-based file naming scheme from a data frame of parameters. I implemented a custom way to create a Paramspace from the configfile, so users don't have to manually write a CSV file of their parameters, which would be largely redundant with the configfile anyway.

This implementation assumes that you want all-vs-all combinations of the values in each list in your configfile. Users who would rather build their paramspace some other way can bypass this custom config->paramspace implementation by providing paramspace_csv in their configfile.
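
To make the all-vs-all idea concrete, here is a minimal sketch. It is not the code in this PR: it reads "all-vs-all" as the Cartesian product of the listed values, and the function name `config_to_param_rows` and the example dataset names are hypothetical. It uses only the standard library; the comment at the end shows where pandas and Snakemake would come in.

```python
from itertools import product


def config_to_param_rows(config, param_keys):
    """Return one dict per combination of the values listed under param_keys.

    Hypothetical helper for illustration; scalars are treated as
    single-item lists so every key contributes exactly one column.
    """
    values = [
        config[key] if isinstance(config[key], list) else [config[key]]
        for key in param_keys
    ]
    return [dict(zip(param_keys, combo)) for combo in product(*values)]


# Example config; the dataset names are made up for this sketch.
config = {
    "dataset": ["data_a", "data_b"],
    "method": ["rf", "glmnet"],
    "kfold": 5,
}
rows = config_to_param_rows(config, ["dataset", "method", "kfold"])

# 2 datasets x 2 methods x 1 kfold value = 4 rows.
for row in rows:
    print(row)

# With pandas and Snakemake installed, the rows could then be wrapped as:
#   import pandas as pd
#   from snakemake.utils import Paramspace
#   paramspace = Paramspace(pd.DataFrame(rows))
```

Each row corresponds to one instance of the wildcard pattern, so adding a value to any list in the configfile multiplies the number of output files accordingly.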

Issues

Change(s) made

  • Now use a Paramspace to define the wildcard pattern in the run_ml rule.
  • New functions in workflow/scripts/functions.py:
    • get_paramspace_from_config() - takes a config dictionary and returns a Paramspace.
    • pattern_drop_wildcard() - returns the paramspace's wildcard pattern with the given wildcard removed. Needed by rule combine_hp_performance.
    • pattern_tame_wildcard() - returns the paramspace's wildcard pattern with every wildcard except the given one escaped with extra curly braces. Needed by rule combine_hp_performance.
    • instances_drop_wildcard() - returns a list of all unique wildcard instances from the paramspace with the given wildcard removed. Needed by rule plot_hp_performance.
    • set_default() - helper function to get a value from the config file if it exists, or return a default value. Reduces repetitive code when setting variables from the config.
    • I wrote unit tests for the above functions in workflow/scripts/test_functions.py and set up GitHub Actions to run them.
  • Move code related to parsing the configfile to rules/config.smk.
  • paramspace_csv is a new key in the configfile. If it exists and is not empty, the paramspace will be created by parsing this CSV file. Effectively, this is a way for users to bypass the custom config->paramspace implementation. If this key doesn't exist or is empty, the parameters will include all keys listed in the configfile except those listed in exclude_param_keys.
  • exclude_param_keys is a new key in the configfile that lists all configfile keys that should be excluded from the paramspace. In the default config, every key except dataset, method, and kfold is excluded. This way, a new key added to the configfile is automatically included in the paramspace unless the user intentionally excludes it by adding it to the exclude_param_keys list.
  • Note that the pandas Python package is now a dependency that must be installed alongside snakemake.
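
The helpers listed above can be illustrated with plain string and dict operations. The sketch below is an approximation under stated assumptions, not the code in workflow/scripts/functions.py: the real functions take a Paramspace, whereas these take a pattern string or a list of row dicts; the `name~{name}` components joined by `/` mimic Paramspace's default pattern style; `param_keys_from_config` and the `ncores` key are hypothetical.

```python
import re


def set_default(config, key, default):
    """Return config[key] if the key is present, otherwise the default."""
    return config[key] if key in config else default


def param_keys_from_config(config):
    """Hypothetical helper: keys that remain after applying exclude_param_keys."""
    excluded = set(set_default(config, "exclude_param_keys", []))
    # Assumption: these bookkeeping keys are never parameters themselves.
    excluded |= {"exclude_param_keys", "paramspace_csv"}
    return [key for key in config if key not in excluded]


def pattern_drop_wildcard(pattern, wildcard, sep="/"):
    """Remove the pattern component that contains {wildcard}."""
    token = "{" + wildcard + "}"
    return sep.join(part for part in pattern.split(sep) if token not in part)


def pattern_tame_wildcard(pattern, wildcard):
    """Double the braces of every wildcard except the given one."""
    def escape(match):
        name = match.group(1)
        return match.group(0) if name == wildcard else "{{" + name + "}}"
    return re.sub(r"\{(\w+)\}", escape, pattern)


def instances_drop_wildcard(rows, wildcard, sep="/"):
    """List unique instances built from rows, leaving out one wildcard."""
    instances = []
    for row in rows:
        inst = sep.join(f"{k}~{v}" for k, v in row.items() if k != wildcard)
        if inst not in instances:
            instances.append(inst)
    return instances


pattern = "dataset~{dataset}/method~{method}/kfold~{kfold}"
dropped = pattern_drop_wildcard(pattern, "method")
tamed = pattern_tame_wildcard(pattern, "method")
# Filling only the un-escaped wildcard leaves the others as single-brace wildcards.
filled = tamed.format(method="rf")

config = {"dataset": ["data_a"], "method": ["rf", "glmnet"], "kfold": 5,
          "ncores": 4, "exclude_param_keys": ["ncores"]}
keys = param_keys_from_config(config)
```

Doubling the braces is what lets a rule fill in one wildcard while leaving the others for Snakemake to resolve later, which is why combine_hp_performance needs the tamed pattern.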

Checklist

(Strike through any points that are not applicable.)

  • Update the docs if there are any API changes (README.md, config/README.md, & quick-start.md).
  • Re-run the pipeline on an HPC and update the example report.
  • The checks succeed on your most recent commit. This is always required before the PR can be merged.

kelly-sovacool and others added 30 commits January 18, 2023 17:13
Can't have hyphens.
Better to use hyphens to separate params.
to give users control over paramspace wildcards
Also fix instances_drop_wildcard() so it returns a unique set
Have to use output filepath, not rules.output,
because the plot is blank when find_feat_imp is False
Otherwise, will get a ModuleNotFoundError when deploying
this module with snakedeploy
Just use all columns that match
@kelly-sovacool
Member Author

kelly-sovacool commented Jan 30, 2023

@sklucas Let me know your thoughts on this! In some ways it makes the workflow harder to understand, but the goal is to make execution more flexible for different use cases. We can iterate on this to find a balance that keeps paramspace-based execution flexible without sacrificing too much of the workflow's readability.

kelly-sovacool and others added 25 commits January 30, 2023 17:57
Development

Successfully merging this pull request may close these issues.

Use Paramspace to automate file naming scheme based on wildcards
1 participant